Higher Degree by Research Candidate
School of Mathematical Sciences
Faculty of Sciences, Engineering and Technology
With the emergence of large volumes of data, there is growing interest in collecting and analyzing it. Large volumes of digital data contain information about us and our surroundings, gathered by organizations and governments. The data ranges from financial records such as shopping transactions to health records, facial recognition data, GPS location data, phone call records, emails, tweets, blog posts and more. Diverse data sources, differing formats and the streaming nature of the data have created unique security risks, while the quality of the gathered data has become a barrier to accurate analytics. Erroneous, missing and outdated values reduce data quality and lead to low-accuracy data mining and analysis. Accurate data analytics is important in many ways, such as giving enterprises a competitive edge, fraud detection, government services, healthcare, business analytics, national security and almost every field we engage with in our day-to-day lives.
It has been recognized that integrating data from different sources improves data quality, leading to more accurate data mining and analysis. Data integration can be used to impute missing values, enrich data and identify conflicting data. Record linkage is one major challenge of data integration; it aims to identify all records that refer to the same real-world entities in two or more databases. Record linkage is also known as data cleaning, entity reconciliation or identification, and the merge/purge problem. It is regarded as a promising approach to quickly and accurately identifying the same entity across one or more data sources. It comes with three main challenges: linkage quality, scalability, and privacy and confidentiality. Approximate matching and accurate classification are required to maintain linkage quality. Because detailed comparisons require computationally expensive similarity functions, scalability, and in particular performance, has been a challenge in record linkage. When linking data sources across organizations, the privacy of the information needs to be carefully considered; privacy and confidentiality therefore pose a major challenge in the record linkage process.
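As a minimal illustration of approximate matching with threshold-based classification, the sketch below compares records using character-bigram Jaccard similarity, one of many possible similarity functions. The record fields, identifiers and threshold are hypothetical choices for illustration, not the methods used in this research.

```python
def bigrams(s):
    # Character bigrams of a lowercased string, as a set.
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    # Jaccard similarity of the two bigram sets, in [0, 1].
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / len(A | B) if A | B else 1.0

def link(records_a, records_b, threshold=0.7):
    # Naive all-pairs linkage: classify a pair as a match when its
    # similarity reaches the threshold. This quadratic comparison is
    # exactly why scalability is a challenge in practice.
    matches = []
    for ra in records_a:
        for rb in records_b:
            sim = jaccard(ra["name"], rb["name"])
            if sim >= threshold:
                matches.append((ra["id"], rb["id"], round(sim, 2)))
    return matches

db_a = [{"id": "a1", "name": "Jon Smith"}, {"id": "a2", "name": "Mary Jones"}]
db_b = [{"id": "b1", "name": "John Smith"}, {"id": "b2", "name": "Peter Brown"}]
print(link(db_a, db_b))  # → [('a1', 'b1', 0.7)]
```

Note that "Jon Smith" and "John Smith" are classified as the same entity despite not being exact string matches; this tolerance to typographical variation is what approximate matching provides.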
Privacy and confidentiality are serious issues to consider when working with personal information across different organizations and databases. Since individuals value their privacy, organizations have an obligation to protect the data they handle. Some databases contain highly sensitive information about individuals or organizations in fields such as healthcare, finance and defense, which needs to be kept secure. Privacy and confidentiality in such contexts protect an individual's personal autonomy, respect, trust, reputation and social boundaries, and identifiable data exposed during a record linkage process would constitute a security breach. Privacy-Preserving Record Linkage (PPRL) therefore aims to identify records that correspond to the same entity across several databases without compromising the privacy and confidentiality of those entities.
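One widely used PPRL building block is Bloom-filter encoding: each party hashes the bigrams of its quasi-identifiers into a bit vector locally and exchanges only the bit vectors, so similarity can be estimated without revealing raw values. The sketch below is a simplified, assumed illustration; a real deployment would use keyed hashes (HMACs), much larger filters and parameters agreed between the parties.

```python
import hashlib

BITS = 64        # toy filter size; real deployments use far more bits
NUM_HASHES = 2   # hash functions applied to each bigram

def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value, secret="shared-key"):
    # Encode a value into a Bloom filter (stored here as an int bitmask).
    # Each party encodes locally with the agreed secret; only the bit
    # vectors are exchanged, never the raw identifiers.
    bf = 0
    for gram in bigrams(value):
        for k in range(NUM_HASHES):
            h = hashlib.sha256(f"{secret}|{k}|{gram}".encode()).digest()
            bf |= 1 << (int.from_bytes(h[:4], "big") % BITS)
    return bf

def dice(bf1, bf2):
    # Dice coefficient on the bit vectors approximates the
    # similarity of the underlying strings.
    inter = bin(bf1 & bf2).count("1")
    return 2 * inter / (bin(bf1).count("1") + bin(bf2).count("1"))

enc_a = bloom_encode("John Smith")
enc_b = bloom_encode("Jon Smith")
enc_c = bloom_encode("Mary Jones")
print(dice(enc_a, enc_b) > dice(enc_a, enc_c))  # similar names score higher
```

The encoding is lossy by design: the linking party sees only which bit positions overlap, which is enough to approximate similarity but does not directly reveal the names being compared.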
Data integration combines technical and business processes to turn data from different sources into meaningful information. Most existing PPRL techniques have been developed for static data, which is self-contained, enclosed and does not change over time. But dynamic or streaming data, delivered in real time, is also critical in industries such as banking, telecommunications and defense. Dynamic data changes continuously: data collected from sensors, web feeds and stock price changes arrives constantly and in large volumes. This data is distributed over different environments and carries the dimensions of volume, variety and velocity, which magnify the security risks attached to its streaming nature. The faster the data streams, the harder it is to conduct linkage and analysis. This has opened many new possibilities for research and industrial applications in privacy-aware record linkage for data integration. Thus, the dynamic nature of data that streams into an organization in real time calls for adaptive record linkage systems.
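To give a sense of what an adaptive, streaming-oriented linker might look like, the sketch below links records incrementally as they arrive, using a bigram inverted index as a blocking structure so each new record is compared only against plausible candidates rather than the whole history. All names, thresholds and structures here are hypothetical illustrations, not the architecture developed in this research.

```python
from collections import defaultdict

def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

class StreamLinker:
    """Links records as they arrive. A bigram inverted index acts as a
    blocking structure: a new record is only compared against previously
    seen records that share at least one bigram with it."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.index = defaultdict(set)   # bigram -> ids of records containing it
        self.records = {}               # id -> name

    def add(self, rec_id, name):
        grams = bigrams(name)
        # Candidate set: records sharing at least one bigram.
        candidates = set().union(*(self.index[g] for g in grams)) if grams else set()
        matches = []
        for cid in candidates:
            other = bigrams(self.records[cid])
            sim = len(grams & other) / len(grams | other)
            if sim >= self.threshold:
                matches.append((cid, round(sim, 2)))
        # Index the new record so future arrivals can match against it.
        self.records[rec_id] = name
        for g in grams:
            self.index[g].add(rec_id)
        return matches

linker = StreamLinker()
print(linker.add("r1", "Jon Smith"))   # → [] (nothing seen yet)
print(linker.add("r2", "John Smith"))  # → [('r1', 0.7)]
```

The index grows as records stream in, which is where the real difficulty lies: an adaptive system must bound the state it keeps while still catching matches against older data.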
Publications
Herath, S., Roughan, M., & Glonek, G. (2021). Generating name-like vectors for testing large-scale entity resolution. IEEE Access, 9, 145288-145300.
Herath, S., Roughan, M., & Glonek, G. (n.d.). High Performance Out-of-sample Embedding Techniques for Multidimensional
Herath, S., Roughan, M., & Glonek, G. (n.d.). Em-K Indexing for Approximate Query Matching in Large-scale ER.
Herath, S., Roughan, M., & Glonek, G. (n.d.). Simulating Name-like Vectors for Testing Large-scale Entity Resolution.
Herath, S., Roughan, M., & Glonek, G. (2020). Landmarks-based blocking method for large-scale entity resolution. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics, DSAA 2020 (pp. 773-774). Online: IEEE.