Paper
6 April 2000 Customer and household matching: resolving entity identity in data warehouses
Donald J. Berndt, Ronald K. Satterfield
Author Affiliations +
Abstract
The data preparation and cleansing tasks necessary to ensure high quality data are among the most difficult challenges faced in data warehousing and data mining projects. The extraction of source data, transformation into new forms, and loading into a data warehouse environment are all time consuming tasks that can be supported by methodologies and tools. This paper focuses on the problem of record linkage or entity matching, tasks that can be very important in providing high quality data. Merging two or more large databases into a single integrated system is a difficult problem in many industries, especially in the wake of acquisitions. For example, managing customer lists can be challenging when duplicate entries, data entry problems, and changing information conspire to make data quality an elusive target. Common tasks with regard to customer lists include customer matching to reduce duplicate entries and household matching to group customers. These often O(n2) problems can consume significant resources, both in computing infrastructure and human oversight, and the goal of high accuracy in the final integrated database can be difficult to assure. This paper distinguishes between attribute corruption and entity corruption, discussing the various impacts on quality. A metajoin operator is proposed and used to organize past and current entity matching techniques. Finally, a logistic regression approach to implementing the metajoin operator is discussed and illustrated with an example. The metajoin can be used to determine whether two records match, don't match, or require further evaluation by human experts. Properly implemented, the metajoin operator could allow the integration of individual databases with greater accuracy and lower cost.
© (2000) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Donald J. Berndt and Ronald K. Satterfield "Customer and household matching: resolving entity identity in data warehouses", Proc. SPIE 4057, Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, (6 April 2000); https://doi.org/10.1117/12.381731
Lens.org Logo
CITATIONS
Cited by 4 scholarly publications.
Advertisement
Advertisement
RIGHTS & PERMISSIONS
Get copyright permission  Get copyright permission on Copyright Marketplace
KEYWORDS
Databases

Data modeling

Binary data

Data mining

System integration

Computer programming

Statistical analysis

RELATED CONTENT


Back to Top