Modern Data Science of Record Linkage and Deduplication – Data Integrity and Quality

Today’s paper is about the topics of Record Linkage, Data Deduplication and Data Integrity. In our age of Data, where the amount of it being generated keeps growing, it might be important to realize that much of that data is not always good  and practical. What’s more with increasing amounts of data it also increases the chance of errors due to complexity, ambiguity and excessive dimensions in that data. So it is also growing in importance the study of the best methods of how to deal with problems in Data.

On the other hand our so-called precious data must be stored and maintained in some storage or record system, for late retrieval and possible linkage with other data, that may be new data or data residing somewhere else in external databases. This is common in large organizations, like larger enterprises and public services, that need to retrieve and move data around in their different databases and storage systems, but want it quickly and efficiently, minimizing costs and latency. This is where the topics of efficient Record Linkage, Data Deduplication and Data Integrity are of critical importance.

The paper that I show today here concerns the use of Probabilistic methods of Record Linkage  when linking two databases. In this process of linkage of two databases it is common that the number of possible links within the systems to grow rapidly thereby increasing excessively the complexity and dimensions of the databases, calling  the need for practical methods to help in the computing process. The paper describes three such methods, namely indexing, blocking and filtering, but the author proposes a new model to account for particular forms of indexing and filtering, after reviewing the implications of indexing, filtering and blocking under the historic popular paradigm called Fellegi-Sunter framework for inferring linkage structure:

Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering




When linking two databases (or deduplicating a single database) the number of possible links grows rapidly in the size of the databases under consideration, and in most applications it is necessary to first reduce the number of record pairs that will be compared. Spurred by practical considerations, a range of indexing or blocking methods have been developed for this task. However, methods for inferring linkage structure that account for indexing, blocking, and filtering steps have not seen commensurate development. I review the implications of indexing, blocking and filtering, focusing primarily on the popular Fellegi-Sunter framework and proposing a new model to account for particular forms of indexing and filtering. 

Definitions and framework


Following I sketch the basic definitions and the frameworks outlined by the paper and what subsequently will be proposed as a new model for probabilistic record linkage of databases:

Probabilistic record linkage is the process of merging two or more databases which lack unique identifiers. The related task of detecting duplicate records in a single file can be cast as linking a file against itself, ignoring redundant comparisons. Initially developed by Newcombe et al. (1959); Newcombe and Kennedy (1962), probabilistic record linkage was mathematically formalized by Fellegi and Sunter (1969). In the ensuing decades these methods have been widely deployed, and variations on the FellegiSunter framework still form the backbone of most applications of probabilistic record linkage. (…)

Even if the comparisons themselves are relatively inexpensive to compute and the files are of moderate size, this step can be computationally prohibitive. In most practical applications of probabilistic record linkage it is necessary to eliminate a large number of record pairs from consideration, without making a full comparison of the two records, a process known as indexing or blocking. Storage and other considerations often lead to an additional filtering step, where record pairs that are extremely unlikely to be true matches are discarded after a complete or (nearly complete) comparison has been made. As the size of the files under consideration has increased, the development of strategies for indexing, blocking, and filtering has outstripped the capacity of models to account for them. Using the popular Fellegi-Sunter framework as a guide, this paper discusses the implications of these strategies on subsequent modeling and inference of linkage structure. We propose extensions that provide more relevant and accurate error estimates.

And the Fellegi-Sunter Framework:

The original method for probabilistic record linkage, which is still widely in use, was introduced by Fellegi and Sunter (1969) who formalized earlier developments by Newcombe et al. (1959); Newcombe and Kennedy (1962). See Herzog et al. (2007) for extensive review of the basic framework and extensions

A set of fields are available in both files A and B and may be used to compare records. Often these comparisons take the form of a series of binary variables, which may indicate direct matches on fields (do records a and b agree on gender?), sufficient agreement (is the similarity score between the two name fields greater than some threshold?), or other derived comparisons (do a and b match on month and year of birth?). Let γab = (γab(1), . . . γab(q)) be a binary vector collecting the comparisons between records a and b, taking values in Γ = {0, 1} q . The model for record linkage presented in Fellegi and Sunter (1969) is as follows:

Pr[(a, b) ∈ M] = pM (1)

Pr[γab = g | (a, b) ∈ M] = πg|M (2)

Pr[γab = g | (a, b) ∈ U] = πg|U (3)

Pr[γab = g] = pMπg|M + (1 − pM)πg|U . (4)

The probability distribution of the observed comparison vectors is a two component mixture model, where one component corresponds to true matches and the other to true non-matches. The components πg|M and πg|U are often referred to as “m−probabilities” and “u−probabilities,” respectively (Winkler, 2006b).


Probabilistic Record Linkage after Indexing

Whether implemented by blocking, filtering, or some other process, indexing creates a biased sample of record pairs by design. For model-based procedures (including Fellegi and Sunter (1969)) indexing therefore changes the interpretation of the recovered parameters. Researchers have long noted this fact (beginning at least with Fellegi and Sunter (1969) themselves; related comments appear in Jaro (1989) and Winkler (2006b)). However, the effect of indexing on subsequent modeling of record pairs and inference of linkage structure is often ignored. Using the Fellegi-Sunter framework as a guide we describe some of the implications in a simple special case, comparing g the effects of traditional blocking and indexing by disjunction (including filtering).



Cutting this post short, and referring to further details in the paper, conclusion and main results were as follows:

We have described the effects of indexing, blocking, and filtering on subsequent inference of record linkage structure within the Fellegi-Sunter framework. Explicitly modeling the effects of indexing, and especially filtering, clarifies the interpretation of error rates and enhances their estimation (which also improves decision rules). The effects of filtering in particular will be the greatest when the files lack a small number of highly discriminative fields for use in indexing. This situation seems to be common in practice. Some of the impacts of indexing have been discussed in the literature, but modern applications of record linkage index, block or filter without making subsequent adjustments to the model for record pairs. 


We have focused on three simple but extremely common indexing methods: blocking, indexing by disjunctions, and filtering. A range of other techniques for indexing exist, including sophisticated approaches based on various hashing algorithms (Christen, 2012b; Steorts et al., 2014). These are perhaps best understood as fast approximations to filtering, and our discussion here applies more or less directly (especially if these indexing methods are followed by a subsequent filtering step to remove unlikely pairs that escape indexing).


The implications for supervised record linkage, which utilizes a set of known matching and non-matching pairs to predict unlabeled pairs, are also unclear. The biased sampling due to indexing and filtering may be largely irrelevant since the unlabeled record pairs come from a similarly biased sample. However, indexing concentrates the predictors on a subspace (e.g., Γι in the context of this paper). Perhaps this dimension-reducing effect could be exploited to enhance prediction; Ventura (2015)’s blend of random forests and hierarchical clustering involves some similar ideas in this direction.



Interesting topic in Data Science and Data Engineering.


Featured Image: Access Control in Database Management Systems


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s