Art and Science of De-duping Results

This post was written by Darcy Katzman on August 27, 2010
Posted Under: Features, Federated Search

One of the critical problems in federated searching is de-duplication of results. Many sources contain the same journal articles and, clearly, presenting the same result multiple times isn't useful to users. To solve this thorny problem, Deep Web Technologies has taken a flexible and configurable approach to de-duplication. The Explorit application de-dupes on multiple fields to ensure that users don't see duplicate results across multiple sources. The application uses conditional logic that compares various fields to determine whether two results should be considered duplicates.

The de-duplication mechanism can compare multiple fields combined with boolean logic. This means that various fields can be matched using boolean operators, such as (field A OR field B) or (field C AND field D). If either condition is true, the result is determined to be a duplicate and removed according to the source de-dupe order. Additional fields can be added to the mix to improve accuracy, for example (field A AND field N) OR (field B AND field D).
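
To make this concrete, here is a rough Python sketch of how such a compound condition might be expressed. The field names, the normalization step, and the function names are illustrative assumptions, not Explorit's actual configuration:

    def normalize(value):
        # Lowercase and trim whitespace so trivial differences don't block a match.
        return (value or "").strip().lower()

    def fields_match(a, b, fields):
        # True when every named field matches between results a and b (the AND case).
        return all(normalize(a.get(f)) == normalize(b.get(f)) for f in fields)

    def is_duplicate(a, b):
        # One example of a compound condition: (title AND date) OR (doi).
        return fields_match(a, b, ["title", "date"]) or fields_match(a, b, ["doi"])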

De-duplication order specifies which sources take priority over others, and customers can control this order for their application. When a result from a source lower on the de-dupe list is determined to be a duplicate, it is removed from the result list. You could look at de-duplication order this way: you have source A, source B, and source C in your federated search application. Source A has a de-dupe order of 1 (meaning its results have the highest priority), source B has a de-dupe order of 10, and source C has a de-dupe order of 5. If the same result appears in both sources B and C, the result from source C will be displayed. If that result also appears in source A, only source A's copy will be displayed.
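
Here is a small Python sketch of how a priority-driven pass over merged results might work, using the example ordering above. The order table, the result structure, and the duplicate test are assumptions for illustration:

    DEDUPE_ORDER = {"source_a": 1, "source_c": 5, "source_b": 10}  # lower = higher priority

    def dedupe(results, is_duplicate):
        # Visit higher-priority sources first so their copies survive;
        # later duplicates from lower-priority sources are dropped.
        ranked = sorted(results, key=lambda r: DEDUPE_ORDER.get(r["source"], float("inf")))
        kept = []
        for result in ranked:
            if not any(is_duplicate(result, k) for k in kept):
                kept.append(result)
        return kept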

We have found the following de-duplication approach most effective for our federated search applications. The application first checks the full-text URL of each result. If two results have the same full-text URL, they are assumed to be duplicates, and the application displays only the result from the higher-priority source (the one with the lower de-dupe order number). Next, the application checks the combination of article title and publication date. If both fields match, the results are considered duplicates and the lower-priority result is removed.
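
In code, the two-stage test described above might look like the following sketch. The field names (fulltext_url, title, pub_date) are assumptions; the actual fields depend on how each source is configured:

    def same(a, b, field):
        # Case-insensitive comparison of one field, tolerating missing values.
        va, vb = a.get(field), b.get(field)
        return va is not None and vb is not None and va.strip().lower() == vb.strip().lower()

    def is_duplicate(a, b):
        # Stage 1: identical full-text URLs mean a duplicate.
        if same(a, b, "fulltext_url"):
            return True
        # Stage 2: otherwise, matching title plus publication date.
        return same(a, b, "title") and same(a, b, "pub_date")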

Finding the right balance of fields for de-duplication in a federated search application can be difficult, but Deep Web Technologies has the experience to know which fields work best for your sources and your search needs.
