Ranking: The Secret Sauce for Searching the Deep Web

This post was written by Darcy Katzman on July 17, 2009
Posted Under: Features, Federated Search

One of the most powerful features of Deep Web Technologies’ Explorit Federated Search is its ability to rank the results from the many collections that might be included in a federated search (a.k.a. deep web search).

This is useful for two reasons. First, it ranks results from sources that don’t rank their own. The ever-popular PubMed, a service of the U.S. National Library of Medicine and the National Institutes of Health, is an example of this: PubMed doesn’t rank its results. As a consequence, any search service that provides results from PubMed and does rank them, such as our MedNar.com, adds tremendous value.

PubMed Results in MedNar

In the image above, MedNar pulls an inordinate number of results from PubMed (1,000), precisely because PubMed doesn’t rank its results (note that PubMed reports 60,107 results for the search term “alternative treatments”). By including more results from PubMed in its federation, MedNar uses its ranking engine to help add value to the overall results provided by PubMed.

Second, ranking federated search results also distills hundreds — or perhaps thousands — of results into a prioritized list that is tremendously useful to a deep web researcher.

The questions are: what’s the secret sauce, and can one ranking method be better than another?

I’m about to indicate what the secret sauce is for Deep Web Technologies’ Explorit federated search. When comparing deep web search tools, you should really pay attention to ranking, but I’ll let you be the judge of whether one ranking method is better than another.

Deep Web Technologies’ Ranking Algorithm

Ranking is based on relevance, with the most relevant results becoming the highest-ranking results. We compute relevance by (1) creating root-words for the query terms and results, (2) conducting relevance weighting for a number of factors, and then (3) using our proprietary algorithms to assign a rank to each result in the list of results.

(1) Creating Root-Words: Stemming

Stemming is the process of converting words to their base (or root) forms. In the simplest case, it makes sure that a pluralized search term will find singular terms in the results, and vice versa. In English, this can be as simple as dropping “s” or “es” from words, but the process can become more complex. Consider “mouse/mice” and “person/people”.

The specific stemming algorithm we use is the Porter Stemming Algorithm.

For the most part, we do not need to stem search terms before submitting them to the collections we search.  Occasionally, we may need to explicitly indicate to a collection that we want to perform a stemmed search or an exact search.
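
To make the plural-to-singular case concrete, here is a minimal Python sketch. It is not the Porter Stemming Algorithm and not Explorit's implementation; the names `naive_stem` and `stems_match` and the suffix rules are assumptions chosen only to illustrate the simplest English plurals.

```python
def naive_stem(word: str) -> str:
    """Strip a few common English plural suffixes (illustrative only, not Porter)."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"          # therapies -> therapy
    if len(w) > 4 and w.endswith(("sses", "ches", "shes", "xes")):
        return w[:-2]                # classes -> class, boxes -> box
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # treatments -> treatment
    return w


def stems_match(query_term: str, result_term: str) -> bool:
    """True when both terms reduce to the same root under naive_stem."""
    return naive_stem(query_term) == naive_stem(result_term)
```

With these rules, `stems_match("treatment", "treatments")` is true, while `stems_match("mouse", "mice")` is false: irregular forms need more than suffix stripping.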

(2) Conducting Relevance Weighting

We analyze search term occurrence within a search result, and assign weights for different factors. We look for occurrence of exact terms and stem terms. We can assign relative weights to different results fields.  We can also assign higher weights to results from a more important collection as well as assign a higher weight to more recent results.
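
One way to picture how these weights might combine is the Python sketch below. Everything in it is an assumption for illustration: the weight values, the idea that an exact match earns more than a stem-only match, the half-life form of the recency boost, and the function names. None of it is taken from Explorit.

```python
from datetime import date

# Hypothetical weights; real values would be tuned per deployment.
FIELD_WEIGHTS = {"title": 3.0, "author": 1.5, "snippet": 1.0}
EXACT_WEIGHT = 1.0   # credit for an exact term match
STEM_WEIGHT = 0.5    # smaller credit when only the stem matches


def field_score(exact_hits: int, stem_hits: int, field: str) -> float:
    """Weight exact and stem-only hits by the field they occurred in."""
    base = exact_hits * EXACT_WEIGHT + stem_hits * STEM_WEIGHT
    return base * FIELD_WEIGHTS.get(field, 1.0)


def recency_boost(published: date, today: date, half_life_years: float = 5.0) -> float:
    """Return a multiplier in (0, 1] that halves every half_life_years."""
    age_years = max((today - published).days, 0) / 365.25
    return 0.5 ** (age_years / half_life_years)


def weighted_score(hits: dict, collection_weight: float,
                   published: date, today: date) -> float:
    """Combine per-field hit counts with collection and recency weights.

    hits maps a field name to an (exact_hits, stem_hits) tuple.
    """
    total = sum(field_score(e, s, f) for f, (e, s) in hits.items())
    return total * collection_weight * recency_boost(published, today)
```

For example, one exact title hit plus two stem-only snippet hits, from a collection weighted 2.0 and published today, scores (3.0 + 1.0) × 2.0 × 1.0 = 8.0 under these made-up weights.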

We also consider:

  • Search Term Position – We examine where search terms appear within particular fields (e.g., title, author, snippet), giving special consideration to whether a search term occupies the first word position, the last word position, or a position relative to either.
  • Search Term Density – We find significance in how often search terms appear within fields (both individual fields and the full record). Aside from counting the number of occurrences of search terms within fields, we consider the ratio of search term length to result field length. For example, a one-word title that is the same as the search term would be highly relevant.
  • Search Term Proximity – We consider how close search terms occur relative to one another. When evaluating this, we look at the number of search terms within the query expression and the distance between recurring search terms. In returned results, this ratio, in conjunction with the length of the fields, can be significant.
  • Search Term Ordinality – If search terms appear in the same order as in the search expression, this can be significant and is given greater weight than when they appear out of order. Likewise, multiple occurrences of ordinality are important.
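
The four factors above can each be reduced to a small measurement on a tokenized field. The Python sketch below shows one plausible way to do that; the function names and the specific formulas (for instance, scoring proximity as 1 / (1 + gap)) are assumptions for illustration and are not the formulas Explorit uses.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def position_score(term: str, words: list[str]) -> float:
    """1.0 at the first or last word, falling with distance from the nearer edge."""
    if term not in words:
        return 0.0
    first = words.index(term)
    last = len(words) - 1 - words[::-1].index(term)
    edge_distance = min(first, len(words) - 1 - last)
    return 1.0 / (1 + edge_distance)


def density_score(terms: list[str], words: list[str]) -> float:
    """Fraction of the field's words that are search terms."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in terms) / len(words)


def proximity_score(terms: list[str], words: list[str]) -> float:
    """1 / (1 + smallest gap between two different search terms), else 0.0."""
    positions = [(i, w) for i, w in enumerate(words) if w in terms]
    gaps = [j - i - 1
            for (i, a), (j, b) in zip(positions, positions[1:]) if a != b]
    return 1.0 / (1 + min(gaps)) if gaps else 0.0


def ordinality_count(terms: list[str], words: list[str]) -> int:
    """Count places where the terms appear consecutively in query order."""
    n = len(terms)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == terms)
```

For the title "alternative treatments for chronic pain" and the query terms `["alternative", "treatments"]`, this sketch gives a position score of 1.0 for "alternative" (first word), a density of 0.4 (two of five words), a proximity score of 1.0 (the terms are adjacent), and an ordinality count of 1. A one-word title equal to the search term has a density of 1.0, matching the "highly relevant" case described above.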

(3) Proprietary Algorithms

Once we’ve analyzed the exact search terms and stemmed search terms against the factors above and assigned weights, we use our proprietary algorithms to assign an actual rank.

These algorithms operate on the Boolean operators AND, OR and NOT. The search query expression is evaluated from left to right. Exact phrases (contained within double quotation marks) are not stemmed!
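
The proprietary algorithms themselves are not shown here, but the three rules just stated (Boolean operators, left-to-right evaluation, unstemmed quoted phrases) can be illustrated with a small Python sketch. The function names, the stand-in stemmer, the treatment of NOT as negating the operand that follows it, and the default of AND between adjacent operands are all assumptions made for this example.

```python
import re


def stem(word: str) -> str:
    """Stand-in stemmer for this sketch: lowercase and drop one trailing 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def matches(operand: str, text: str) -> bool:
    """Quoted phrases must match exactly; bare terms match on their stems."""
    words = re.findall(r"\w+", text.lower())
    if operand.startswith('"'):
        phrase = operand.strip('"').lower().split()
        n = len(phrase)
        return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))
    return stem(operand) in {stem(w) for w in words}


def evaluate(query: str, text: str) -> bool:
    """Evaluate AND / OR / NOT strictly left to right, with no precedence."""
    tokens = re.findall(r'"[^"]+"|\S+', query)
    result, op, negate = None, "AND", False
    for token in tokens:
        if token == "NOT":
            negate = True            # negates the next operand
            continue
        if token in ("AND", "OR"):
            op = token
            continue
        hit = matches(token, text) != negate
        if result is None:
            result = hit
        elif op == "OR":
            result = result or hit
        else:
            result = result and hit
        op, negate = "AND", False    # adjacent operands default to AND
    return bool(result)
```

Against the text "Alternative treatments for chronic pain", `treatment AND pain` matches because the bare term is stemmed, while `"treatment" AND pain` does not because the quoted phrase is compared exactly. Left-to-right evaluation also means `pain OR surgery AND aspirin` is read as `(pain OR surgery) AND aspirin`, which is false for this text.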

If a date range is specified, the date is used as a constraining term, provided that a date is supplied in a result. If a date is not supplied in a result, the relevance for that result is assumed zero (i.e. not ranked). Note that such results may still show in the results list.
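
As a rough illustration of the date rule, the Python sketch below keeps dated results that fall inside the range, drops dated results that fall outside it, and keeps undated results with a relevance of zero so they sort last. Dropping out-of-range results is this example's reading of "constraining term", and the record layout and the name `rank_with_date_range` are likewise assumptions.

```python
from datetime import date


def rank_with_date_range(results: list[dict], start: date, end: date) -> list[dict]:
    """Apply a date-range constraint, then sort by relevance (highest first).

    Each result is a dict with "relevance" (float) and "date" (date or None).
    """
    kept = []
    for r in results:
        d = r.get("date")
        if d is None:
            # No date supplied: still listed, but treated as unranked.
            kept.append({**r, "relevance": 0.0})
        elif start <= d <= end:
            kept.append(r)
        # Dated results outside the range are dropped (assumed behavior).
    return sorted(kept, key=lambda r: r["relevance"], reverse=True)
```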

Finally, stop words are words considered irrelevant for searching purposes. We don’t evaluate them. The current list of stop words is:

a, about, again, all, almost, also, although, always, among, an, another, any, are, as, at, be, because, been, before, being, between, both, but, by, can, could, did, do, does, done, due, during, each, either, enough, especially, etc, for, found, from, further, had, has, have, having, here, how, however, i, if, in, into, is, it, its, itself, just, made, mainly, make, may, might, most, mostly, must, nearly, neither, no, nor, obtained, of, often, on, our, overall, perhaps, quite, rather, really, regarding, seem, seen, several, should, show, showed, shown, shows, significantly, since, so, some, such, than, that, the, their, theirs, them, then, there, therefore, these, they, this, those, through, thus, to, upon, use, used, using, various, very, was, we, were, what, when, where, which, while, who, why, with, within, without, and would.
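
Filtering stop words before relevance analysis is straightforward. The Python sketch below uses only a handful of words from the list above to stay short; the name `significant_terms` and the whitespace tokenization are assumptions for illustration.

```python
# Abbreviated subset of the stop word list above, for illustration only.
STOP_WORDS = {"a", "about", "for", "in", "of", "the", "to", "with"}


def significant_terms(query: str) -> list[str]:
    """Return the lowercased query words that are not stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]
```

Here `significant_terms("The treatment of pain in children")` keeps only "treatment", "pain" and "children".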

In Summary

Deep Web Technologies uses a strong ranking algorithm that considers a number of factors and assigns relative weights to the relationship between the search terms and the results. To some extent, weights can be modified according to client preferences, and in all cases, ranking can add tremendous value to a federated (or deep web) search.
