A Federated Search Primer – Part II of III

This post was written by Darcy Katzman on February 20, 2009
Posted Under: Federated Search

This post is re-published with permission from AltSearchEngines.com.

Here is Part I

A Definition of Federated Search

While not everyone agrees on all details of what federated search is, here’s a fairly well-accepted definition:

Federated search is the process of performing a simultaneous real-time search of multiple diverse and distributed sources from a single search page, with the federated search engine acting as intermediary.

This definition is a mouthful, isn’t it? Let’s look at the key words in the definition and their influence on the value of federated search:

* federated – Content is combined from different sources, saving the effort of searching sources one at a time.
* simultaneous – Federated search queries all user-selected sources at once. It would be unacceptably slow if it waited for all of the results from one source before querying the next.
* real-time – Federated search occurs live and results are current. There’s no stale content.
* multiple – The value of federated search to the researcher increases as the number of sources increases.
* diverse sources – Federated search engines can typically search sources containing documents of different types, e.g. PDF, Word, PowerPoint. The process of extracting text from documents of different types is hidden from the user.
* distributed sources – Federated search engines expect to search content that lives in different locations.
* single search page – Federated search engines provide a single point of searching.
* federated search engine acting as intermediary – The federated search paradigm is such that the user doesn’t communicate directly with the content sources when performing searches. The user submits a search to the federated search engine which, in turn, submits the search to each of the content sources. Each content source provides its results to the federated search engine which combines all of the results from all the sources into a single page of results.
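To make "simultaneous" and "intermediary" concrete, here is a minimal Python sketch. The two source functions and their names are hypothetical stand-ins invented for illustration; a real federated search engine would talk to live content sources and handle errors and timeouts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real content sources. Each one takes a
# query string and returns a list of result titles.
def search_catalog(query):
    return [f"catalog: {query} handbook"]

def search_reports(query):
    return [f"reports: {query} annual review"]

SOURCES = {"catalog": search_catalog, "reports": search_reports}

def federated_search(query, sources=SOURCES):
    """Send the query to every source at once and combine the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Submit every search before waiting on any of them, so a slow
        # source does not delay the query to the next one.
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        combined = []
        for name, future in futures.items():
            for title in future.result():
                combined.append({"source": name, "title": title})
    return combined

if __name__ == "__main__":
    for hit in federated_search("solar"):
        print(hit["source"], "|", hit["title"])
```

The user only ever calls `federated_search`; the individual sources stay hidden behind it, which is the intermediary role described in the last bullet.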

Note that federated search was developed independently of the Web, and therefore federated search engines need not be Web-based.

What’s in a Name?

Federated search goes by a number of different names. Metasearch (or meta-search), distributed search, directed search, broadcast search, deep web search, cross-database search, and universal search are often, but not always, used synonymously with “federated search.”

Metasearch is a term that is often used to refer to a search engine that searches other major search engines. Dogpile, for example, is dedicated to searching the three big search engines: Google, Yahoo!, and MSN. Some would argue that metasearch engines aren’t federated search engines because, even though they search the underlying search engines in real time, the underlying search engines may not have the most current information since they themselves are “crawlers.”

Other Important Features of Federated Search

Three additional features are highly desirable, but not part of everyone’s definition of federated search. They are aggregation, ranking, and de-duplication (or “de-duping”), defined as follows:

Aggregation – Aggregation is the process of combining search results from the different sources in some helpful way. A federated search engine might present all of the results from one source then, beneath those results, present the results from the next source, and so on. Aggregation may incorporate sorting (e.g., by date, title, or author), or it may involve ranking, also known as relevance ranking.
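Here is a small Python sketch of two of the aggregation styles just described: listing results source by source, and sorting the combined results by date. The records and field names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Made-up results as they might come back from two sources.
results = [
    {"source": "A", "title": "Wind power", "date": "2008-05-01"},
    {"source": "B", "title": "Solar cells", "date": "2009-01-15"},
    {"source": "A", "title": "Fuel cells", "date": "2007-11-30"},
]

def group_by_source(results):
    """All results from one source, then all results from the next."""
    ordered = sorted(results, key=itemgetter("source"))
    return {
        source: list(items)
        for source, items in groupby(ordered, key=itemgetter("source"))
    }

def sort_by_date(results, newest_first=True):
    """One combined list; ISO dates (YYYY-MM-DD) sort correctly as strings."""
    return sorted(results, key=itemgetter("date"), reverse=newest_first)

if __name__ == "__main__":
    print(group_by_source(results))
    print([r["title"] for r in sort_by_date(results)])
```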

Ranking – A researcher searching a couple of dozen sources via a federated search engine usually wants to know which results are most relevant to his or her search from among all of the sources. Relevance ranking compares results from all sources against one another and displays the results in order of relevance. Surprisingly, not all federated search engines rank their results. This is largely because ranking is difficult to perform well.
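As a toy illustration only, the Python sketch below ranks results from all sources together by counting how often the query terms appear in each title. This scoring rule is an assumption chosen for brevity, not how any particular federated search engine ranks; real relevance ranking is far harder, as noted above.

```python
def score(query, text):
    """Count occurrences of each query term in the text (case-insensitive)."""
    terms = query.lower().split()
    words = text.lower().split()
    return sum(words.count(term) for term in terms)

def rank(query, results):
    """Order results from every source by descending score.

    Python's sort is stable, so results with equal scores keep
    their original relative order.
    """
    return sorted(results, key=lambda r: score(query, r["title"]), reverse=True)

if __name__ == "__main__":
    hits = [
        {"source": "A", "title": "Web search basics"},
        {"source": "B", "title": "Federated search and federated ranking"},
        {"source": "C", "title": "Library catalogs"},
    ]
    for hit in rank("federated search", hits):
        print(score("federated search", hit["title"]), hit["title"])
```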

De-duplication – A federated search engine may retrieve the same result or document from multiple sources. Users are not interested in seeing duplicate results, yet it turns out to be difficult to remove duplicates effectively. Two documents may have the same title and author, but might actually be different revisions of one document. How does the federated search engine decide which document, or documents, to return? Like ranking, de-duping is a challenge.
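To see why this is tricky, consider a naive Python sketch that treats two records as duplicates when their normalized title and author match. The record fields and the matching rule are assumptions for illustration. Note that this rule would also collapse two different revisions of one document, which is exactly the problem described above.

```python
def dedup_key(record):
    """Normalize title and author so trivial differences do not matter."""
    title = " ".join(record["title"].lower().split())
    author = record["author"].strip().lower()
    return (title, author)

def deduplicate(records):
    """Keep the first record seen for each (title, author) key."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

if __name__ == "__main__":
    records = [
        {"title": "Deep Web  Basics", "author": "J. Smith", "source": "A"},
        {"title": "deep web basics", "author": "j. smith ", "source": "B"},
        {"title": "Ranking Results", "author": "A. Jones", "source": "B"},
    ]
    for record in deduplicate(records):
        print(record["source"], "|", record["title"])
```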

Concluded in Part III tomorrow.

Questions? Leave a comment for Sol.

Sol Lederman is the primary author of the Federated Search Blog, a blog sponsored by Deep Web Technologies and dedicated solely to the federated search industry. He also writes for the U.S. Department of Energy’s Office of Scientific and Technical Information (OSTI) Blog, primarily covering OSTI’s accomplishments and technologies. Sol’s first love is mathematics; he enjoys giving away prizes to people who can solve math problems that he presents through his personal blog, Wild About Math!.

You can read his series on Federated Search on AltSearchEngines: here for Part I, here for Part II, and here for Part III.
