Avi Rappoport on Federated vs. Aggregated Search Architectures

The Federated Search Blog brings an important set of observations about search to our attention. Search consultant Avi Rappoport makes the case that discovery services are neither the holy grail of search nor the answer to everybody’s search problem. Search is complicated. As Sol Lederman writes in the Federated Search Blog article, “One size does NOT fit all.”

In May, search consultant Avi Rappoport delivered a presentation at the Enterprise Search Summit: Federated vs. Aggregated Search Architectures.

Avi Rappoport is an enterprise search consultant, helping companies improve search engine functionality for websites and intranets. She has a degree from UC Berkeley’s (then) School of Library and Information Science and spent 10 years in software development before becoming a search consultant. She is the editor of SearchTools.com and a frequent speaker and author, providing a strong focus on search usability in the broadest sense and sharing her conviction that search engines can always be better.

Avi created a web page with a summary of and links to a couple of versions of her presentation. I greatly appreciate Avi’s consideration of the pluses and minuses of federation and aggregation (i.e. discovery services) in a world that is often polarized about one approach being better in all cases.

My research for this presentation indicated that each is useful in specific circumstances (I know, no surprise there). Many data sources are obviously best accessed by one or the other, but it’s the corner cases that are tricky. Aspects to consider include:

  • size of the content in the source
  • how often your users need that content
  • content change rate
  • importance of real-time access control permissions changes
  • content licensing rules
  • available tools for indexing / querying
  • difficulty of extracting and indexing
  • quality of the internal search engine
  • difficulty of sending queries and receiving results

The final slide has some sage advice:

Be open-minded, analyze the benefits of each approach for each data source.

One size does NOT fit all.

Art and Science of De-duping Results

One of the critical problems in federated searching is de-duplication of results. Many sources contain the same journal articles and, clearly, presenting the same result multiple times isn’t useful to users. To solve this thorny problem, Deep Web Technologies has taken a flexible and configurable approach to de-duplication. The Explorit application de-dupes on multiple fields to ensure that users don’t see duplicate results across multiple sources. The application has conditional logic that compares various fields to decide whether two results should be considered duplicates.

The de-duplication mechanism compares multiple fields using boolean logic. Fields can be combined with boolean operators, for example (field A OR field B) or (field C AND field D). If either condition is true, the result is treated as a duplicate and removed according to the source de-dupe order. Additional fields can be added to the mix to improve accuracy, for example (field A + field N) OR (field B + field D).

De-duplication order specifies which sources take priority over other sources. Customers can control this order themselves. When results from sources lower on the de-dupe list are determined to be duplicates, those results are removed from the list. You could look at de-duplication order this way: you have source A, source B, and source C in your federated search application. Source A has a de-dupe order of 1 (this means this source’s results will be the highest priority). Source B has a de-dupe order of 10, and source C has a de-dupe order of 5. This means that if the same result is in both source B and source C, the result from source C will be displayed. If the same result is also in source A, only source A’s copy will be displayed.
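The source A, B, C example can be sketched in a few lines of Python. This is only an illustration of the ordering idea, not Explorit code: the `Result` class, the `DEDUPE_ORDER` table, and the function name are all made up for the example, and it assumes the results passed in have already been judged duplicates of one another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    source: str
    title: str


# De-dupe orders from the example above; a lower number means higher priority.
DEDUPE_ORDER = {"A": 1, "B": 10, "C": 5}


def keep_highest_priority(duplicates, order=DEDUPE_ORDER):
    """From results already judged to be duplicates, pick the one to display."""
    return min(duplicates, key=lambda result: order[result.source])


in_b_and_c = [Result("B", "Same article"), Result("C", "Same article")]
print(keep_highest_priority(in_b_and_c).source)  # C

in_all_three = in_b_and_c + [Result("A", "Same article")]
print(keep_highest_priority(in_all_three).source)  # A
```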

We have found the following de-duplication approach most effective for our federated search applications. The application first checks the full text URL of each result. If two results have the same full text URL, they are assumed to be duplicates, and the application does not display the result from the source with the lower de-duplication priority. Next, we check a combination of the article title and the publication date. If both fields match, the results are considered duplicates and the lower-priority results are removed.
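Here is a rough Python sketch of that two-step check (same full text URL, or same title plus same publication date). It is an illustration under assumptions, not the Explorit implementation: the `Result` fields, the whitespace and case normalization of titles, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    source: str
    url: str
    title: str
    date: str
    dedupe_order: int  # lower number = higher priority


def normalize(text):
    """Assumed normalization: ignore case and extra whitespace in titles."""
    return " ".join(text.lower().split())


def is_duplicate(a, b):
    # Check 1: same full text URL (skipped when a URL is missing).
    if a.url and b.url and a.url.strip() == b.url.strip():
        return True
    # Check 2: same title AND same publication date.
    return normalize(a.title) == normalize(b.title) and a.date == b.date


def dedupe(results):
    """Keep the highest-priority copy of each result, drop the rest."""
    kept = []
    # Visiting sources in priority order means the first copy seen wins.
    for candidate in sorted(results, key=lambda r: r.dedupe_order):
        if not any(is_duplicate(candidate, other) for other in kept):
            kept.append(candidate)
    return kept


results = [
    Result("B", "http://example.org/a1", "Thin Film Cells", "2010-05", 10),
    Result("C", "http://example.org/mirror/a1", "Thin  film cells", "2010-05", 5),
    Result("A", "http://example.org/a1", "Thin film cells (preprint)", "2010-04", 1),
    Result("B", "http://example.org/a2", "Wafer Bonding", "2009-11", 10),
]
for result in dedupe(results):
    print(result.source, result.title)
```

In this made-up data, source B's first result shares a URL with source A's, and source C's result matches source B's on title and date only. Because source A's copy is kept first, source B's copy is dropped by the URL check, while source C's copy survives: it matches neither the URL nor the title and date of anything still in the list.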

Finding the right balance of fields for de-duplication in a federated search application can be difficult, but Deep Web Technologies has the capability and the knowledge to determine which fields are best for your sources and your search needs.

Reminiscing on a 12-Year Partnership with OSTI

This afternoon, I put aside an hour from yet another hectic day to read Dr. Walter Warnick’s article, “Federated Search as a Transformational Technology Enabling Knowledge Discovery: the Role of WorldWideScience.org.” This article by Dr. Warnick (Walt to me) presents a wonderful overview of OSTI’s mission dating all the way back to 1947. OSTI (Department of Energy Office of Scientific and Technical Information), originally known as the Technical Information Division, was tasked with collecting and disseminating the wealth of non-classified research from the Manhattan Project. Having lived for the past 15 years in Los Alamos, where development of the atomic bomb took place, I’m very familiar with the history of the Manhattan Project and the reasons behind the creation of OSTI. Nevertheless, I found Walt’s article to be an informative and insightful read that provided a unique insider’s perspective.

Dr. Warnick talks quite a bit about the OSTI corollary, which asserts that accelerating the diffusion of scientific knowledge will accelerate the advancement of science. In the 12 years that I have known him, it has been Dr. Warnick’s singular goal to do everything in his power to increase the speed of scientific discovery. I know Walt to be a trail-blazer, highly respected among federal government employees for his dedication and leadership at OSTI. He has made major strides towards making science more accessible to “science-attentive” citizens, researchers and students.

The article focuses on the major role played by OSTI in championing, supporting and adopting federated search, which is the enabling technology for WorldWideScience.org, Science.gov, DOE Science Accelerator and other sites developed and maintained by OSTI. Deep Web Technologies has benefitted greatly from our 12-year partnership with OSTI, which has supported the development of the Explorit federated search technology, motivated us to keep pushing the boundaries of federated search capabilities and been an eager early adopter of our products.

In my next blog article, I will be highlighting a few of the many accomplishments achieved through our partnership with OSTI, so please stay tuned.

Less Books, not Bookless

As the Stanford Engineering Library nears the completion of its move into new facilities, so does its transition from a print-based library into an econtent-based one. According to an article published by the Library Journal, the library has moved more than 85% of its print collection (about 98,000 books and journals) to an offsite storage facility. In addition to e-books, the library is going electronic in other ways as well. New iPhone apps, digital bulletin boards, touch-screen kiosks, and an improved online course management system will all help to enhance a student’s library experience. Furthermore, students will be able to access a growing body of scientific databases and ebooks through xSearch, which was co-developed by Deep Web Technologies and Stanford. Seen as part of the phenomenon of “bookless” libraries, the digitization of the extensive Stanford Engineering Library collection testifies to the increasing integration of technology into academic research. While some resistance to this shift in the library experience does exist among users, there is no denying that the digital age has dramatically altered the role of the library in the educational process.

Breaking Down the Language Barriers

Photo credit: Jakke Nikkarinen/STT Info Kuva. Pictured, from left: Dr. Walter Warnick, U.S. Department of Energy Office of Scientific and Technical Information (OSTI) Director; Yuri Arskiy, All-Russian Institute of Scientific and Technical Information (VINITI) Director; Tony Hey, Microsoft Research Corporate Vice-President; Richard Boulderstone of the British Library and the WorldWideScience Alliance Chairman; and Wu Yishan, Institute of Scientific and Technical Information of China (ISTIC) Chief Engineer.

It was an honor to attend and for my company to have played a key role in the launch of multilingual WorldWideScience.org in Helsinki this past June 11th. Beginning more than three years ago, the R&D effort that ultimately resulted in the launch of our ground-breaking multilingual federated search capability involved plenty of hard work by lots of folks at Deep Web Technologies. It certainly could not have been accomplished without our invaluable partnerships with the Department of Energy Office of Scientific and Technical Information (OSTI), the WorldWideScience Alliance, and Microsoft Research.

Deep Web’s own Abe Lederman Speaks on Next-Gen Searching

In a follow-up to a previous article, Barbara Quint, the editor-in-chief of Searcher: The Magazine for Database Professionals and columnist for Information Today, interviewed Deep Web Technologies’ President and CEO Abe Lederman about the promises and challenges of federated search. Check out the excerpt below for a look at this enlightening conversation regarding the future of information retrieval.

Barbara Quint: So let me ask a basic question. What is your background with federated search and Deep Web Technologies?

Abe Lederman: I started in information retrieval way back in 1987. I’d been working at Verity for 6 years or so, through the end of 1993. Then I moved to Los Alamos National Laboratory, one of Verity’s largest customers. For them, I built a Web-based application on top of the Verity search engine that powered a dozen applications. Then, in 1997, I started consulting to the Department of Energy’s Office of Scientific and Technical Information. The DOE’s Office of Environmental Management wanted to build something to search multiple databases. Then, we called it distributed search, not federated search.

The first application I built is now called the Environmental Science Network. It’s still in operation almost 12 years later. The first version I built with my own fingers on top of a technology devoted to searching collections of Verity documents. I expanded it to search on the Web. We used that for 5 to 6 years. I started Deep Web Technologies in 2002 and around 2004 or 2005, we launched a new version of federated search technology written in Java. I’m not involved in writing any of that any more. The technology in operation now has had several iterations and enhancements and now we’re working on yet another generation.

BQ: How do you make sure that you retain all the human intelligence that has gone into building the original data source when you design your federated searching?

AL: One of the things we do that some other federated search services are not quite as good at is to try to take advantage of all the abilities of our sources. We don’t ignore metadata on document type, author, date ranges, etc. In many cases, a lot of the databases we search – like PubMed, Agricola, etc. – are very structured.

BQ: How important is it for the content to be well structured? To have more tags and more handles?

AL: The more metadata that exists, the better results you’re going to get. In the library world, a lot of data being federated does have all of that metadata. We spend a lot of effort to…

For the full interview, please visit the DCLnewsBlog.

Hot Tubs, Special Relativity and Subjective Time

Last night, I treated myself … no, I indulged myself … to a 30-minute hot tub under the stars. Alone, in my backyard, I stood looking up at the moon, and I was really struck by the contrast in front of me: the moon, sitting motionless above me and my hot tub, travels around the Earth as fast as a speeding bullet (i.e. the mean velocity of the moon around the Earth is approximately 1 km/s, and velocities of rifle bullets range from 0.37 to 1.2 km/s).
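For anyone who wants to check the “approximately 1 km/s” figure, here is a back-of-the-envelope calculation in Python. It treats the moon’s orbit as a circle, which is a simplification, and uses commonly quoted round values for the mean Earth-moon distance and the sidereal month; the constant and function names are just labels chosen for this sketch.

```python
import math

MEAN_ORBIT_RADIUS_KM = 384_400   # mean Earth-moon distance
SIDEREAL_MONTH_DAYS = 27.32      # one orbit measured against the stars
SECONDS_PER_DAY = 86_400


def mean_orbital_speed_km_s(radius_km, period_days):
    """Speed = distance around a circular orbit / time for one orbit."""
    circumference_km = 2 * math.pi * radius_km
    return circumference_km / (period_days * SECONDS_PER_DAY)


moon_speed = mean_orbital_speed_km_s(MEAN_ORBIT_RADIUS_KM, SIDEREAL_MONTH_DAYS)
print(f"{moon_speed:.2f} km/s")  # 1.02 km/s

# Rifle bullet range quoted above, in km/s.
print(0.37 <= moon_speed <= 1.2)  # True
```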

If you really think about it, these two facts appear to be mutually exclusive. How can the moon sit motionless above me, yet at the same time travel as fast as a speeding bullet? Anyone trained in basic physics, science or mathematics has the answer, of course (See An Empirical Explanation of the Speed-Distance Effect). However, it helps to illustrate a fundamental truism that applies in every facet of our lives: Everything, and I do mean everything, is relative. Seemingly contradictory facts, concepts or ideas can actually coexist or mean the same thing, because they are observed through the lenses of our respective points of view, perspective, context or situation. And likewise, seemingly identical facts, concepts or ideas can be different, depending on our respective points of view, perspective, context or situation.

We really came to grips with this notion under Einstein’s theories of relativity, which taught us that time itself could be relative (i.e. time dilation), depending on one’s velocity (special relativity) or proximity to gravitational bodies (general relativity). Objectively, actual measurements of speed, time or size vary, depending on the observer’s speed, distance or proximity to a gravitational body. And, therefore, one person’s view of the world can be very different from another person’s view of the world, yet both people can be factually and scientifically correct.

Subjectively, perceived measurements of speed, time and size can also vary, even when they really are the same.

Subjective measurements usually vary when compared to actual measurements, depending on an individual’s point of view, perspective, context or situation. Even assuming a measure of speed, time or size is objectively the same (which, as explained above, isn’t always the case), our individual perceptions are subjective. A retiree may feel they are “going fast” when behind the wheel of a car, but the teenager behind them thinks they are “going slow.” An interesting article was published several years ago about the concept of Subjective Time, and how our perceptions of time vary depending on how engaged we are, whether we’re doing something we’re interested in, and other factors. It’s a quick but thought-provoking article.

Subjective time, in the context of user experience on the Internet, is about reducing “boredom points” in user interaction. More fun, less yawn. It’s a powerful concept that is often overlooked, or marginalized, by focusing purely on the speed of a web-based application. In the context of federated search, the latest fad is discovery services (see Discovering Discovery Services in the Federated Search Blog, which we sponsor). The primary motivator behind discovery services is speed, without an evaluation of individual context and subjective time. As an attorney, I could never rely on a search mechanism that only searched the metadata of articles, as I need to be assured I will find articles containing the specific search terms I desire, not just articles that happen to contain the search term within the title or abstract.

Interestingly, because of my background in federated search (i.e. my context and situation), I understand and appreciate the limitations of discovery services. Most students and professionals do not understand the nuances of true federated search versus a discovery service, and therefore they rely on either Google or a discovery service at their peril, unbeknownst to them. Inadvertently, the ongoing pursuit of Google-like speeds has introduced hidden risks for users.

The incremental results feature of our product (see Scitopia.org for an example) represents a concerted effort to reduce subjective time, while providing access to the original sources for true full-text searching.
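To make the idea concrete, here is a small Python sketch. It assumes that “incremental results” means showing each source’s hits as soon as that source answers, instead of waiting for the slowest one; that reading, the function names, and the fake sources with artificial delays are all assumptions made for illustration, not a description of how the product is built.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_incrementally(connectors, query):
    """Yield (source_name, results) as each source finishes, fastest first."""
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        futures = {pool.submit(search, query): name
                   for name, search in connectors.items()}
        for future in as_completed(futures):
            yield futures[future], future.result()


def fake_source(delay_s, hits):
    """Stand-in for a real source: sleeps, then returns canned hits."""
    def search(query):
        time.sleep(delay_s)
        return [f"{hit} ({query})" for hit in hits]
    return search


connectors = {
    "slow_source": fake_source(0.2, ["full text article"]),
    "fast_source": fake_source(0.01, ["conference paper"]),
}

arrival_order = []
for name, results in search_incrementally(connectors, "solar cells"):
    arrival_order.append(name)
    print(name, results)  # the user sees something after the first source replies
```

The user starts reading the fast source’s hits while the slow source is still working, which is the whole point: the wall-clock time is the same, but the wait feels shorter.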

This brings me back to my hot tub, contemplating the contrasts of the actual versus the perceived speed of the moon, and how we as individuals make measurements based on our respective points of view, perspective, context or situation. The objective measures are the same, but:

  • 30 minutes of watching the moon feels like an instant to me, but is an intolerable and insufferable bore to my children (subjective time differences).
  • That driver who cut me off the other day was totally unjustified, yet was speeding home to comfort their dying relative (different perspectives).
  • The car the retiree thinks is going fast, is slow by the teenager’s standards (subjective speed differences).
  • Librarians may like discovery services because of their perceived speed, but they can’t guarantee a comprehensive search for the professional researcher, such as an attorney (different contexts).

Clusters That Think

One of the most interesting features of our Explorit search product is our clustering engine, which analyzes results and produces “clusters” that represent a new and powerful way to navigate search results. The true power of these clusters is often overlooked, for they superficially resemble the output generated by the keyword-based systems and fixed taxonomies of other search engines. Our clustering technology, however, is more akin to a document-discovery engine, which provides a significant improvement over the alternatives in the library world.

The Explorit engine provides a unique approach to clustering taken from Latent Semantic Analysis (or LSA). We took a look at some of the traditional methods of taxonomy generation (i.e. learning approaches, semantic knowledge bases, and word nets) and, after carefully examining their advantages and shortcomings, we chose latent semantic analysis and a “description comes first” approach to provide a rich result analysis tool for customers. LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of contextual usage of words in search results. This technology provides a concept-based approach to analyzing and clustering results from a result set. Applying the LSA approach, our clustering engine analyzes the relationships between a set of documents and the terms contained within the documents to produce a set of concepts related to the results. In other words, our search engines can generate more sophisticated and nuanced result clusters, which will help to cut down on the time and tries it takes for users to find the desired information.
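For readers curious about the mechanics, the core of textbook LSA is a truncated singular value decomposition of a term-document matrix: documents that use related vocabulary end up close together in a low-dimensional “concept” space. The toy Python sketch below shows that idea only. It is not the Explorit clustering engine; the sample documents and function names are invented, and the power-iteration routine is a dependency-free stand-in for the linear algebra library a real system would use.

```python
import math
import re
from collections import Counter


def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw term counts."""
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[term]) for c in counts] for term in vocab]


def top_components(matrix, k, iterations=200):
    """Approximate the k largest singular values and right singular vectors
    by power iteration with deflation on the document-by-document matrix A^T A."""
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    components = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed, non-degenerate start
        for _ in range(iterations):
            w = [sum(g * x for g, x in zip(row, v)) for row in gram]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        gv = [sum(g * x for g, x in zip(row, v)) for row in gram]
        eigenvalue = sum(a * b for a, b in zip(v, gv))
        components.append((math.sqrt(max(eigenvalue, 0.0)), v))
        # Deflate: remove the component just found before looking for the next.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return components


def document_vectors(components, n_docs):
    """Coordinates of each document in the k-dimensional concept space."""
    return [[sigma * v[j] for sigma, v in components] for j in range(n_docs)]


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


docs = [
    "satellite antenna pointing systems",
    "satellite communication antenna design",
    "pointing systems satellite communication",
    "library catalog metadata search",
    "metadata search library indexing",
]
matrix = term_document_matrix(docs)
vectors = document_vectors(top_components(matrix, k=2), len(docs))

print(round(cosine(vectors[0], vectors[1]), 3))  # same topic: close to 1
print(round(cosine(vectors[0], vectors[3]), 3))  # different topics: close to 0
```

Note that the first two documents share only two of their four words, yet they land almost on top of each other in concept space because both sit in the same web of co-occurring terms.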

More Meaningful Searches, Superior Cluster Results

A solid introduction to LSA can be found in the study, An Introduction to Latent Semantic Analysis, by Landauer, Foltz and Laham.

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA’s reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word–word and passage–word lexical priming data . . .

This means our clusters, leveraging the concepts behind LSA, actually discover relationships in the results and present them in a way that mimics the way users actually think. The superior quality of our clusters can perhaps best be demonstrated by comparing them to those of one of our competitors. Consider a search for “satellite communication”:

[Images: Our Clusters; Competitor’s Clusters]

As you can see, our clusters (on the far-right) provide far more meaningful results. The top cluster term provided by our competitor is “studies,” which provides no concrete information about the documents the set contains. Additionally, synonymous terms such as “United States” and “US” are treated as separate keywords by our competitor, which forces the user to sort through them manually to find what they are looking for. With our LSA-based clustering, results tend to be more relevant and more narrowly focused, with stop words removed from the cluster results. A user interested in “satellite communications pointing systems,” for example, can easily find the articles they are looking for with our clusters, while end users of the competition will no doubt have to run another search.

Users Think in Concepts, not Keywords

Our approach utilizes the entire set of search results and performs an LSA-type analysis, which helps reduce the cluster size and provide more granular results. Users can control cluster breadth (i.e. maximum number of top level clusters), cluster depth (i.e. maximum number of hierarchical levels), cluster arrangement (i.e. alphabetically or by occurrence), and cluster size. This means that the type of clustering can be configured to match the data sources in the federated search, narrowing or broadening clusters as desired. Simple keyword-based clustering cannot be customized in these ways. The Explorit approach matches the way that users actually think, which is in concepts, not keywords.

The clusters produced by our search engines can be enhanced and customized by utilizing synonyms (i.e. word aliases), label filtering (i.e. excluding offensive words), label boosting (i.e. promoting terms), and more. At Deep Web Technologies, we can tailor many of these settings per client request to create the best possible user experience for any project.
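As a rough picture of what synonyms, label filtering and label boosting could look like as post-processing on cluster labels, here is a short Python sketch. The function, its parameters, and the label counts are all invented for illustration and do not reflect Explorit’s actual configuration options. The blocked list here is used to drop an uninformative label rather than an offensive one, which is just one possible use of filtering.

```python
def refine_labels(label_counts, synonyms=None, blocked=(), boosted=()):
    """Merge synonym labels, drop blocked ones, and list boosted labels first."""
    synonyms = synonyms or {}
    blocked_set = {label.lower() for label in blocked}
    boosted_set = set(boosted)

    merged = {}
    for label, count in label_counts.items():
        canonical = synonyms.get(label, label)   # e.g. "US" -> "United States"
        if canonical.lower() in blocked_set:     # filtered labels never appear
            continue
        merged[canonical] = merged.get(canonical, 0) + count

    # Boosted labels first, then by descending count, then alphabetically.
    return sorted(
        merged.items(),
        key=lambda item: (item[0] not in boosted_set, -item[1], item[0]),
    )


raw = {
    "United States": 12,
    "US": 9,
    "studies": 30,
    "antenna pointing": 7,
    "ground stations": 7,
}
refined = refine_labels(
    raw,
    synonyms={"US": "United States"},
    blocked=["studies"],
    boosted=["antenna pointing"],
)
print(refined)
# [('antenna pointing', 7), ('United States', 21), ('ground stations', 7)]
```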

Benefits of Explorit’s LSA-based clusters over traditional taxonomy methods:

  • Clusters that reveal the concepts contained within the results, not just keywords
  • Natural language clusters, not keyword snippets
  • Discovery of concepts across disparate collections, journals and ebooks
  • Customization of synonyms for concepts
  • Tailored approach for unique settings (i.e. label filtering, boosting, sorting and more)

Our clustering solution provides capabilities far beyond a simple keyword-based system: it provides significant insight into the result set itself through the use of semantic analysis. This approach allows users to employ Explorit as a true discovery tool, identifying relationships between documents contained across multiple collections and sources. With our “deeper, richer” snippets approach to searching, the deep semantic discovery engine presents users with a more efficient and more powerful way to research.


No Alternative to Federated Search

[Editor's note: This article first appeared in the OSTI Blog and then in the Federated Search Blog. Dr. Walt Warnick, Director of the Office of Scientific and Technical Information, part of DOE, and Sol Lederman co-authored the article. For some important search applications there is no alternative to federated search.]

Discovery services have begun to appear in the search landscape. Discovery services provide access to documents from publishers with which they have relationships by indexing the publishers’ metadata and/or full text. Discovery services are marketed to libraries where patrons appreciate near-instantaneous search results and where library staff is willing to restrict access to sources available from the service (and optionally the library’s own holdings). While these services tout themselves as improvements to federated search, the reality is that there is no alternative to federated search for a number of important applications.

WorldWideScience.org is a global gateway to science. The federated search application was conceived and developed at OSTI and hosted by us. The portal performs live federated search of 70 databases from 66 countries. Participating members provide access to their national research databases. For a number of reasons this important gateway to millions of research documents does not lend itself to the discovery service model.

WorldWideScience.org content is free to the public. Several difficult technical hurdles make it highly impractical to index content from member databases. The first challenge is that most databases will not provide a harvesting mechanism such as OAI-PMH. Without such a mechanism there is no method of predictably harvesting the entire contents of a database. From OSTI’s perspective, it is not acceptable to provide access to only a subset of a scientific collection. Federated search completely avoids this problem by having the source’s search engine query the entirety of its contents.
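For context, when a database does support OAI-PMH, a harvester pages through the whole collection with `ListRecords` requests, following the `resumptionToken` the server returns with each partial response. The Python sketch below only builds the request URLs, to show the shape of that loop; the endpoint address and the token value are made up, and no request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    The first request names a metadata format. Follow-up requests carry only
    the resumptionToken returned with the previous page of records.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


BASE = "https://repository.example.org/oai"  # hypothetical endpoint

first_page = list_records_url(BASE)
next_page = list_records_url(BASE, resumption_token="page-2")
print(first_page)
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
print(next_page)
# https://repository.example.org/oai?verb=ListRecords&resumptionToken=page-2
```

A database that offers only a search form has no equivalent of this “give me everything, page by page” request, which is why complete harvesting cannot be guaranteed for it.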

The second major challenge is that metadata does not exist for documents in many of the databases in WorldWideScience.org. Discovery services rely upon metadata to “homogenize” information about documents that they place in their unified indexes.

A third challenge is that WorldWideScience.org will soon be multi-lingual. While discovery services could pre-translate contents, doing that would be impractical as the volumes are so huge and constantly expanding.

A fourth challenge to indexing all of the content from WorldWideScience.org is that the science portal federates portals which themselves are federated search applications. These challenges make indexing and packaging the contents of WorldWideScience.org so expensive, difficult, and time consuming that no organization is likely to do it.

The onerous technical hurdles that would need to be overcome to make content such as that in WorldWideScience.org searchable by a discovery service illuminate the case for federated search. In the federated search model, content providers need only provide a search interface to their database, which they are already providing to their users. Ideally, the search interface is one that lends itself to machine search and retrieval. But even if it is not, in most cases, if a human can search it, a federated search application can be programmed to search it also. Also, federated search does not expect metadata. WorldWideScience.org serves its content owners by eliminating all barriers to participation. Even language translation is not a burden to the database owners. If the member nations sanction a particular database then the burden of inclusion of that database is taken on solely by the vendor that developed and maintains the federated search engine, Deep Web Technologies.
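The “if a human can search it, a federated search application can be programmed to search it” idea can be pictured as a set of small connectors, one per source, that all expose the same `search(query)` call. The Python sketch below is a simplified illustration, not Deep Web Technologies’ connector code: the URLs, the JSON layout, the `class="result"` markup, and the canned responses standing in for real HTTP calls are all hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import quote_plus


class ResultLinkParser(HTMLParser):
    """Collect the text of every <a class="result"> link on a results page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_result = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "result") in attrs:
            self._in_result = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_result = False

    def handle_data(self, data):
        if self._in_result:
            self.titles[-1] += data


def json_connector(fetch, url_template):
    """Connector for a source with a machine-friendly search interface."""
    def search(query):
        body = fetch(url_template.format(query=quote_plus(query)))
        return [record["title"] for record in json.loads(body)["records"]]
    return search


def html_connector(fetch, url_template):
    """Connector for a source that only offers a human-facing results page."""
    def search(query):
        parser = ResultLinkParser()
        parser.feed(fetch(url_template.format(query=quote_plus(query))))
        return [title.strip() for title in parser.titles]
    return search


def federated_search(connectors, query):
    """Send the same query to every source through its own connector."""
    return {name: search(query) for name, search in connectors.items()}


# Canned responses stand in for real network calls in this sketch.
CANNED = {
    "https://api.example.org/search?q=gallium+arsenide":
        '{"records": [{"title": "GaAs substrates"}]}',
    "https://portal.example.org/find?terms=gallium+arsenide":
        '<ul><li><a class="result" href="/d/1">Gallium arsenide wafers</a></li></ul>',
}


def fake_fetch(url):
    return CANNED[url]


connectors = {
    "json_source": json_connector(fake_fetch, "https://api.example.org/search?q={query}"),
    "html_source": html_connector(fake_fetch, "https://portal.example.org/find?terms={query}"),
}
hits = federated_search(connectors, "gallium arsenide")
print(hits)
```

Neither source had to export its records or supply metadata; each only had to answer a query the way it already does for its own users.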

Another advantage of federated search is that applications can be easily integrated with other applications. For example, ScienceResearch.com provides access to a mix of proprietary and open content, such as WorldWideScience. Through our federated search approach, the WorldWideScience.org Alliance maintains autonomy while extending the reach of its materials. Best of all, we do all of this without burdening anyone. In this way we advance our mission of accelerating science.

But don’t get us wrong. We at OSTI would welcome a discovery service which seeks to make DOE material more accessible. OSTI systems are already set up to facilitate such a collaboration. However, the technology of discovery services is less suitable for certain important purposes, like WorldWideScience.org, now fulfilled by federated search.

Walt Warnick
OSTI Director

Sol Lederman
OSTI Consultant

Smart People Love Federated Search

By now most of us are pretty familiar with the “information overload” problem parodied in Bing’s current advertising campaign. In case you’re not, information overload happens when you naively use a popular search engine expecting to find some specific information, like the real-world fuel economy of a used car you’re thinking of buying, and some time later find yourself staring at a picture of two garden gnomes kissing in the back seat of a 2002 Acura RSX. (This is a real photograph, and I saved it as proof.) While “Hey, Look At This Weird Thing Google Found!” has become an actual form of entertainment in my household, conventional search engines can present real problems when used for research.

Federated search is renowned among serious researchers as a way to cut through the garden gnomes and other spurious results by searching only select information sources and by targeting the deep web. This means that an electrical engineer who wants to read up on solar cell fabrication can search for “gallium arsenic” on a federated search site like ScienceResearch.com and quickly uncover the highest-quality information because only science-specific sources are searched, instead of the entire spectrum of the Internet. (See? Serious researchers.) Additionally, at a library that subscribes to electronic resources and uses federated search to access those subscriptions, our electrical engineer would have one-stop access to full text articles that could never be located through a popular search engine.

For example, here’s what happens when I search Google for “gallium arsenic.”

First, Google corrects me. I couldn’t possibly know precisely what I’m searching for, so it changes my search term to “gallium arsenide” without my permission. This is annoying to someone who’s just pretending to be a serious researcher, so I can only imagine what our electrical engineer would be thinking if he needed to find that really great article he stumbled across not too long ago with “gallium arsenic” in the title. The next problem is that the top two results are the Wikipedia entry on gallium arsenide, and a sponsored link from a company that sells manufacturing quantities of gallium arsenide. Well, I already know what it is, and I don’t have room in my garage. There are links to scientific journals lower in the results, but they’re scattered among more commercial sites, a sustainable energy wiki, and a page advertising a conference in Oregon. Sensing that a gnome might pop up at any moment, I click “Search instead for gallium arsenic.” (At least Google offers to search for what I wanted after it’s made up my mind for me.) The same sort of results come up – not bad, but not what I need if I’m going to learn about the different substrates that are used in the fabrication of gallium arsenic cells.

Contrast this with what happens when I search ScienceResearch.com for “gallium arsenic” – or “gallium arsenide,” as the case may be. First, the search is run on the search term I asked for. Calling this a bonus of federated search might seem too much like I’m hard-selling you that used car by saying, “Plus, when you turn the key, the engine starts!” but in comparison to Google’s behavior, it’s an important feature. Now for the results: the top results are from the National Institute of Standards and Technology, which just might be a more reliable source than Wikipedia, and following that is a result about “Gallium-Arsenic Substrate Fixture and Substrate Fixing Methods.”

Our electrical engineer is pleased. And with the time he saved by using federated search, maybe now he can have a little fun with Google.