DigitalGovernment.org - Home of the Nat'l. Science Foundation Digital Government Research Program

DG Researchers Re-Tool a Venerable Prototype
Study Explores Data-Mining Techniques as Ways of Refining and Focusing FedStats' Web Search
By Karen Heyman
For the DGRC

Exploring "Federated Search"
  Researchers:
Jamie Callan
Marshall DeBerry

Project
Project home page
DG project profile
FedStats.gov



The historic prototype of digital government projects may well have been the 1890 Census: The U.S. government, still in shock from the seven years it had taken to tabulate the 1880 Census, turned to a punch card sorter invented by Census Bureau statistician Herman Hollerith. The machine cut the time to two and a half years and saved the government $5 million, according to the Web site of the company Hollerith later founded, now called IBM.

The challenges of handling large amounts of data continue to drive innovation at the U.S. Census Bureau and other statistical agencies, according to Marshall DeBerry, Program Manager of the FedStats site. Now, FedStats is seeking help from a DG research project - "A Language-Modeling Approach To Metadata for Cross-Database Linkage and Search" - which is exploring data mining techniques as a way of consolidating and distilling search results for better-focused searches.

"The statistical agencies have always been in the forefront of information technologies. We've always confronted the issue of how to get data in and out quickly, and have had to be innovative in ways of addressing that problem," DeBerry explains.

One of the most impressive contemporary uses of information technology, FedStats is the federal government's one-stop shop for every imaginable kind of government statistic. "In the late '90s, in a lot of ways the government was way ahead in putting out information," says DeBerry. "FedStats was one of the first portals."

But being at the forefront in IT has well-known drawbacks, DeBerry acknowledges: "Anything you do with IT eventually becomes a legacy environment. You have to struggle against that constantly. The thing that makes your life so much easier in the beginning, several years down the road, can become a millstone."

FedStats, which consistently wins awards like PC Magazine's "101 Incredibly Useful Sites," is now addressing the legacy environment challenge. The site was designed in 1995 and went public in 1997, which means it preceded the present era of advanced search algorithms. FedStats' search engine currently runs on an open-source program, with only minimal customization.

Seeking a better way to ensure that searches could be weighted to return the most relevant results first, DeBerry enlisted the help of Digital Government researcher Jamie Callan, associate professor of Computer Science at Carnegie Mellon University.

Callan's work focuses on uncovering what's been dubbed "The Invisible Web," or "The Hidden Web," the millions of pages of documents that conventional search engines are unable to access. Most publicly available search engines use software, often called a "spider," that "crawls" through pages to create keyword indexes that the engines can later check when a user types a search query.

But two kinds of Web sites can stop these spiders at the door: those that employ specific codes to exclude them and those that require completing forms. The spiders simply record what they find; they are not capable of responding to questions. If a Web site requires a password or the use of its own internal search engine to proceed, otherwise remarkable software like Google's is stopped cold.
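The article does not name the "specific codes" that exclude crawlers. A common mechanism is the Robots Exclusion Protocol, a robots.txt file that a well-behaved crawler consults before fetching a page. The sketch below assumes that mechanism and uses Python's standard `urllib.robotparser`; the rules, the `ExampleBot` user agent, the `example.gov` URLs, and the `crawler_may_fetch` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

def crawler_may_fetch(url, user_agent="ExampleBot", robots_txt=ROBOTS_TXT):
    """Return True if the given rules allow this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# An ordinary page is allowed; the site's search form is off limits.
print(crawler_may_fetch("https://example.gov/about.html"))
print(crawler_may_fetch("https://example.gov/search?q=census"))
```

Pages reachable only through the excluded search form never make it into a crawler's index, which is one way content ends up in the Invisible Web.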

In order to find documents hidden in the Invisible Web, Callan is working to develop "federated search" technology, a reference not to its employment on FedStats, but to the notion of searching through the results produced by multiple search engines with one tool. "We've developed a technique for automatically sampling the contents of large databases that require individual queries," says Callan.

It's a statistical sampling method, not unlike opinion polling, he explains: "The software sends random queries to a site, looks at the documents that come back, and then selects some words from those documents and feeds them back in." The process is repeated until the program has determined the subject of a site. "It homes in on the content pretty quickly; after we've seen 300 documents from a site, we know as much about the general subjects covered by the site as if we'd seen the entire site," says Callan. "For example, you don't need to see that many Wall Street Journal articles to figure out that it covers business, finance, and politics."
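The sampling loop Callan describes can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the number of documents kept per query, the random choice of the next query word, and the toy in-memory "site" are all assumptions. Only the overall loop and the 300-document figure come from the article.

```python
import random
from collections import Counter

def sample_site(search, seed_word, max_docs=300, docs_per_query=4,
                max_queries=1000, rng=None):
    """Learn a site's vocabulary using only its own search interface.

    `search(query)` is assumed to return a list of (doc_id, text) pairs.
    """
    rng = rng or random.Random(0)
    seen = {}               # doc_id -> text of each sampled document
    vocabulary = Counter()  # word -> occurrences across sampled documents
    query = seed_word
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        # Look at the first few documents that come back for this query.
        for doc_id, text in search(query)[:docs_per_query]:
            if doc_id not in seen:
                seen[doc_id] = text
                vocabulary.update(text.lower().split())
        # Feed a word from the documents seen so far back in as the next query.
        query = rng.choice(sorted(vocabulary)) if vocabulary else seed_word
    return seen, vocabulary

def make_toy_search(docs):
    """Stand-in for a site's search engine: exact word match over a list."""
    def search(query):
        q = query.lower()
        return [(i, t) for i, t in enumerate(docs) if q in t.lower().split()]
    return search

docs = ["stocks rose on wall street",
        "bond markets and stocks fell",
        "election politics and markets"]
seen, vocab = sample_site(make_toy_search(docs), "stocks", max_docs=3)
print(len(seen), vocab.most_common(3))
```

The third toy document never matches the seed word "stocks"; the sampler reaches it only by reusing words such as "markets" learned from earlier results.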

The contents of each site are represented by a word histogram, which just lists the words that occur on the site and how often each one occurs. The program does not use a built-in dictionary or complex ontology. "The language of the site becomes its own metadata," says Callan. "The software can tell if it's a medical site for professionals or for laypeople by the words that it uses."
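A word histogram of this kind takes only a few lines to build. The sketch below is illustrative; the tokenizing rule (lowercase runs of letters and apostrophes) and the `word_histogram` name are assumptions, since the article does not say how the project splits text into words.

```python
import re
from collections import Counter

def word_histogram(documents):
    """Count how often each word occurs across a site's sampled documents."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

lay_site = word_histogram(["Patients may feel dizzy.",
                           "Dizzy spells: see a doctor."])
print(lay_site.most_common(3))
```

No dictionary or ontology is involved: the counts themselves describe the site, so a site for medical professionals would show a different mix of words than one written for laypeople.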

When a person submits a query, the program matches it against the word histograms it has collected to find the sites that cover that topic. It then sends the query to the search engines at those sites, collects the results returned by each engine, and merges them into a single, integrated list of search results. By working through the search engines at each site, this approach is able to bring back pages that are invisible to conventional search crawlers.
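The three steps in this paragraph (pick sites, forward the query, merge the results) might look like the sketch below. The scoring and merging rules here are deliberately simple stand-ins: the article does not describe the formulas the project uses, so relative word frequency for site selection and divide-by-top-score normalization for merging are assumptions, as are all function, site, and page names.

```python
def site_score(query, histogram):
    """Score a site by how frequent the query words are in its histogram."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return sum(histogram.get(w, 0) / total for w in query.lower().split())

def select_sites(query, histograms, k=2):
    """Return up to k site names whose histograms best match the query."""
    ranked = sorted(histograms,
                    key=lambda s: site_score(query, histograms[s]),
                    reverse=True)
    return [s for s in ranked[:k] if site_score(query, histograms[s]) > 0]

def merge_results(results_by_site):
    """Merge per-site (page, score) lists into a single ranked list.

    Each engine scores on its own scale, so scores are divided by that
    site's top score before the lists are combined.
    """
    merged = []
    for site, results in results_by_site.items():
        if not results:
            continue
        top = max(score for _, score in results)
        for page, score in results:
            merged.append((score / top if top else 0.0, site, page))
    merged.sort(key=lambda r: r[0], reverse=True)
    return merged

def federated_search(query, histograms, engines, k=2):
    """Send the query only to the selected sites' own search engines."""
    chosen = select_sites(query, histograms, k)
    return merge_results({s: engines[s](query) for s in chosen})

# Hypothetical sites and engines, for illustration only.
histograms = {
    "labor": {"unemployment": 50, "wages": 50},
    "health": {"disease": 90, "unemployment": 10},
    "energy": {"oil": 100},
}
engines = {
    "labor": lambda q: [("labor/a", 10.0), ("labor/b", 5.0)],
    "health": lambda q: [("health/x", 0.8)],
    "energy": lambda q: [("energy/z", 1.0)],
}
for score, site, page in federated_search("unemployment", histograms, engines):
    print(f"{score:.2f}  {site}  {page}")
```

Because the query goes to each site's own engine at search time, the merged list can include pages that a crawler never indexed.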

FedStats, with its thousands of returns from dozens of government databases, is an ideal site to employ Callan's software.

The new software offers several major improvements over the current search system. It can search within government databases to find documents that would have remained "hidden" because of the limitations of crawler-based search. In addition, the current system relies on caching, so its returns are just snapshots in time, not necessarily the most recent version of a page; Callan's software allows a real-time search that would return up-to-the-minute information. Finally, it allows for better-ranked results, so that users will see the most relevant returns at the top.

Currently, Callan and DeBerry are using a working prototype to test the software's application to FedStats. The aim is to incorporate the new search into FedStats sometime this summer.

DeBerry hopes it will become another in a long, successful line of statistical agency technologies that started with Hollerith's punch cards. "This is a historical progression. We've tried to always look ahead at how we can best address the needs of the larger community of citizens and taxpayers. Working with Jamie, we are able to tap into the current thinking on searches, which I hope makes the site better for users."