DigitalGovernment.org - Home of the Nat'l. Science Foundation Digital Government Research Program

Better Police Work Through Data Mining
Columbia researchers develop email scanning system that could help police spot crime and conspiracy
By Karen Heyman
For the DGRC

Researcher profile: Salvatore J. Stolfo

Project Profile:
Email Mining Toolkit Supporting Law Enforcement Forensic Analyses

We all recall childhood fairy tales of princesses who were condemned by evil witches to spin straw into gold or move sand dunes with tweezers.

Often it seems that those same evil witches are behind the digital revolution: the systems we have designed generate so much data that we are drowning in it, as desperate as an enchanted princess trying to separate butter from flour in a baked cake.

Of course, there are never any handsome princes to come to the rescue when you need them, but some very clever computer scientists are studying ways not only to sift the data, but also to discover patterns in it that make it easier to interpret, manage and use, while learning more about the users.

Prof. Salvatore J. Stolfo and his colleagues at Columbia University are working under a Digital Government grant to create software to help sort through the mess. Called the Email Mining Toolkit (EMT), their system harvests data from email accounts and then computes behavior profiles for the users.

Stolfo describes their work as a "much more sophisticated version" of conventional text retrieval software, with more statistical analysis.

In the "keyword search" software with which most of us are familiar, you type in one word, or two or more words joined by Boolean operators such as "and" or "or," to find relevant documents. Thus, "Yankees" brings up many documents, but "Yankees and Red Sox" restricts the search to documents that include the names of both teams. A die-hard Red Sox fan might search on "Red Sox and (Yankees or Evil Empire)," which will bring up documents with "Red Sox and Yankees" or "Red Sox and Evil Empire."
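The Boolean queries described above can be illustrated with a minimal sketch. The three sample documents and the `search` helper are hypothetical, invented for this example, and plain substring matching stands in for whatever indexing a real search engine would use.

```python
# Toy Boolean keyword search over a hypothetical three-document collection.
DOCS = {
    "doc1": "The Yankees won again last night.",
    "doc2": "Yankees and Red Sox meet at Fenway this weekend.",
    "doc3": "The Red Sox face the Evil Empire in the playoffs.",
}


def contains(text, phrase):
    """Case-insensitive substring match for a word or phrase."""
    return phrase.lower() in text.lower()


def search(docs, all_of=(), any_of=()):
    """Return sorted ids of documents that contain every phrase in all_of
    and, when any_of is non-empty, at least one phrase in any_of."""
    hits = []
    for doc_id, text in docs.items():
        if not all(contains(text, p) for p in all_of):
            continue
        if any_of and not any(contains(text, p) for p in any_of):
            continue
        hits.append(doc_id)
    return sorted(hits)


# "Yankees"
yankees_only = search(DOCS, all_of=["Yankees"])
# "Yankees and Red Sox"
both_teams = search(DOCS, all_of=["Yankees", "Red Sox"])
# "Red Sox and (Yankees or Evil Empire)"
rivalry = search(DOCS, all_of=["Red Sox"], any_of=["Yankees", "Evil Empire"])
```

Each added "and" term narrows the result set, while the "or" group widens it. Nothing in the search records anything about the person doing the searching.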

Such searches are effective for the user, but ultimately they are little more than the digital version of an index. Most commercial search applications are simply straightforward, unintelligent retrieval engines, dutifully combing through thousands of documents. They can neither learn from nor indicate anything about a user's typical behavior.

This type of search engine is useful if you're trying to find a letter in your own email inbox, but what if you are a system administrator trying to fight spam? You want to know more about the behavior of the owner of the inbox. What is typical? What is atypical?

For example, says Stolfo, people tend to communicate in cliques: we typically send one kind of email to a group of friends, another to our family, and yet another to co-workers. Software that analyzes those habits can flag as suspicious any emails suddenly sent to everyone in these socially unrelated cliques - a behavior typical not of human beings, but of email viruses.
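A rough sketch of the clique idea follows. It assumes the cliques are already known and hand-labeled. The article does not say how EMT derives cliques, so the `CLIQUES` table, the addresses, and the `max_cliques` threshold are all hypothetical.

```python
# Hypothetical, hand-labeled cliques for one account. A real system would
# have to learn these groupings from the account's past email traffic.
CLIQUES = {
    "friends": {"ann@example.com", "bob@example.com"},
    "family": {"mom@example.com", "dad@example.com"},
    "coworkers": {"carol@example.com", "dave@example.com"},
}


def cliques_touched(recipients, cliques):
    """Names of the cliques that share at least one member with recipients."""
    recipients = set(recipients)
    return sorted(name for name, members in cliques.items() if members & recipients)


def is_suspicious(recipients, cliques, max_cliques=1):
    """Flag a message whose recipients span more cliques than usual.

    max_cliques is an assumed threshold, not a value from the article.
    """
    return len(cliques_touched(recipients, cliques)) > max_cliques


normal = ["ann@example.com", "bob@example.com"]
blast = ["ann@example.com", "mom@example.com", "carol@example.com"]

normal_flag = is_suspicious(normal, CLIQUES)  # stays within one clique
blast_flag = is_suspicious(blast, CLIQUES)    # spans three unrelated cliques
```

A message to two friends passes, while a single message that reaches friends, family and co-workers at once is flagged for review.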

Similarly, if an email account that is known to be active only during the workday starts to spew out emails in the middle of the night, it could indicate that the account has been taken over by a spambot.
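The time-of-day check can be sketched as a simple hourly profile. This is an illustration under stated assumptions, not EMT's actual model: the synthetic history, the function names, and the 2 percent `min_share` cutoff are invented for the example.

```python
from collections import Counter
from datetime import datetime


def hourly_profile(timestamps):
    """Fraction of historical messages sent in each hour of the day (0-23)."""
    if not timestamps:
        raise ValueError("need at least one historical message")
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    return {hour: counts.get(hour, 0) / total for hour in range(24)}


def is_unusual(ts, profile, min_share=0.02):
    """True if the account has historically sent less than min_share of its
    mail during this hour. min_share is an assumed cutoff."""
    return profile[ts.hour] < min_share


# Synthetic history: one message per hour from 09:00 to 17:00 on ten days.
history = [
    datetime(2006, 3, day, hour, 15)
    for day in range(1, 11)
    for hour in range(9, 18)
]
profile = hourly_profile(history)

night_flag = is_unusual(datetime(2006, 3, 12, 3, 0), profile)     # 3 a.m.
workday_flag = is_unusual(datetime(2006, 3, 12, 10, 30), profile)  # 10:30 a.m.
```

With this history, a message sent at 3 a.m. falls in an hour the account has never used and is flagged, while one sent mid-morning is not.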

This technology can be used to similar effect against criminal suspects. Analyzing when emails are sent, for example, can give a sense of what hours a person is keeping. Looking at patterns of activity can suggest potential co-conspirators.

Stolfo is aware of the privacy issues, and emphasizes that per wiretapping laws, use of the software in this way would require a court order to place the software on a client computer or on a server to be run against a mail log.

Stolfo and his team worked with the New York City Police Department to understand how a law enforcement agency might employ such a tool. "We learned a lot from them about what their needs were," says Stolfo.

Philip McGuire, Assistant Commissioner of the NYPD, says it likely will be a useful tool in "the kinds of cases that produce massive amounts of email," such as white collar crimes. Columbia University is in the middle of discussions to license the software, which should make a finished version available to the police department and others. McGuire says he's told Stolfo to keep in touch.

Behavioral analysis can also give a sense of organizational structure by showing not only whom someone replies to, but in what order and how quickly. For example, the person whose emails are answered first and fastest is likely to be the boss.

Stolfo dubs this the "VIP measure," adding, "You can build the hierarchy of an organization by looking at this behavior."
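One simplified way to picture such a measure is to rank correspondents by how quickly their messages get answered. The article does not give EMT's formula, so the mean-delay ranking below is only an assumed stand-in, and the addresses and reply times are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reply log for one user: (original sender, minutes until reply).
replies = [
    ("boss@example.com", 5),
    ("boss@example.com", 9),
    ("peer@example.com", 45),
    ("peer@example.com", 75),
    ("vendor@example.com", 240),
]


def rank_by_response_time(records):
    """Return (sender, mean reply delay in minutes) pairs, fastest first."""
    delays = defaultdict(list)
    for sender, minutes in records:
        delays[sender].append(minutes)
    averages = {sender: mean(values) for sender, values in delays.items()}
    return sorted(averages.items(), key=lambda item: item[1])


ranking = rank_by_response_time(replies)
likely_vip = ranking[0][0]  # the sender answered fastest on average
```

Repeating this for every member of an organization and comparing the rankings is one plausible route to the kind of hierarchy Stolfo describes.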

All of these behavioral measures can be set in boxes in the graphical interface Stolfo's team has designed. One of the problems of advanced data mining, he points out, is that it requires knowledge of database tools, such as SQL, that generally only professional programmers possess. The toolkit makes the training relatively easy for naïve users.

The emails to be searched are first parsed by a Java program into a relational database. The hope is that this format is generic enough that the data will remain readable whatever may be developed in the future. And the future is very much on Stolfo's mind.
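The parsing step can be sketched as follows. EMT itself does this in Java. This illustration uses Python's standard `email` and `sqlite3` modules instead, and the two-table schema, the `load_message` helper, and the sample message are assumptions, not EMT's actual design.

```python
import sqlite3
from email import message_from_string
from email.utils import getaddresses, parseaddr, parsedate_to_datetime

# A made-up message in standard Internet mail format.
RAW = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>, carol@example.com
Subject: Lunch
Date: Mon, 06 Mar 2006 12:30:00 -0500

See you at noon.
"""


def load_message(conn, raw):
    """Parse one raw message and store it as relational rows."""
    msg = message_from_string(raw)
    sender = parseaddr(msg["From"])[1]
    sent_at = parsedate_to_datetime(msg["Date"]).isoformat()
    cur = conn.execute(
        "INSERT INTO message (sender, subject, sent_at) VALUES (?, ?, ?)",
        (sender, msg["Subject"], sent_at),
    )
    msg_id = cur.lastrowid
    # One row per recipient, so sender-recipient pairs can be queried later.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    conn.executemany(
        "INSERT INTO recipient (message_id, address) VALUES (?, ?)",
        [(msg_id, addr) for _name, addr in recipients],
    )
    return msg_id


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, sent_at TEXT);
CREATE TABLE recipient (message_id INTEGER REFERENCES message(id), address TEXT);
""")
load_message(conn, RAW)

rows = conn.execute(
    "SELECT m.sender, r.address FROM message m "
    "JOIN recipient r ON r.message_id = m.id ORDER BY r.address"
).fetchall()
```

Once messages are stored as plain rows like these, questions about who writes to whom, and when, become ordinary database queries that do not depend on any particular mail program.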

"Digital documents are piling up like paper ones," he says. "There will likely be 100 million emails from this administration alone. How will future historians go through it all?" Stolfo hopes the EMT will help. Its ability to track changes in behavior, for example, could help researchers spot turning-point events. It may not be a handsome prince, but to an archivist, it may just seem like one.