Better Police Work Through Data Mining
Often it seems that those same evil witches are behind the digital revolution: the systems we have designed generate so much data that we are drowning in it, as desperate as an enchanted princess trying to separate butter from flour in a baked cake. Of course, there are never any handsome princes to come to the rescue when you need them, but some very clever computer scientists are studying ways not only to sift the data but also to discover patterns in it that make it easier to interpret, manage and use, while learning more about the users.
Prof. Salvatore J. Stolfo and his colleagues at Columbia University are working under a Digital Government grant to create software to help sort through the mess. Called the Email Mining Toolkit (EMT), their system harvests data from email accounts and then computes behavior profiles for the users. Stolfo describes their work as a "much more sophisticated version" of conventional text retrieval software, with more statistical analysis.
In the "keyword search" software with which most of us are familiar, you type in one word, or two or more words joined by Boolean operators like "and" or "or," to find relevant documents. Thus, "Yankees" brings up many documents, but "Yankees and Red Sox" restricts the search to documents that include the names of both teams. A die-hard Red Sox fan might search on "Red Sox and (Yankees or Evil Empire)," which will bring up documents with "Red Sox and Yankees" or "Red Sox and Evil Empire."
Such searches are effective for the user, but ultimately they are little more than the digital version of an index. Most commercial search applications are simply straightforward, unintelligent retrieval engines, dutifully combing through thousands of documents. They can neither learn from nor indicate anything about a user's typical behavior. This type of search engine is useful if you're trying to find a letter in your own email inbox, but what if you are a system administrator trying to fight spam?
You want to know more about the behavior of the owner of the inbox. What is typical? What is atypical? For example, says Stolfo, people tend to communicate in cliques: we typically send one kind of email to a group of friends, another to family, and yet another to co-workers. Software that analyzes those habits can flag as suspicious any emails suddenly sent to everyone in these socially unrelated cliques - a behavior typical not of human beings, but of email viruses. Similarly, if an email account that is known to be active only during the workday starts to spew out emails in the middle of the night, it could indicate that the account has been taken over by a spambot.
This technology can be used to similar effect against criminal suspects. Analyzing when emails are sent, for example, can give a sense of what hours a person is keeping. Looking at patterns of activity can suggest potential co-conspirators. Stolfo is aware of the privacy issues, and emphasizes that, per wiretapping laws, use of the software in this way would require a court order to place the software on a client computer or on a server to be run against a mail log.
Stolfo and his team worked with the New York City Police Department to understand how a law enforcement agency might employ such a tool. "We learned a lot from them about what their needs were," says Stolfo. Philip McGuire, Assistant Commissioner of the NYPD, says it will likely be a useful tool in "the kinds of cases that produce massive amounts of email," such as white-collar crimes. Columbia University is in the middle of discussions to license the software, which should make a finished version available to the police department and others. McGuire says he's told Stolfo to keep in touch.
Stolfo dubs this the "VIP measure," adding, "You can build the hierarchy of an organization by looking at this behavior." All of these various kinds of behavioral measures can be set in boxes in the graphical interface Stolfo's team has designed.
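To make the clique and time-of-day measures concrete, here is a minimal sketch in Python. It is not EMT's actual algorithm: the function names, the representation of cliques as sets of addresses, and the thresholds are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime

def cliques_touched(recipients, cliques):
    """Count how many known cliques share at least one member with the recipients.

    `cliques` is a hypothetical mapping of clique name -> set of addresses.
    """
    recipients = set(recipients)
    return sum(1 for members in cliques.values() if recipients & members)

def spans_unrelated_cliques(recipients, cliques, min_cliques=3):
    """Flag a message addressed to several socially unrelated cliques at once.

    The min_cliques threshold is an assumption, not a value from EMT.
    """
    return cliques_touched(recipients, cliques) >= min_cliques

def usual_hours(history, min_count=2):
    """Return the hours of day in which the account has sent at least min_count messages."""
    counts = Counter(ts.hour for ts in history)
    return {hour for hour, n in counts.items() if n >= min_count}

def is_off_hours(sent_at, hours):
    """Flag a message sent at an hour outside the account's usual hours."""
    return sent_at.hour not in hours

if __name__ == "__main__":
    # Made-up example data, for illustration only.
    cliques = {
        "friends": {"ann@example.com", "bob@example.com"},
        "family": {"mom@example.com", "sis@example.com"},
        "work": {"boss@example.com", "dev@example.com"},
    }
    blast = ["ann@example.com", "mom@example.com", "boss@example.com"]
    print(spans_unrelated_cliques(blast, cliques))

    history = [datetime(2004, 1, 5, 9, 0), datetime(2004, 1, 6, 9, 30)]
    print(is_off_hours(datetime(2004, 1, 7, 3, 0), usual_hours(history)))
```

A real profile would presumably be statistical rather than rule-based, learning both the cliques and the normal hours from the account's own history, but the sketch shows the kind of signal each measure looks for.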
One of the problems of advanced data mining, he points out, is that it requires a knowledge of database tools, like SQL, that generally only professional programmers possess. The toolkit makes the training relatively easy for naïve users. The emails to be searched are first parsed in Java into a relational database. The hope is that the format is generic enough that the data will remain readable whatever may be developed in the future.
And the future is very much on Stolfo's mind. "Digital documents are piling up like paper ones," he says. "There will likely be 100 million emails from this administration alone. How will future historians go through it all?" Stolfo hopes EMT will help. Its ability to track changes in behavior, for example, could help researchers spot turning-point events. It may not be a handsome prince, but to an archivist, it may just seem like one.
This site is maintained by the Digital Government Research Center at the University of Southern California's Information Sciences Institute.