Dr. Massimiliano Albanese
University of Maryland, College Park

INFORMATION EXTRACTION AND INTEGRATION (HONR229C)

Spring 2009, Tuesday/Thursday, 3:30-4:45 p.m.

In recent years, researchers in diverse fields have been confronted with the increasing availability of information in the form of natural language texts. The increased accessibility of textual information has led to a corresponding interest in technology for automatically processing this often-overwhelming quantity of text in order to extract relevant information. This demand has stimulated the development of Information Extraction and Information Integration technologies.

Information Extraction (IE) is the emerging Natural Language Processing (NLP) technology aimed at processing unstructured, natural language text, in order to locate specific pieces of information or facts in the text, and to use these facts to fill a database.
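As a rough illustration of "locating facts in text and using them to fill a database", the following minimal Python sketch extracts one kind of fact with a hand-written pattern. The pattern, the sample sentences, and the names APPOINTMENT and extract_appointments are all hypothetical and chosen only for this example; real IE systems rely on far more robust linguistic analysis.

```python
import re

# Hypothetical pattern for a single fact type:
# "<Company> appointed <Person> as <role>" followed by a period or comma.
APPOINTMENT = re.compile(
    r"(?P<company>[A-Z][\w&]*(?: [A-Z][\w&]*)*) appointed "
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) as "
    r"(?P<role>[A-Za-z ]+?)[.,]"
)

def extract_appointments(text):
    """Return one dict (a database-style row) per appointment fact found in text."""
    return [match.groupdict() for match in APPOINTMENT.finditer(text)]

# Made-up sample text; sentences without an appointment yield no row.
text = (
    "Shares rose on Monday. Acme Widgets appointed Jane Doe as chief executive. "
    "Analysts said Globex appointed John Smith as chairman, ending a long search."
)

rows = extract_appointments(text)
for row in rows:
    print(row)
```

Each resulting dictionary plays the role of a database record with the fields company, person, and role; the rest of the text is ignored.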

Unlike Information Retrieval (IR), which finds relevant documents from a collection of documents and presents them to the user, IE analyzes text and presents only the specific information the user is interested in. Today's search engines (Google, etc.) are mostly based on an IR approach: they return relevant documents, not answers to questions. Information Extraction might thus be the key to a new generation of search engines. On the other hand, Information Integration (II) refers to a number of methodologies to reason with data taken from multiple sources and integrate them into a coherent view. Together these techniques enable automation of the tremendously challenging task of deriving structured information from text, and relating it to previously known facts.

The number and types of Information Extraction applications are continuously increasing as more and more scientists from the most diverse disciplines turn to Information Extraction tools to support and simplify their work. Researchers in Political Science, Criminology, Homeland Security, Biology, and many other disciplines are now working more closely with computer scientists to make IE tools more effective and better tailored to their actual needs, thus helping to turn IE into a truly interdisciplinary field. The list of possible scenarios where the tools provided by research in IE are in high demand is virtually endless. As an example, political scientists need to derive, from daily news reports or other sources, a number of indicators to monitor the interactions among political groups and the relationships among different nations. Defense analysts have a continuous and pressing need to analyze huge numbers of documents and extract information that may help them better understand the behavior of terrorist groups and anticipate their actions. In a completely different field, molecular biologists need to quickly analyze vast numbers of scientific publications, looking for protein-protein interactions that may have been discovered by their colleagues in other universities.

This seminar will provide an overview of the problems addressed in Information Extraction and Integration, current solutions, current and future applications, and will assess the state of the art and its potential for future progress.

We will discuss a number of sub-problems in IE, including multi-linguality, syntactic parsing, named entity classification, co-reference resolution, segmentation of text streams, classification of segments into fields, association of fields into records, and clustering and de-duplication of records. We will then survey a variety of techniques that have been used to solve these problems, including use of finite state machines and context-free grammars, generative and conditional models, rule-learning and Bayesian techniques. Finally, we will present different exciting applications that embed IE as a major component and students will be given the opportunity to get hands-on experience with the prototype of a real IE system.
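To make a few of the sub-problems listed above more concrete, here is a minimal Python sketch of segmentation, classification of segments into fields, association of fields into records, and de-duplication. It assumes a made-up comma-separated text stream and uses simple hand-written rules; the function names and the sample data are hypothetical, and this is not the method of any particular system covered in the seminar.

```python
import re

def segment(line):
    """Segmentation: split one raw line of the stream into candidate segments."""
    return [part.strip() for part in line.split(",") if part.strip()]

def classify(seg):
    """Classification: assign a segment to a field using illustrative rules."""
    if re.fullmatch(r"(19|20)\d{2}", seg):
        return "year"
    if re.fullmatch(r"[\w.+-]+@[\w-]+(\.[\w-]+)+", seg):
        return "email"
    return "name"  # fallback field for anything else

def to_record(line):
    """Association: group the classified fields of one line into a record."""
    return {classify(seg): seg for seg in segment(line)}

def deduplicate(records):
    """De-duplication: keep the first record for each normalized (name, email) key."""
    seen, unique = set(), []
    for rec in records:
        name = " ".join(rec.get("name", "").lower().split())
        key = (name, rec.get("email", "").lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Made-up input: the second line is a differently formatted duplicate of the first.
stream = [
    "Ada Lovelace, ada@example.org, 2009",
    "ada  lovelace, ADA@example.org, 2009",
    "Alan Turing, alan@example.org, 2008",
]

records = deduplicate([to_record(line) for line in stream])
for rec in records:
    print(rec)
```

The learned models mentioned above (generative and conditional models, rule learning, Bayesian techniques) would replace the hand-written rules in classify and the exact-match key in deduplicate.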

Grading will be based on short quizzes (once every two weeks), a short presentation, and a final paper. The quizzes will cover topics discussed in the previous two weeks and will be a useful tool to assess the progress of each student. The short presentation will be given towards the last week of class: students will form groups of 3-4, investigate a topic related to the class themes, and present their findings. The final paper will cover one of the sub-problems in IE presented in the seminar, analyzing the possible real-world scenarios where a solution to that problem is highly desired.

Readings will include various research papers and online resources provided by the instructor during classes. Sample readings: "Information Extraction" by J. Cowie and W. Lehnert; "T-REX: A System for Automated Cultural Information Extraction" by M. Albanese and V.S. Subrahmanian.

Prerequisite: Students who enroll in this course should be familiar with terms such as "search engine", "database", "database record", and "algorithm", but no detailed or specific knowledge is required.

CORE: Interdisciplinary and emerging issues [IE]