Saturday, November 15, 2014

Watson! - Deducing IBM's Answer to Jeopardy

Watson, IBM's question-answering computer, is best known for its extraordinary performance on the popular TV show Jeopardy. More recently, it has been in the news for its applications in financial services and healthcare. Watson was built as part of the DeepQA project, an overview of which was published in a 2010 paper. That paper was part of an AI course I took in Fall 2013. Below is a quick summary of how Watson solves Jeopardy questions, written (rather hastily) as part of an assignment.

Building Watson: An Overview of the DeepQA Project

Question Answering technology can help professionals make timely and sound decisions in areas such as compliance, law, and healthcare. With this in mind, Ferrucci et al. started a project to build Watson, a QA computer system that would compete on Jeopardy. Jeopardy is a TV competition that poses natural language questions from a broad domain to its competitors. To answer a question and earn points, a competitor must be accurate, confident, and fast. To compete at human level, Watson would have to answer around 70% of the questions asked, in less than 3 seconds each, with 80% precision.

To meet this goal, Watson incorporates QA technologies such as parsing, question classification, and knowledge representation. Watson's decision to press or not press the buzzer depends on its confidence in the answer it has generated. Since most Jeopardy questions are complex, different components of a question need to be evaluated separately; to calculate the confidence for an answer, each component's confidence is combined into a single combined confidence for that answer. Confidence estimation is thus a central part of QA technology.
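The buzz-or-pass decision described above can be sketched in a few lines. This is a hypothetical illustration, not Watson's code: DeepQA combines component confidences with a trained model, whereas the weighted average below is just a simple stand-in.

```python
# Toy sketch: merge per-component confidences into one answer confidence,
# then decide whether to press the buzzer. The weighted average is a
# placeholder for DeepQA's trained confidence-combination model.

def combined_confidence(component_scores, weights=None):
    """Merge individual component confidences into one score in [0, 1]."""
    if weights is None:
        weights = [1.0] * len(component_scores)
    total = sum(w * s for w, s in zip(weights, component_scores))
    return total / sum(weights)

def should_buzz(confidence, threshold=0.5):
    """Press the buzzer only when confidence clears the threshold."""
    return confidence >= threshold

score = combined_confidence([0.9, 0.6, 0.75])
print(round(score, 2), should_buzz(score))
```

Raising the threshold makes the system more conservative: it buzzes in less often but is right more often when it does, which is exactly the precision tradeoff discussed under "The metrics" below.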

An answer to a Jeopardy question might not be generated from the question as a whole. Subclues can come from two different parts of the question sentence, or one subclue can be nested within another. Similarly, puzzles in the question can be categorized for special processing.

For Watson, certain kinds of Jeopardy questions were excluded: those with audio-visual components were considered out of scope.

The domain: A Lexical Answer Type (LAT) is a word in the clue that indicates the type of the answer. In a sample of 20,000 questions, 2,500 distinct LATs were found, and it is worth mentioning that about 12% of the questions had no distinct LAT at all. The 200 most frequent LATs covered less than 50% of the questions. Since determining the LAT for a question is a complex process, a system like Watson must have sound Natural Language Processing and Knowledge Representation and Reasoning technology.

The metrics: In playing Jeopardy, factors such as betting strategy and luck of the draw matter as well, but for building Watson the main factors investigated are speed, confidence, and correctness. Precision is the number of correct answers divided by the number of questions Watson chose to answer. Whether Watson attempts a question depends on a confidence threshold: any confidence score above the threshold means Watson presses the buzzer to answer. This threshold controls the tradeoff between precision and percent answered. The data collected show that human winners usually achieve 85%-95% precision while answering 40%-50% of the questions.
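The precision vs. percent-answered tradeoff can be made concrete with a small computation. This is an illustrative sketch with made-up evaluation data, not anything from the paper: each question is a (confidence, was_correct) pair, and sweeping the threshold trades coverage for precision.

```python
# Sketch: at a given confidence threshold, compute precision over the
# questions attempted and the fraction of all questions attempted.

def precision_and_coverage(results, threshold):
    """results: list of (confidence, was_correct) pairs."""
    attempted = [correct for conf, correct in results if conf >= threshold]
    if not attempted:
        return 0.0, 0.0
    precision = sum(attempted) / len(attempted)
    coverage = len(attempted) / len(results)
    return precision, coverage

# Hypothetical evaluation run over five questions.
runs = [(0.95, True), (0.80, True), (0.70, False), (0.40, True), (0.20, False)]
print(precision_and_coverage(runs, threshold=0.6))
```

At threshold 0.6 the system attempts three of five questions and gets two right; lowering the threshold raises coverage but lets in the low-confidence misses.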

Watson uses the DeepQA technology, a massively parallel, probabilistic, evidence-based architecture. More than 100 techniques are used for analyzing natural language, collecting sources, and finding, generating, and ranking hypotheses. The main principles implemented are: massive parallelism for exploring multiple interpretations, combination of probabilistic evidence from a broad domain, contribution of every component of a question towards its confidence, and use of both strict and shallow semantics.

Content Acquisition: This is the process of collecting the content to be used for answer generation; it combines a manual step and an automatic one. First, the question is analyzed to find out what type of answer is required (manual) and, if possible, the category of the question, often characterized by a LAT (automatic). The sources of content are literary works, dictionaries, encyclopedias, etc. After question evaluation, content evaluation begins, in four high-level steps: identify seed source documents from the web; extract text nuggets from those documents; score each nugget on how informative it is with respect to its source document; and merge the nuggets into a single result to be evaluated.
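The four content-evaluation steps can be sketched as a toy pipeline. Everything here is an assumption for illustration: the paper does not specify the informativeness measure, so simple vocabulary overlap with a set of topic terms stands in for it.

```python
# Toy sketch of: extract nuggets -> score nuggets -> merge the best ones.
# The overlap-based score is a placeholder for the real informativeness
# measure, which the paper leaves unspecified.

def extract_nuggets(document):
    """Split a seed document into sentence-sized text nuggets."""
    return [s.strip() for s in document.split(".") if s.strip()]

def score_nugget(nugget, topic_terms):
    """Score a nugget by its vocabulary overlap with the topic terms."""
    words = set(nugget.lower().split())
    return len(words & topic_terms) / len(words)

def merge_top_nuggets(documents, topic_terms, keep=2):
    """Pool nuggets from all documents and merge the best-scoring ones."""
    nuggets = [n for doc in documents for n in extract_nuggets(doc)]
    nuggets.sort(key=lambda n: score_nugget(n, topic_terms), reverse=True)
    return " ".join(nuggets[:keep])
```

A usage example: given the document "Paris is the capital of France. It rains often." and topic terms {"paris", "france", "capital"}, the first sentence outscores the second and survives the merge.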

Question Analysis: Question analysis uses many techniques, chiefly question classification, LAT detection, relation detection, and decomposition. Question classification identifies the type of a question from its words or phrases, considering their syntactic, semantic, or rhetorical function; example question types are puzzles, math questions, and definition questions. LAT detection involves determining whether the answer to a question can be an instance of a LAT; this can be done by replacing a component of the question with the candidate LAT and then gathering evidence for the correctness of that LAT candidate. Similarly, many Jeopardy questions contain relations, such as syntactic subject-verb-object predicates or semantic relationships. Watson uses relation detection throughout its QA process, from LAT detection to answer scoring. Because of Jeopardy's large knowledge domain, detecting the most frequent relations helps in only about 25% of the questions.
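The LAT-checking idea reduces to a type-membership test once evidence has been gathered. The sketch below is a deliberately tiny stand-in: the hand-made `TYPE_DB` dictionary replaces Watson's actual evidence-gathering over large sources.

```python
# Toy sketch of LAT checking: given a clue's LAT (e.g. "country") and a
# candidate answer, look the candidate up in a tiny hand-made type
# dictionary standing in for real evidence gathering.

TYPE_DB = {
    "france": {"country", "nation"},
    "paris": {"city", "capital"},
    "voltaire": {"writer", "philosopher"},
}

def matches_lat(candidate, lat):
    """Return True if evidence supports the candidate being of type `lat`."""
    return lat in TYPE_DB.get(candidate.lower(), set())

print(matches_lat("France", "country"))  # True
print(matches_lat("Paris", "country"))   # False
```

In the real system this check is probabilistic rather than boolean: the gathered evidence contributes a type-match score to the candidate's overall confidence.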

Hypothesis generation: After question analysis, the QA system generates candidate answers by searching its sources and generating answer-sized snippets. This stage consists of two processes: primary search and candidate answer generation. The goal of primary search is to find as much answer-bearing content as possible. Many search techniques are used, mainly multiple text-search engines, document search as well as passage search, and knowledge-base search using SPARQL.
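The fan-out-and-pool shape of primary search can be sketched as follows. This is a toy illustration: the term-overlap match over lists of strings stands in for real text-search engines, passage search, and SPARQL queries.

```python
# Illustrative primary-search sketch: run the query against several toy
# "sources" and pool the answer-bearing passages into one candidate list.

def search_source(source, query_terms):
    """Return passages from one source that share a term with the query."""
    return [p for p in source if query_terms & set(p.lower().split())]

def generate_candidates(sources, question):
    """Run primary search over every source and pool the snippets."""
    query_terms = set(question.lower().split())
    candidates = []
    for source in sources:
        candidates.extend(search_source(source, query_terms))
    return candidates
```

Because recall matters more than precision at this stage, the pool is intentionally permissive; the soft-filtering step described next is what trims it down.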

Soft filtering: After candidate generation, lightweight (less resource-intensive) algorithms prune the large candidate set down to a smaller one, usually capped at a threshold number of candidates. Candidates that are pruned skip the heavier evaluation and are taken directly to the final merging step.
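A minimal sketch of that pruning step, under the assumption that some cheap scoring function is available (the real soft filter uses lightweight features such as search rank):

```python
# Sketch of soft filtering: a cheap score prunes the candidate pool to a
# fixed-size set before expensive evidence scoring. Both the kept and the
# pruned candidates are returned, since the pruned ones still reach the
# final merging step.

def soft_filter(candidates, cheap_score, limit=100):
    """Keep the `limit` best candidates by a lightweight score."""
    ranked = sorted(candidates, key=cheap_score, reverse=True)
    return ranked[:limit], ranked[limit:]
```

Returning the pruned tail separately reflects the pipeline described above: those candidates bypass evidence scoring but are not discarded outright.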

Hypothesis and evidence scoring: After soft filtering, the surviving candidate answers go through a deeper evaluation consisting of two main processes: evidence retrieval and scoring. Evidence retrieval is the process in which content that provides evidence for a candidate answer is gathered, using search techniques such as passage search with a question component plus the candidate answer as the query. Scoring: after evidence gathering, each candidate answer is given confidence scores based on different factors. Watson employs more than 50 scoring components, ranging from formal probabilities to categorical features, which use logical, geospatial, and relational reasoning. After these scores have been generated, they are merged for each candidate answer, and the merged scores are then ranked by a system that has been trained on questions whose answers are known.
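The merge-and-rank step above can be sketched as follows. The fixed weight vector is an assumption standing in for the trained ranking model, and the candidate scores are made up for illustration.

```python
# Sketch of the final merge-and-rank step: each candidate carries scores
# from many scorers; a weight vector (a stand-in for the trained model)
# merges them, and candidates are sorted by the merged score.

def merge_scores(scorer_outputs, weights):
    """Weighted combination of one candidate's per-scorer outputs."""
    return sum(w * s for w, s in zip(weights, scorer_outputs))

def rank_candidates(candidates, weights):
    """candidates: {answer: [score_from_each_scorer, ...]}"""
    merged = {a: merge_scores(s, weights) for a, s in candidates.items()}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_candidates(
    {"Paris": [0.9, 0.8], "Lyon": [0.4, 0.5]},
    weights=[0.6, 0.4],
)
print(ranking[0][0])  # Paris
```

The top-ranked candidate's merged score is what feeds back into the buzz-or-pass threshold decision described earlier.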





