Watson, an IBM-built Question Answering computer, is best known for its extraordinary performance on the popular TV show Jeopardy. More recently, it has been in the news for its applications in financial services and healthcare. Watson was built as part of the DeepQA Project, and an overview of the project was published in a 2010 paper. This paper was part of an AI course that I took in Fall 2013. Below is a quick summary of how Watson solves Jeopardy questions that I wrote (rather hastily) as part of an assignment.
Building Watson: An Overview of the
DeepQA Project
Question Answering technology can help
professionals make timely and sound decisions in areas such as
compliance, legal, and healthcare. With this technology in mind,
Ferrucci et al. started working on a project to build Watson, a QA
computer system that would compete on Jeopardy. Jeopardy is a TV
competition that poses natural language questions from a broad domain
to its competitors. To answer a question and earn points, a competitor
must be accurate, confident, and fast. To compete at human level,
Watson would have to be able to answer around 70% of the questions
asked, each in less than 3 seconds, with 80% precision.
To meet this goal, Watson incorporates
QA technologies such as parsing, question classification, and
knowledge representation. Watson's decision to press or not press the
buzzer to answer a question depends on its confidence in the answer it
has generated. Since most questions in Jeopardy are complex, different
components of a question need to be evaluated separately. Hence, to
calculate the confidence for an answer, each component's confidence is
combined into the answer's overall confidence. Confidence estimation
is an important part of QA technology.
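To make the idea concrete, here is a minimal sketch of combining per-component confidences into one answer confidence. The paper does not specify the formula; a weighted average is purely my assumption for illustration.

```python
# Hypothetical sketch: merge the confidence scores of a question's
# sub-components (each in 0..1) into a single answer confidence.
# The weighted average below is an illustrative assumption, not
# the actual DeepQA combination model.

def combined_confidence(component_scores, weights=None):
    """Weighted average of per-component confidence scores."""
    if weights is None:
        weights = [1.0] * len(component_scores)
    total = sum(w * s for w, s in zip(weights, component_scores))
    return total / sum(weights)

# e.g. two subclues scored independently
print(combined_confidence([0.9, 0.6]))  # 0.75
```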
An answer to a Jeopardy question might
not be generated from the question as a whole. Subclues could come
from two different parts of the question sentence, or one subclue
could be hidden within another. Similarly, puzzles in a question can
be categorized for special processing.
For Watson, certain kinds of Jeopardy
questions were excluded: questions with audiovisual components are
outside the project's scope.
The domain: A Lexical Answer
Type (LAT) is a word in the clue that indicates the type of the
answer. In a sample of 20,000 questions, about 2,500 distinct LATs
were found, and about 12% of the questions had no explicit LAT at
all. The 200 most frequent LATs covered less than 50% of the
questions. Since figuring out the LAT of a question is a complex
process, a system like Watson must have sound Natural Language
Processing and Knowledge Representation and Reasoning technology.
The metrics: In playing Jeopardy,
factors such as betting strategy and luck of the draw matter as well,
but for building Watson the main factors investigated are speed,
confidence, and correctness. Precision is the number of correct
answers out of the number of questions Watson chose to answer.
Whether Watson chooses to answer a question depends on a confidence
threshold: any confidence score above the threshold means Watson
presses the buzzer to answer the question. This threshold controls
the tradeoff between precision and percent answered. Collected data
show that human winners usually achieve 85%-95% precision while
answering 40%-50% of the questions.
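The threshold tradeoff can be illustrated with a toy computation. The attempts below are made-up numbers, not real Jeopardy data; raising the threshold tends to raise precision while lowering the percent answered.

```python
# Toy illustration of the precision vs. percent-answered tradeoff.
# Each attempt is (confidence, was_correct); the threshold decides
# whether Watson "buzzes in". The data is invented for illustration.

def precision_and_coverage(attempts, threshold):
    """Return (precision, fraction of questions answered) at a threshold."""
    answered = [(c, ok) for c, ok in attempts if c >= threshold]
    if not answered:
        return 0.0, 0.0
    precision = sum(ok for _, ok in answered) / len(answered)
    coverage = len(answered) / len(attempts)
    return precision, coverage

attempts = [(0.95, True), (0.80, True), (0.70, False), (0.40, False), (0.30, True)]
print(precision_and_coverage(attempts, 0.75))  # (1.0, 0.4)
```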
Watson uses the DeepQA technology,
which is a massively parallel, probabilistic, evidence-based
architecture. More than 100 techniques are used for analyzing natural
language, collecting sources, and finding, generating, and ranking
hypotheses. The main principles implemented are: massive parallelism
for considering multiple interpretations, combination of probabilistic
evidence from a broad domain, contribution of all components of a
question towards its confidence, and integration of shallow and deep
knowledge.
Content Acquisition: This is the
process of collecting the content to be used for answer generation;
it combines a manual process and an automatic one. First, it is
important to analyze example questions to find out what type of
answer is required (manual), and if possible the category of the
question, possibly characterized by a LAT (automatic). The sources of
content are literary works, dictionaries, encyclopedias, etc. After
question evaluation, content evaluation begins in four high-level
steps: identify seed source documents from the web, extract text
nuggets from the documents, score each nugget on how informative it
is with respect to its source document, and merge the nuggets into
one result to be evaluated.
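The four steps above can be sketched as a tiny pipeline. The word-overlap scoring heuristic here is my own stand-in; the paper does not describe its actual nugget-scoring method.

```python
# Hypothetical sketch of the four content-acquisition steps:
# seed documents in, extracted/scored/merged text nuggets out.
# The overlap-based score is an illustrative assumption.

def score_nugget(nugget, source_text):
    """Toy informativeness score: word overlap with the source document."""
    nugget_words = set(nugget.lower().split())
    source_words = set(source_text.lower().split())
    return len(nugget_words & source_words) / max(len(nugget_words), 1)

def acquire_content(documents):
    """Extract sentences as nuggets, score them, and merge into one ranked list."""
    nuggets = []
    for doc in documents:                      # step 1: seed documents
        for sentence in doc.split("."):        # step 2: extract nuggets
            sentence = sentence.strip()
            if sentence:
                nuggets.append((score_nugget(sentence, doc), sentence))  # step 3: score
    nuggets.sort(reverse=True)                 # step 4: merge into one ranked result
    return [text for _, text in nuggets]
```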
Question Analysis: The process
of question analysis uses many techniques, mainly question
classification, LAT detection, relation detection, and decomposition.
Question classification uses words or phrases to identify the type of
a question, considering the phrases' syntactic, semantic, or
rhetorical function. Some examples of question types are puzzles,
math questions, and definition questions. LAT detection involves
determining whether the answer to a question can be an instance of a
LAT. This can be done by replacing a component of the question with
the LAT and then gathering evidence for the correctness of the given
LAT candidate. Similarly, many questions in Jeopardy contain
relations, such as syntactic subject-verb-object predicates or
semantic relationships, and Watson uses relation detection in many of
its QA processes, from LAT detection to answer scoring. Because of
the large knowledge domain in Jeopardy, detecting the most frequent
relations helps in only about 25% of the Jeopardy questions.
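The substitution idea behind LAT validation can be sketched in a few lines. The tiny hand-made corpus and substring check below are illustrative assumptions, far simpler than Watson's evidence gathering.

```python
# Toy sketch of LAT validation by substitution: replace the LAT
# phrase in the clue with a candidate answer and check whether the
# resulting statement is supported by a (tiny, hand-made) corpus.

CORPUS = [
    "paris is the capital of france",
    "rome is the capital of italy",
]

def supports_lat(clue, lat, candidate):
    """Substitute the candidate for the LAT phrase and look for evidence."""
    statement = clue.lower().replace(lat.lower(), candidate.lower())
    return any(statement in doc for doc in CORPUS)

print(supports_lat("this city is the capital of france", "this city", "Paris"))  # True
```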
Hypothesis generation: After
question analysis, the QA system generates candidate answers by
searching the system's sources and generating answer-sized snippets.
This stage consists of two processes: primary search and candidate
answer generation. The goal of primary search is to find as much
answer-bearing content as possible. Many search techniques are used,
mainly multiple text-search engines, document search as well as
passage search, and knowledge base search using SPARQL.
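A minimal sketch of the primary-search step: rank passages by keyword overlap with the question and return the top hits as candidate-bearing content. Real DeepQA combines multiple engines; this bag-of-words scorer is only an assumption for illustration.

```python
# Sketch of primary search: score each passage by overlap with the
# question's words and return the top-k as answer-bearing content.
# The overlap scorer stands in for Watson's multiple search engines.

def primary_search(question, passages, top_k=3):
    """Return the top_k passages ranked by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = []
    for p in passages:
        overlap = len(q_words & set(p.lower().split()))
        scored.append((overlap, p))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:top_k]]
```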
Soft filtering: After candidate
generation, lightweight (less resource-intensive) algorithms prune
the large set of candidates down to a smaller one, usually capped at
a threshold number of candidates. Candidates that are pruned here
skip the deeper analysis and are taken directly to the final merging
step.
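The routing can be sketched as a simple split on a cheap score. The scoring function and threshold are illustrative assumptions, not DeepQA's actual soft-filtering model.

```python
# Sketch of soft filtering: a cheap score routes each candidate
# either into deep evidence scoring or directly to final merging.
# cheap_score and the threshold value are illustrative assumptions.

def soft_filter(candidates, cheap_score, threshold=0.5):
    """Split candidates into (deep analysis, direct to final merging)."""
    deep, direct_to_merge = [], []
    for cand in candidates:
        if cheap_score(cand) >= threshold:
            deep.append(cand)
        else:
            direct_to_merge.append(cand)
    return deep, direct_to_merge
```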
Hypothesis and evidence scoring:
After soft filtering, the remaining candidate answers go through a
stronger evaluation process. This step consists of two main
processes: evidence retrieval and scoring. Evidence retrieval is the
process in which content that provides evidence for a candidate
answer is gathered, using search techniques such as passage search
with a question component plus the candidate answer as the query.
Scoring: after evidence gathering, each candidate answer is given
confidence scores based on different factors. Watson itself employs
more than 50 scoring components that range from formal probabilities
to categorical features. These scorers use logical, geospatial, and
relational reasoning, among others. After these scores have been
generated, they are merged for each candidate answer. The merged
scores are then ranked by a system that has been trained on questions
whose answers are known.
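The final merge-and-rank step can be sketched as a weighted combination of scorer outputs. The weight vector stands in for whatever the trained ranker learns offline; both the weights and the candidate scores below are invented for illustration.

```python
# Sketch of final merging and ranking: each candidate carries a
# vector of scorer outputs; a weight vector (assumed to have been
# learned offline on questions with known answers) merges them, and
# candidates are ranked by the merged score.

def rank_candidates(candidates, weights):
    """candidates: list of (answer, [score1, score2, ...]) pairs."""
    merged = [
        (sum(w * s for w, s in zip(weights, scores)), answer)
        for answer, scores in candidates
    ]
    merged.sort(reverse=True)  # highest merged confidence first
    return [(answer, round(total, 3)) for total, answer in merged]

cands = [("Toronto", [0.2, 0.9]), ("Chicago", [0.8, 0.7])]
print(rank_candidates(cands, [0.6, 0.4]))  # [('Chicago', 0.76), ('Toronto', 0.48)]
```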