Tuesday, September 1, 2015

Complexity Analysis of Uninformed Tree Search

Now that we have covered the three most popular uninformed search algorithms, let's look at how they compare against each other when it comes to complexity. The decision to pick one algorithm over another depends on whether you care more about time or space.

First, let's get a few terms out of the way:

Completeness: Will the algorithm eventually reach the goal node (if one exists)?
Space: How much memory does the algorithm use? Measured as the maximum number of nodes held in memory at once.
Optimality: Will it find the optimal solution (the cheapest path to the goal node)?
Time: How many nodes does the algorithm generate?
Step Cost: The cost of moving from a node to one of its successors, i.e. the edge cost between two consecutive nodes on a path.

Now, let's look at each algorithm individually:

1. Breadth First Search
  • Completeness: Yes. If the branching factor is finite, the algorithm will eventually reach the goal, since it expands every node at each level/depth of the graph before moving deeper.
  • Time: O(b^d), where b is the branching factor (the number of children each node generates) and d is the depth of the goal node from the source node.
  • Space: O(b^d), because every generated node is kept in memory; the frontier alone can hold on the order of b^d nodes (a short sketch follows this list).
  • Optimality: No, in general. BFS expands nodes in order of increasing depth, so unless the path cost is a non-decreasing function of depth (for example, all step costs are equal), BFS does not find the cheapest path to the goal node.
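
To make the bullets above concrete, here is a minimal BFS sketch in Java. The Node type, its int value field, and its getChildren() method are assumptions here (modeled on the wrapper class used in the Uniform Cost Search post further down); this is a sketch, not a tuned implementation.

public Node breadthFirstSearch(Node root, Node goal) {
    Queue<Node> frontier = new LinkedList<Node>();   // FIFO: shallowest nodes come out first
    Set<Integer> visited = new HashSet<Integer>();   // node values we have already generated
    frontier.add(root);
    visited.add(root.value);

    while (!frontier.isEmpty()) {
        Node current = frontier.poll();
        if (current.value == goal.value) return current;   // first hit is at the shallowest depth
        for (Node child : current.getChildren()) {
            if (visited.add(child.value)) {          // add() returns false if already seen
                frontier.add(child);                 // every generated node is kept in memory,
            }                                        // which is where the O(b^d) space goes
        }
    }
    return null;                                     // no path to the goal
}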

Space is a bigger problem for BFS. Just take a look at the table below where both time and space costs are shown for a tree with branching factor (b) = 10.



2. Depth First Search
  • Completeness: No, not in general. DFS can get lost down a branch of infinite depth, or cycle forever if the search space contains a loop. The algorithm can be modified to record repeated graph states and avoid getting stuck in a loop.
  • Time: O(b^m), where m is the maximum depth of any node and b is the branching factor (the number of children generated by each node).
  • Space: O(b*m). The space complexity is linear, which makes DFS space friendly: the only nodes kept in memory are those along the current vertical path, plus the unexpanded siblings of nodes on that path (a short sketch follows this list). In the image below, each shade of blue represents one path/iteration; only the nodes in that particular iteration are kept in memory.

  • Optimality: No. DFS returns the first solution it happens to find, which may be neither the shallowest nor the cheapest.
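
For completeness, here is the matching DFS sketch under the same assumptions (a hypothetical Node with a unique int value and a getChildren() method). The explicit stack is what keeps the memory footprint linear.

public Node depthFirstSearch(Node root, Node goal) {
    Deque<Node> stack = new ArrayDeque<Node>();      // LIFO: keep diving down one branch
    Set<Integer> visited = new HashSet<Integer>();   // guards against loops in the search space
    stack.push(root);

    while (!stack.isEmpty()) {
        Node current = stack.pop();
        if (current.value == goal.value) return current;
        if (!visited.add(current.value)) continue;   // already expanded; skip the repeat
        for (Node child : current.getChildren()) {
            if (!visited.contains(child.value)) {
                stack.push(child);                   // only the current path and its siblings
            }                                        // ever sit on the stack: O(b*m) space
        }
    }
    return null;                                     // goal not found (or an infinite branch)
}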

3. Uniform Cost Search (Cheapest First)
  • Completeness: Yes, provided every step cost is at least some small positive constant e (step cost >= e > 0), so the algorithm cannot get stuck expanding an endless sequence of zero-cost steps.
  • Time: O(b^(c*/e)), where c* is the cost of the optimal solution and e is the minimum step cost.
  • Space: O(b^(c*/e)), for the same reason as BFS: the frontier of nodes waiting to be expanded can grow exponentially.
  • Optimality: Yes. UCS expands nodes in order of increasing path cost from the root node, so the first time it expands the goal node it has found the cheapest path to it.

Saturday, August 15, 2015

Uniform Cost Search

Now that we've covered the two most popular uninformed search algorithms, Breadth First Search (BFS) and Depth First Search (DFS), it's time for one more uninformed graph search algorithm: Uniform Cost Search (UCS). While BFS and DFS use a node's position to decide the order of expansion, UCS uses a node's cost to decide which node gets expanded next.

UCS works by maintaining a priority queue of un-visited nodes, ordered in ascending order by a cost/distance function g(n). Each node therefore carries a cost/distance value that represents its distance from the source.

Assume a root node n with distance/cost 0, a goal node g, and a priority queue Q with only the root node in it.

From Q, start with the root node. Expand its children and insert every un-visited child node into Q, where each child's distance/cost is the parent's distance/cost plus the edge distance between the parent and the child. If a child node already exists in Q and the newly calculated distance is smaller than the distance recorded for it in Q, update the existing entry with the new distance. Once all child nodes have been inserted, take the node with the lowest distance/cost from Q and repeat the expansion until the goal node g is encountered.

Below is high-level code for uniform cost search:



// Uses java.util.PriorityQueue, HashMap and Comparator. getChildren() is assumed to return
// fresh child Node objects, and edgeCost(parent, child) -- the distance of the edge between
// the two nodes -- is assumed to be provided by the graph representation.
public Node uniformCostSearch(Node root, Node goal) {

    // nodes come out of the queue in order of increasing distance from the root
    PriorityQueue<Node> q = new PriorityQueue<Node>(Comparator.comparingInt((Node n) -> n.distance));
    Map<Integer, Integer> bestDistance = new HashMap<Integer, Integer>(); // cheapest distance seen, per node value
    root.distance = 0;
    bestDistance.put(root.value, 0);
    q.add(root);

    while (!q.isEmpty()) {
        Node temp = q.poll();                                         // un-visited node with the lowest cost
        if (temp.distance > bestDistance.get(temp.value)) continue;   // stale entry; a cheaper copy was queued later
        if (temp.value == goal.value) return temp;                    // cheapest path to the goal found

        for (Node child : temp.getChildren()) {
            // child's cost = parent's cost + edge distance between parent and child
            int newDistance = temp.distance + edgeCost(temp, child);
            Integer known = bestDistance.get(child.value);
            if (known == null || newDistance < known) {               // found a cheaper path to this child
                child.distance = newDistance;
                bestDistance.put(child.value, newDistance);
                q.add(child);                                         // instead of updating in place, re-insert;
            }                                                         // the older, costlier entry is skipped above
        }
    }
    return null;                                                      // goal is unreachable from the root
}


And here's the wrapper Node class to store a node's distance and value:

private class Node{
    int distance;   // cost of the cheapest known path from the source to this node
    int value;      // identifier / payload for the node

    private Node(int v){
        this.value = v;
    }
}

I have not implemented the getChildren() method (or the edgeCost helper used above), but both should be easy to implement by traversing the links a node has and reading their edge weights.

Sunday, June 14, 2015

Getting to know Siri: Dynamic Time Warping algorithm for speech recognition

Siri is synonymous with that familiar voice in our iPhones today, but until recently I did not know how Siri and other interactive voice applications/programs work. Determined to rectify the situation, I went hunting for answers.

A very high level approximation for the process can be described in the process flow below:



For Step 1 different speech recognition algorithms (of which Dynamic Time Warping is one) can be used. Step 2 requires natural language processing. In this post, I focus on the first step, and more specifically on the DTW algorithm.

The basic premise is that in order to recognize an input from the user/speaker, you need to compare the input with several "templates". A template is basically a feature sequence of syllables/letters that represents a word and is compared with the input sequence the speaker provides. The DTW algorithm is used to get a good input-template alignment. For example, a good template - input alignment would look like this: 





Technically speaking, we have two time sequences:
Q = q1, q2, ..., qi, ..., qn
C = c1, c2, ..., cj, ..., cm

If Q represents the template time sequence and C represents the input time sequence, we can construct an n-by-m matrix with C (the input) on the x-axis and Q (the template) on the y-axis.





The (i-th, j-th) element of the matrix corresponds to the squared distance (qi - cj)^2. Hence, to find the best match / a good alignment, the problem becomes finding a path through the matrix that minimizes the cumulative distance along the path.

The cumulative path distance is also called the Warping Cost, and the problem of minimizing it can be written as:

DTW(Q, C) = min over all warping paths W of sqrt( w1 + w2 + ... + wK )

where wk is the matrix element (i, j)k that forms the kth element of a warping path W. The problem can be solved using dynamic programming, a technique that solves a problem by breaking it into smaller overlapping subproblems and reusing their solutions.

Going back to the optimal path: in DTW there are only three routes the path can take out of a given frame/cell, namely one step along the input, one step along the template, or one step along both (the diagonal):



Given these three allowed routes from a frame, we could in principle list all the possible paths through the matrix, and the path minimizing the distance between the input sequence and the template sequence is the optimal path. Dynamic programming avoids the explicit enumeration: each cell only needs to look at its three allowed predecessors, so finding the optimal path costs O(3*m*n) = O(m*n). A minimal sketch follows.
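
Here is a hedged Java sketch of that dynamic program. dtw[i][j] holds the minimum cumulative squared distance of any warping path ending at cell (i, j); the template q and the input c are assumed to already be numeric feature sequences.

public static double dtwCost(double[] q, double[] c) {
    int n = q.length, m = c.length;
    double[][] dtw = new double[n + 1][m + 1];
    for (double[] row : dtw) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
    dtw[0][0] = 0.0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            double d = (q[i - 1] - c[j - 1]) * (q[i - 1] - c[j - 1]);   // squared distance
            // the three allowed predecessors: one step up, one step left, or the diagonal
            double best = Math.min(dtw[i - 1][j], Math.min(dtw[i][j - 1], dtw[i - 1][j - 1]));
            dtw[i][j] = d + best;
        }
    }
    return Math.sqrt(dtw[n][m]);   // warping cost of the optimal path
}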




The final step is to stack several templates up against the input sequence. The template whose optimal path has the lowest distance (Warping Cost) is the best-matching template, and the word corresponding to that template is what gets passed on to Siri's language-processing step (a small sketch of this matching loop follows).
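
And, to tie it together, a sketch of that final matching step. It reuses the dtwCost method from the sketch above and assumes the templates are stored in a map from word to feature sequence.

public static String recognize(double[] input, Map<String, double[]> templates) {
    String bestWord = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (Map.Entry<String, double[]> entry : templates.entrySet()) {
        double cost = dtwCost(entry.getValue(), input);   // warping cost against this template
        if (cost < bestCost) {
            bestCost = cost;
            bestWord = entry.getKey();
        }
    }
    return bestWord;   // the word handed off to the language-processing step
}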



Saturday, February 7, 2015

Data Analysis - Back to the Basics

Regression -- the simplest form of machine learning. Speaking of simple things, oftentimes the simplest solution is the best solution. This sample case study revolves around the problem of predicting user engagement with an advertisement.

Data Prep

Feature Selection

Now there are many things that can affect a user's decision to engage with an advertisement. So, which of these variables do you pick when doing an analysis?

A common way to simplify the problem is to group the variables into a few bigger feature categories. For example, you could perform two different analyses: one on the effect of the user's available demographic information and another on the effect of the user's past browsing history. Even within these analyses, one can get confused about how many sub-variables to keep. It's a common problem in data science, one that's called feature selection. One way to select only the significant features is to run a simple random-forest model on all variables. Many random-forest implementations available online (such as sklearn's, via its feature_importances_ attribute) give a numeric importance score to each feature. You can then select the most significant features to perform your analysis on.

Data Balancing

Anyone who has worked with data in a commercial space knows that real data can be very, very noisy. And in the online world, engagements are a rare event. This is a case of unbalanced data, which does not make a good dataset for modeling. One rule of thumb that I use when dealing with binary categorical data is that the non-rare-event data size (i.e. the number of non-clicks in my data set) should not be greater than 3x the rare-event data size. Hence, if my dataset size is 1 million, I need at least 1/4th of it to be of the rare-event category for good balance (a small down-sampling sketch follows).
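
As an illustration of the 3:1 rule of thumb, here is a hedged Java sketch that down-samples the majority class. Record is a hypothetical row type; clicks holds the rare-event rows and nonClicks the rest (standard java.util collections).

public static List<Record> balance(List<Record> clicks, List<Record> nonClicks, long seed) {
    List<Record> majority = new ArrayList<Record>(nonClicks);
    Collections.shuffle(majority, new Random(seed));           // randomize before cutting
    int keep = Math.min(majority.size(), 3 * clicks.size());   // cap non-clicks at 3x the clicks
    List<Record> balanced = new ArrayList<Record>(clicks);
    balanced.addAll(majority.subList(0, keep));
    Collections.shuffle(balanced, new Random(seed));           // mix the two classes back together
    return balanced;
}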

Training Data Size

A good rule of thumb is to use 80% of your dataset for training the model and 20% for testing. It's a good idea to shuffle the dataset before splitting it into train/test sets, in case the rows are arranged in some particular order (a small sketch follows).
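
A sketch of that shuffle-then-split step, over the same hypothetical Record type:

public static List<List<Record>> trainTestSplit(List<Record> data, long seed) {
    List<Record> shuffled = new ArrayList<Record>(data);
    Collections.shuffle(shuffled, new Random(seed));    // guard against any ordering in the data
    int cut = (int) (shuffled.size() * 0.8);            // 80% of the rows go to training
    List<Record> train = new ArrayList<Record>(shuffled.subList(0, cut));
    List<Record> test  = new ArrayList<Record>(shuffled.subList(cut, shuffled.size()));
    return Arrays.asList(train, test);                  // [training set, test set]
}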

Picking the right model

Knowing the underlying probability

After preparing the data set and ensuring that the right features/variables are used, it is time to pick the right model. One thing that always helps is knowing the underlying probability distribution of the data. In the case of online engagement, the underlying distribution is a binomial distribution whose parameters are defined as follows (the distribution itself is written out after the list):

n = number of message views
p = underlying probability that a view converts into an engagement (extra info: this p itself follows a beta distribution)
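
For reference (this is the standard binomial formula, not something specific to this data set), the probability of seeing exactly k engagements out of n views is then:

P(k engagements out of n views) = C(n, k) * p^k * (1 - p)^(n - k), where C(n, k) = n! / (k! * (n - k)!)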

In this case the output variable is binary categorical (engagement/non-engagement) and the features are all numerical. Hence, it made sense to use logistic regression, which, given numerical features, outputs the probability of a success.

In one particular case, a preliminary run of Statsmodels' GLM (with family=binomial and link=logit) gave me a 62% accuracy rate (evaluated on the aforementioned 20% test set; 80% of the dataset was used to train the model). In this context, an accuracy or AUC score above 0.6 is considered good enough.

When talking about logistic regression, it is important to talk about the threshold value. Logistic regression outputs the probability that a given case will result in a success (an engagement). While testing, I needed to determine the proper probability threshold for declaring a success. It might seem intuitive to say that if the outputted probability is above 0.5, the case can be categorized as a success/engagement. However, that does not always give the best results. Trial and error is the best way to go, and in my case a threshold of 0.45 gave me the highest accuracy (a small threshold-sweep sketch follows).
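
Here is a hedged Java sketch of that trial-and-error search: given the model's predicted probabilities and the true labels (1 = engagement, 0 = no engagement) for the test set, it sweeps candidate thresholds and keeps the one with the highest accuracy.

public static double bestThreshold(double[] probs, int[] labels) {
    double bestThreshold = 0.5;
    double bestAccuracy = -1.0;
    for (int step = 1; step < 20; step++) {
        double threshold = step * 0.05;                 // try 0.05, 0.10, ..., 0.95
        int correct = 0;
        for (int i = 0; i < probs.length; i++) {
            int predicted = probs[i] >= threshold ? 1 : 0;
            if (predicted == labels[i]) correct++;
        }
        double accuracy = (double) correct / probs.length;
        if (accuracy > bestAccuracy) {
            bestAccuracy = accuracy;
            bestThreshold = threshold;                  // keep the best-performing cutoff
        }
    }
    return bestThreshold;
}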

With a little help from my friends: Literature

In most cases, the problem at hand has already been studied by other companies or by educational and research organizations. Becoming aware of the advancements in the problem space is definitely helpful. In the above case, after getting an AUC score of 62%, I was ready to end my analysis. However, I came across this paper from Facebook that suggests a hybrid model: gradient boosted decision trees + logistic regression. To elaborate, Facebook suggests transforming the features using decision trees and then feeding the transformed features into a logistic regression model. Here's a good visualization of the hybrid model:



A good approximate implementation of this hybrid model is xgboost. Using this model, my improved AUC score/accuracy was 0.7 (70%).

Tuning

Just picking the right model does not usually do the trick. For any model, it's important that the parameters that determine how the model trains on the training data are well-tuned. For this, one needs to understand what the parameters mean and how they affect model training. In the case of xgboost, the following were some of the parameter values I used:

max_depth (maximum depth of trees): 6
eta (the learning rate): 0.5
objective: binary:logistic


~~~~~~~~~~~~~~~~~

To conclude, data analysis is not just picking the right model. It's a lot of data cleaning and tuning and testing decisions. There is usually no hard and fast answer to a data modeling problem, and the only way to find the best solution is to iterate and try different things at each step of the modeling.

Thursday, January 8, 2015

Machine Learning: An Intro

Machine Learning (ML) is an area of Artificial Intelligence (AI) that deals with replicating, in the form of algorithms, the learning processes used by humans. Ever since Alan Turing introduced the Turing Test in 1950, a whole new sector of research has emerged around creating artificial intelligence. But moving on to ML, there are many sub-kinds/sub-topics within ML.
  
But what is a learning problem? A learning problem can be formalized as improving an agent's performance at a task T, as measured by a performance measure P, through experience E. For example, learning to play chess is a learning problem with the task (T) of playing chess, a performance measure (P) such as the fraction of games won, and the experience (E) of playing games. The goal of learning is to come up with the best hypothesis (some kind of algorithm or a simple function) that, when used, can perform the task T.

Now that we know what a learning problem is, let's dive into the types of machine learning. Most kinds of learning can be divided into three sub-groups: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Supervised learning uses labeled, structured data sets for training, whereas unsupervised learning does not; labeled data sets have both the input and output variables defined. Reinforcement learning does not use any training data set at all.

Supervised Learning 

 Even within supervised learning, there are two kinds of learning: Inductive and Analytical learning.

Inductive Learning:
- Needs large sets of training data
- Does not require prior background knowledge or rules (often called a domain theory). For example, the following rules could make up a domain theory:
                Cup <= Stable, Liftable, Open vessel
                Stable <= BottomIsFlat
- Basically, statistical inference to fit the training data
- Examples: simple regression, Decision Tree Learning, Neural Networks, Genetic Algorithms

Analytical Learning:
- Does not need large sets of training data
- Requires a domain theory, i.e. background knowledge of the rules of the environment
- Logical deduction to fit the training data and the background knowledge
- Example: Explanation-Based Learning


 
Unsupervised Learning

Unsupervised Learning doesn't use labeled data sets for training the task-performing hypothesis. The most famous example in this group is Clustering.
 
Reinforcement Learning

Reinforcement Learning does not use any training data set. Instead it uses the reward-penalty concept from psychology. Given an agent in a certain state with a certain number of actions available for it to perform, which action should it choose? The agent should choose the sequence of actions that maximizes the cumulative reward at the end.


Wednesday, December 10, 2014

On Buzzwords

If my time at an investment bank has taught me anything, I would say  abbreviations and jargon are the biggest barrier to entry to the financial services industry. Why in the world would anyone say real estate instead of space? Hey Jane, can I put my books on your table? I don't have enough real estate in my drawer. 

Now it seems the tech industry has taken a leaf out of the good ole' wall street's book. Even Peter Thiel recently warned venture capitalists to be wary of buzzwords like 'big data' and 'cloud'. 

So what are the buzzwords in tech today? "Disruptive", "Analytics", "Big Data", "hacking life/problems/laundry/literally anything", "Scalable", "the uber for <x>"... and for the greedy ones among us: SO LO MO. Forget Jesus, apparently SO, LO and MO make the holy trinity in today's startup land:


SO (Social)
 + 
LO (Local)
 + 
MO (Mobile)
=
 $$$$$$$

Sarcasm aside, Life Hacker has a more comprehensive list of buzzwords. And go to this website at your own risk for more. Another buzzword that I see emerging is Machine Learning. Now before you go on to claim on your website that your product harnesses Machine Learning technology to solve <insert a generic problem here>, it would be good to learn what exactly Machine Learning (ML) is.



Saturday, November 15, 2014

Watson! - Deducing IBM's Answer to Jeopardy

Watson, an IBM-built Question Answering computer, is best known for its extraordinary performance on the popular TV show Jeopardy. Most recently, it has been in the news for its applications in financial services and healthcare. Watson was built as part of the DeepQA project, and an overview of the project was published in a 2010 paper. This paper was part of an AI course that I took in Fall 2013. Below is a quick summary, which I wrote (rather hastily) as part of an assignment, of how Watson solves Jeopardy questions.

Building Watson: An overview of the QA project

Question Answering technology can help professionals make timely and sound decisions in areas such as compliance, legal, and healthcare. With this technology in mind, Ferrucci et al. started working on a project to build Watson, a QA computer system that would compete on Jeopardy. Jeopardy is a TV competition that asks competitors natural language questions from a broad domain. To answer a question and earn points, the competitor must be accurate, confident, and fast. To compete at a human level, Watson would have to answer around 70% of the questions asked, in less than 3 seconds each, with 80% precision.

To meet its goal, Watson incorporates QA technologies such as parsing, question classification, knowledge representation, etc. Watson's decision to press or not press the buzzer for a question depends on its confidence in the answer it generated. Since most Jeopardy questions are complex, different components of a question need to be evaluated separately; to calculate the confidence for an answer, each component's confidence is combined into the answer's overall confidence. Confidence estimation is an important part of QA technology.

An answer for a Jeopardy question might not be generated from the whole question. The subclues could be generated from two different parts of the question sentence, or the subclues could be hidden within another subclue in the sentence. Similarly, puzzles in the question can be categorized for special processing.

Certain kinds of Jeopardy questions were excluded for Watson: questions with audio-visual components were left out as being outside the project's scope.

The domain: A Lexical Answer Type (LAT) is a word in the clue that indicates the type of the answer. In 20,000 questions, 2500 distinct LATs were discovered. It is worth mentioning that about 12% of the questions did not have any distinct LAT, and the 200 most frequent LATs covered less than 50% of the questions. Since figuring out the LAT for a question is a complex process, a system like Watson must have sound Natural Language Processing and Knowledge Representation and Reasoning technology.

The metrics: To play Jeopardy, factors such as betting strategy and luck of the draw are important as well, but for building Watson the main factors being investigated are speed, confidence, and correctness. Precision is the number of correct answers out of the number of questions Watson chose to answer. Whether Watson chooses to answer a question is based on a confidence threshold: any confidence score above the threshold means Watson presses the buzzer to answer the question. This threshold controls the tradeoff between precision and percent answered. Data collected show that human winners usually have 85%-95% precision while answering 40%-50% of the questions.

Watson uses the DeepQA technology, a massively parallel, probabilistic, evidence-based architecture. More than 100 techniques are used for analyzing natural language, collecting sources, and finding, generating, and ranking hypotheses. The main principles implemented are: massive parallelism for multiple interpretations, combining evidence for probabilistic questions from a broad domain, having every component of a question contribute to its confidence, and the use of both strict and shallow semantics.

Content Acquisition: This is the process of collecting the content to be used for answer generation; it involves both a manual process and an automatic one. First, it is important to analyze the question to find out what type of answer is required (manual) and, if possible, the category of the question, possibly characterized by a LAT (automatic). The sources of content are literary works, dictionaries, encyclopedias, etc. After question evaluation, the process of content evaluation begins. Four high-level steps take place: identify seed source documents from the web, extract text nuggets from the documents, score each text nugget on how informative it is with respect to its source document, and merge the text nuggets to give one result to be evaluated.

Question Analysis: The process of question analysis uses many techniques, mainly question classification, LAT detection, relation detection, and decomposition. Question classification is the process of using words or phrases to identify the type of the question by considering the phrases' syntactic, semantic, or rhetorical functionality; some examples of question types are puzzles, math questions, and definition questions. LAT detection involves determining whether the answer to a question can be an instance of a LAT. This can be done by replacing a component of the question with the LAT and then gathering evidence for the correctness of the given LAT candidate. Similarly, many Jeopardy questions contain relations such as syntactic subject-verb-object predicates or semantic relationships. Watson uses relation detection in many parts of its QA process, from LAT detection to answer scoring. Because of the large knowledge domain in Jeopardy, detecting the most frequent relations helps with only about 25% of the Jeopardy questions.

Hypotheses generation: After question analysis, the QA system generates candidate answers by searching the system's sources and generating answer sized snippets. This stage consists of two processes: Primary search and candidate answer generation. The goal of Primary search is to find as many answer-bearing contents as possible. Many search techniques are used, mainly: multiple text-search engines, document search as well as passage search, knowledge base search using SPARQL.

Soft filtering: After candidate generation, light-weight (less resource-intensive) algorithms prune the larger set of candidates down to a smaller set; there is usually a threshold for the number of candidates in the final set. Candidates that are pruned out are taken directly to the final merging step.

Hypothesis and evidence scoring: After pruning, the candidate answers go through a stronger evaluation process consisting of two main steps: evidence retrieval and scoring. Evidence retrieval is the process of gathering content that provides supporting evidence for a candidate answer; search techniques such as passage search, using a question component plus the candidate answer as the query, are used. Scoring: after evidence gathering, each candidate answer is given confidence scores based on different factors. Watson itself employs more than 50 scoring components that range from formal probabilities to categorical features. These scorers use logical, geospatial, and relational reasoning. After these scores have been generated, they are merged for each candidate answer, and the merged scores are then ranked by a system that has been trained on a set of questions whose answers are known.