Tuesday, September 1, 2015

Complexity Analysis of Uninformed Tree Search

Now that we have covered the three most popular uninformed search algorithms, let's look at how they compare against each other when it comes to complexity. The decision to pick one algorithm over another is dependent on whether you care more about time or space.

First, let's get a few terms out of the way:

Completeness: Will the algorithm eventually reach the goal node?
Space: How much memory does the algorithm need? Measured as the maximum number of nodes kept in memory
Optimality: Will it find the optimal solution (the shortest path to the goal node)?
Time: How long does the algorithm take? Measured as the number of nodes generated
Step Cost: The cost of moving between two consecutive nodes explored by an algorithm

Now, let's look at each algorithm individually:

1. Breadth First Search
  • Completeness: Yes. If the number of nodes is finite, the algorithm will eventually reach the goal, since it traverses every node at each level/depth of the graph until the goal node is found
  • Time: O(b^d), where b is the branching factor (the number of children each node generates) and d is the depth of the goal node from the source node
  • Space: O(b^d), because every node generated up to the goal's depth is kept in memory
  • Optimality: Not in general. BFS traverses nodes in order of increasing depth, so unless path cost is a non-decreasing function of depth (e.g., all step costs are equal), BFS does not find the cheapest path to the goal node.
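The level-by-level expansion described above can be sketched in a few lines of Java. This is an illustrative sketch, not code from any library; the graph representation (an adjacency map) and the method name are my own:

```java
import java.util.*;

class BfsSketch {
    // Returns the order in which nodes are visited, level by level.
    static List<String> bfs(Map<String, List<String>> graph, String source) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(source);
        visited.add(source);
        while (!frontier.isEmpty()) {
            String node = frontier.remove();   // FIFO: shallowest node first
            order.add(node);
            for (String child : graph.getOrDefault(node, List.of())) {
                if (visited.add(child)) {      // add() returns false if already seen
                    frontier.add(child);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"),
            "C", List.of("D"));
        System.out.println(bfs(g, "A")); // prints [A, B, C, D]: whole level before going deeper
    }
}
```

The FIFO queue is what forces the shallowest-first order; swapping it for a stack turns this into DFS.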

Space is a bigger problem for BFS. Just take a look at the table below where both time and space costs are shown for a tree with branching factor (b) = 10.



2. Depth First Search
  • Completeness: No, not in infinite spaces where the graph has infinite depth, or in cases where the search space has a loop. The algorithm can be modified to record visited states to avoid getting stuck in a loop.
  • Time: O(b^m), where m is the maximum depth of any node and b is the branching factor (number of children generated by each node)
  • Space: O(b*m). The space complexity is linear, which makes DFS space friendly: the only nodes kept in memory are those on the current root-to-leaf path (plus their unexpanded siblings). In the image below, each shade of blue represents one path/iteration. Only the nodes in that particular iteration are kept in memory.

  • Optimality: No. DFS returns the first solution it finds, which is not necessarily the shallowest or cheapest one.
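The linear-space behaviour is easiest to see with an explicit stack. Below is a minimal iterative sketch; the graph representation and names are my own, not from any library:

```java
import java.util.*;

class DfsSketch {
    // Iterative DFS. The stack holds only the current path plus unexpanded
    // siblings, which is where the O(b*m) space bound comes from.
    static List<String> dfs(Map<String, List<String>> graph, String source) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();   // guards against loops
        Deque<String> stack = new ArrayDeque<>();
        stack.push(source);
        while (!stack.isEmpty()) {
            String node = stack.pop();           // LIFO: deepest node first
            if (!visited.add(node)) continue;    // skip repeated states
            order.add(node);
            List<String> children = graph.getOrDefault(node, List.of());
            for (int i = children.size() - 1; i >= 0; i--) {
                stack.push(children.get(i));     // reversed so the leftmost child is expanded first
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"));
        System.out.println(dfs(g, "A")); // prints [A, B, D, C]: dives to D before touching C
    }
}
```

The visited set is the loop-avoidance modification mentioned in the completeness bullet above.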

3. Uniform Cost Search (Cheapest First)
  • Completeness: Yes, provided every step cost >= ε for some ε > 0
  • Time: O(b^(C*/ε)), where C* is the cost of the optimal solution and ε is the minimum step cost
  • Space: O(b^(C*/ε))
  • Optimality: Yes. UCS finds the cheapest path to the goal node because it expands nodes in order of increasing path cost from the root node.

Saturday, August 15, 2015

Uniform Cost Search

Now that we've covered the two most popular uninformed search algorithms (Breadth First Search (BFS) and Depth First Search (DFS)), it's time for one more uninformed graph search algorithm: Uniform Cost Search (UCS). While BFS and DFS depend on nodes' positions for the order of expansion, UCS depends on nodes' cost for the order in which nodes get expanded.

UCS works by maintaining a priority queue of un-visited nodes, ordered in ascending order by a cost/distance function g(n). Each node carries its own cost/distance value, which represents its distance from the source.

Assume a root node n with distance/cost 0, a goal node g, and a priority queue Q with only the root node in it.

From Q, start with the root node. Expand its children and insert each un-visited child into Q, with the child's distance/cost set to the parent's distance/cost plus the edge distance between the parent and the child. If a child already exists in Q and the newly calculated distance is smaller than the distance recorded for it in Q, update the existing entry with the new distance. Once all children are expanded, take the node with the lowest distance/cost from Q and repeat the expansion until the goal node g is encountered.

Below is high-level code for uniform cost search. It assumes a few helper methods (removeMin(), contains(), get(), edgeCost(), and getChildren()) that a real implementation would need to supply:


public Node uniformCostSearch(Node root, Node goal) {
   PriorityQueue<Node> q = new PriorityQueue<Node>(); // ordered by distance, lowest first
   root.distance = 0;
   q.insert(root);

   while (!q.isEmpty()) {
      Node temp = q.removeMin();            // the un-visited node with the lowest distance
      if (temp.equals(goal)) return temp;

      for (Node child : temp.getChildren()) {
         int newDistance = temp.distance + edgeCost(temp, child);
         if (q.contains(child)) {
            if (q.get(child.value).distance > newDistance) {
               q.get(child.value).distance = newDistance; // found a cheaper path: update
            }
         } else {
            child.distance = newDistance;
            q.insert(child);
         }
      }
   }
   return null;                             // goal is unreachable
}


And here's the wrapper Node class to store a node's distance and value:

private class Node{
    int distance = Integer.MAX_VALUE; // "infinity" until a path from the source is found
    int value;
    
    private Node(int v){
        this.value = v;
    }
}

I have not implemented the getChildren() method, but it should be easy to implement by traversing all the links a node has.
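For completeness, here is a compact, runnable variant of the same idea using Java's built-in java.util.PriorityQueue. That class has no update/decrease-key operation, so this version takes the common shortcut of re-inserting nodes and skipping stale queue entries; the graph encoding (nested maps of edge costs) and names are my own choices for the sketch:

```java
import java.util.*;

class UcsDemo {
    // Returns the cheapest path cost from root to goal, or -1 if unreachable.
    static int ucs(Map<String, Map<String, Integer>> graph, String root, String goal) {
        // Queue entries are (accumulated cost, node), ordered by cost ascending.
        PriorityQueue<Map.Entry<Integer, String>> q =
            new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getKey));
        Map<String, Integer> best = new HashMap<>(); // cheapest cost seen per node
        q.add(Map.entry(0, root));
        best.put(root, 0);
        while (!q.isEmpty()) {
            Map.Entry<Integer, String> e = q.remove();
            int cost = e.getKey();
            String node = e.getValue();
            if (cost > best.getOrDefault(node, Integer.MAX_VALUE)) continue; // stale entry
            if (node.equals(goal)) return cost; // goal test on expansion keeps UCS optimal
            for (Map.Entry<String, Integer> edge :
                    graph.getOrDefault(node, Map.of()).entrySet()) {
                int next = cost + edge.getValue();
                if (next < best.getOrDefault(edge.getKey(), Integer.MAX_VALUE)) {
                    best.put(edge.getKey(), next); // cheaper path found: record and re-insert
                    q.add(Map.entry(next, edge.getKey()));
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> g = Map.of(
            "A", Map.of("B", 1, "C", 5),
            "B", Map.of("C", 1));
        System.out.println(ucs(g, "A", "C")); // prints 2: A -> B -> C beats the direct edge of cost 5
    }
}
```

Note that the goal test happens when a node is dequeued, not when it is generated; testing at generation time could return a more expensive path.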

Sunday, June 14, 2015

Getting to know Siri: Dynamic Time Warping algorithm for speech recognition

Siri is synonymous with that familiar voice in our iPhones today, but until recently I did not know how Siri and other interactive voice applications/programs work. Determined to rectify the situation, I went hunting for answers.

A very high level approximation for the process can be described in the process flow below:



For Step 1, different speech recognition algorithms (of which Dynamic Time Warping is one) can be used; Step 2 requires natural language processing. In this post, I focus on the first step, and more specifically on the DTW algorithm.

The basic premise is that in order to recognize an input from the user/speaker, you need to compare the input with several "templates". A template is basically a feature sequence of syllables/letters that represents a word and is compared with the input sequence the speaker provides. The DTW algorithm is used to get a good input-template alignment. For example, a good template - input alignment would look like this: 





Technically speaking, we have two time sequences:
Q = q1, q2, ..., qi, ..., qn
C = c1, c2, ..., cj, ..., cm

If Q represents the template time sequence and C represents the input time sequence, we can construct an n-by-m matrix with C (input) on the x-axis and Q (template) on the y-axis.





The (i, j) element of the matrix holds the squared distance (qi - cj)^2. Hence, finding the best match / a good alignment becomes finding a path through the matrix that minimizes the cumulative distance along the path.

The cumulative path distance is also called the Warping Cost, and the problem of minimizing it can be written as:

DTW(Q, C) = min( sqrt( sum over k = 1..K of wk ) )

where wk is the matrix element (i,j)k, i.e. the kth element of a warping path W of length K. The problem can be solved using dynamic programming, an algorithmic technique that solves a problem by breaking it into similar smaller subproblems.

Going back to the optimal path: in DTW there are only three routes that the path can take out of a given frame:



Given these three allowed routes from a frame, we can list all the possible paths in the matrix, and the path minimizing the distance between the input sequence and the template sequence is the optimal path. The complexity of finding the optimal path is O(m*n), since each of the m*n cells considers a constant number (three) of predecessors.
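The dynamic-programming recurrence described above, where each cell adds its local squared distance to the cheapest of its three allowed predecessors, can be sketched as follows. The sequences and the method name are illustrative, not from any library:

```java
class DtwSketch {
    // d[i][j] = minimum cumulative squared distance of any warping path
    // aligning q[0..i] with c[0..j].
    static double dtw(double[] q, double[] c) {
        int n = q.length, m = c.length;
        double[][] d = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double cost = (q[i] - c[j]) * (q[i] - c[j]); // local distance (qi - cj)^2
                double prev;
                if (i == 0 && j == 0) prev = 0;               // start of the path
                else if (i == 0) prev = d[0][j - 1];          // only the horizontal route exists
                else if (j == 0) prev = d[i - 1][0];          // only the vertical route exists
                else prev = Math.min(d[i - 1][j - 1],         // the three allowed routes
                        Math.min(d[i - 1][j], d[i][j - 1]));
                d[i][j] = cost + prev;
            }
        }
        return d[n - 1][m - 1]; // warping cost of the optimal path
    }

    public static void main(String[] args) {
        double[] template = {1, 2, 3};
        double[] input    = {1, 1, 2, 3}; // same shape, stretched in time
        System.out.println(dtw(template, input)); // prints 0.0: the stretch is warped away
    }
}
```

The example shows the point of DTW: the input repeats a value (a drawn-out syllable, say), yet the warping path absorbs the stretch and the cost stays zero.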




The final step is to stack up several templates against the input sequence. The template whose optimal path has the lowest distance (Warping Cost) is the best match, and the word corresponding to that template is used for language processing by Siri.



Saturday, February 7, 2015

Data Analysis - Back to the Basics

Regression -- the simplest form of machine learning. Speaking of simple things, often the simplest solution is the best solution. This post revolves around a sample case study: predicting whether a user will engage with an advertisement.

Data Prep

Feature Selection

Now there are many things that can affect a user's decision to engage with an advertisement. So, which of these variables do you pick when doing an analysis?

A common way to simplify the problem is to group the variables into bigger features. For example, you could perform two different analyses: one on the effect of the user's available demographic information and another on the effect of the user's past browsing history. Even within these analyses, one can get confused about how many sub-variables to keep. This is a common problem in data science called feature selection. One way to select only the significant features is to run a simple random-forest model on all variables. Many random-forest implementations (such as sklearn's, via its feature_importances_ attribute) give a numeric importance score to each feature. You can then select the most significant features to perform your analysis on.

Data Balancing

Anyone who has worked with data in a commercial space knows that real data can be very, very noisy. And in the online world, engagements are a rare event. This is a case of unbalanced data, which doesn't make for a good modeling dataset. One rule of thumb that I use when dealing with binary categorical data is that the common-event data size (i.e. the number of non-clicks in my data set) should not be greater than 3x the rare-event data size. Hence, if my dataset size is 1 million, at least 1/4th of it should be of the rare-event category for good balance.

Training Data Size

A good rule of thumb is to use 80% of your dataset for training the model you will use and 20% for testing. It's a good idea to shuffle the dataset before splitting the dataset into test/train category, in case the dataset is arranged in a certain order.
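The shuffle-then-split step can be sketched in a few lines. The data here is just a list of row indices; a real pipeline would carry features and labels, and the method name is my own:

```java
import java.util.*;

class SplitSketch {
    // Shuffle, then cut at the train fraction: [0, cut) is train, [cut, n) is test.
    static List<List<Integer>> trainTestSplit(List<Integer> data, double trainFraction, long seed) {
        List<Integer> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        int cut = (int) (shuffled.size() * trainFraction);
        return List.of(
            new ArrayList<>(shuffled.subList(0, cut)),
            new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) rows.add(i);
        List<List<Integer>> split = trainTestSplit(rows, 0.8, 42L);
        // prints 80 train / 20 test
        System.out.println(split.get(0).size() + " train / " + split.get(1).size() + " test");
    }
}
```

Shuffling before the cut is what protects against a dataset that is sorted by date or by label.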

Picking the right model

Knowing the underlying probability

After preparing the data set and ensuring that the right features/variables are used, it is time to pick the right model. It always helps to know the underlying probability distribution of the data. In the case of online engagement, the underlying distribution is the binomial distribution, whose parameters are defined as follows:

n = number of message views
p = underlying probability for a view to convert into an engagement (extra info: this p follows beta distribution)

In this case, the output variable is a binary categorical one (engagement/non-engagement) and the features are all numerical. Hence, it made sense to use logistic regression, which, given numerical variables, outputs the probability of a success.

In one particular case, a preliminary run of Statsmodels GLM (with family=Binomial and link=logit) gave me a 62% accuracy rate on the aforementioned 20% test set (80% of the dataset was used to train the model). In this context, an accuracy or AUC score above 0.6 is considered good enough.

When talking about logistic regression, it is important to talk about the threshold value. Logistic regression outputs the probability that a given case will result in a success (an engagement). While testing, I needed to determine the proper probability threshold for declaring a success. It might seem intuitive to say that any outputted probability above 0.5 should be categorized as a success/engagement, but that does not always give the best results. Trial and error is the best way to go, and in my case, a threshold of 0.45 gave me the highest accuracy.
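That trial-and-error search can be made systematic with a simple scan over candidate cutoffs. A sketch, with made-up probabilities and labels and an illustrative method name:

```java
class ThresholdSketch {
    // Given predicted success probabilities and true 0/1 labels, scan candidate
    // thresholds and return the first one with the highest accuracy.
    static double bestThreshold(double[] probs, int[] labels) {
        double best = 0.5, bestAcc = -1;
        for (int k = 1; k < 20; k++) {
            double t = k / 20.0;                 // candidate cutoffs 0.05, 0.10, ..., 0.95
            int correct = 0;
            for (int i = 0; i < probs.length; i++) {
                int predicted = probs[i] >= t ? 1 : 0;
                if (predicted == labels[i]) correct++;
            }
            double acc = (double) correct / probs.length;
            if (acc > bestAcc) {                 // strict > keeps the earliest best cutoff
                bestAcc = acc;
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] probs  = {0.2, 0.4, 0.48, 0.6, 0.9};
        int[]    labels = {0,   0,   1,    1,   1};
        System.out.println(bestThreshold(probs, labels)); // prints 0.45 for this toy data
    }
}
```

In practice you would run this scan on a held-out validation set, not the test set, to avoid tuning the threshold to the data you report results on.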

With a little help from my friends: Literature

In most cases, the problem at hand has already been studied by other companies, universities, and research organizations, so becoming aware of advancements in the problem space is definitely helpful. In the above case, after getting an AUC score of 62%, I was ready to end my analysis. However, I came across this paper from Facebook that suggests a hybrid model (gradient boosted decision trees + logistic regression). To elaborate, Facebook suggests transforming the features using decision trees and then feeding the transformed features into a logistic regression model. Here's a good visualization of the hybrid model:



A good approximate implementation of this hybrid model is xgboost. Using this model, my improved AUC score/accuracy was 0.7 (70%).

Tuning

Just picking the right model does not usually do the trick. For any model, it's important that the parameters governing how the model trains on the provided training data are well-tuned, which requires understanding what the parameters mean and how they affect training. In the case of xgboost, the following were some of the values I used:

max_depth (Maximum depth of trees): 6
eta (the learning rate) :0.5
objective: binary:logistic


~~~~~~~~~~~~~~~~~

To conclude, data analysis is not just picking the right model. It's a lot of data cleaning and tuning and testing decisions. There is usually no hard and fast answer to a data modeling problem, and the only way to find the best solution is to iterate and try different things at each step of the modeling.

Thursday, January 8, 2015

Machine Learning: An Intro

Machine Learning (ML) is an area in Artificial Intelligence (AI) that deals with replicating the learning processes used by humans in the form of algorithms. Ever since Alan Turing introduced the Turing Test in 1950, a whole new sector of research has emerged around creating artificial intelligence. But moving on to ML, there are many sub-kinds/sub-topics within ML.
  
But what is a Learning Problem? A learning problem can be formalized as improving an agent's performance at a task T, as measured by a performance measure P, through experience E. For example, learning to play chess is a learning problem with the task (T) of playing chess. The goal of learning is to come up with the best hypothesis (some kind of algorithm or a simple function) that, when used, can perform the task T.

Now that we know what a learning problem is, let's dive into the types of Machine Learning. Most kinds of learning can be divided into 3 sub-groups: Supervised Learning, Unsupervised Learning, Reinforcement Learning. Supervised Learning uses labeled and structured data sets for training whereas unsupervised learning does not. Labeled data sets have input and output variables defined. Reinforcement learning does not use any training data sets at all. 

Supervised Learning 

 Even within supervised learning, there are two kinds of learning: Inductive and Analytical learning.

Inductive Learning:
- Needs large sets of training data
- Does not require prior background knowledge or rules (often called a domain theory). For example, rules like the following can make up a domain theory:
        Cup <= Stable, Liftable, OpenVessel
        Stable <= BottomIsFlat
- Basically, statistical inference to fit the training data
- Examples: simple regression, decision tree learning, neural networks, genetic algorithms

Analytical Learning:
- Does not need large sets of training data
- Requires a domain theory, i.e. background knowledge of rules for the environment
- Logical deduction to fit the training data and the background knowledge
- Example: Explanation-Based Learning


 
Unsupervised Learning

Unsupervised Learning doesn't use labeled data sets for training the task performing hypothesis. The most famous example for this group is Clustering.   
 
Reinforcement Learning

Reinforcement Learning does not use any training data set. Instead, it uses the reward-penalty concept from psychology. Given an agent in a certain state with a certain number of available actions, which action should it choose? The agent should choose the sequence of actions that maximizes the cumulative reward at the end.