Machine learning concepts for beginners....
1) A simple introduction to Machine learning (ML)
Imagine you happen to key in - ucom'g?amgng - while texting someone. The next time you type in the first four letters, ucom, it would prompt you with the earlier word despite this not being in the standard dictionary. This can be seen as learning from the users' vocabulary and ML is a learning-on-the-go process.
In simple words, it is about collecting examples with specific inputs and outputs, learning all the available input/output combinations and responding to new inputs with appropriate outputs. It is similar to teaching a child that 2 + 3 = 5, 3 + 4 = 7 and 5 + 3 = 8, and then expecting the child to give the right answer, 7, to the new question 2 + 5.
Computers are expected to perform this by learning the examples with the aid of algorithms. So, machine learning is about creating or modifying algorithms that can learn from examples to give accurate output for unknown future inputs.
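The addition analogy can be turned into a small sketch. The choice of a linear model fitted by gradient descent, and the names predict and fit, are assumptions made for illustration; the point is only that the program is never told the rule for addition, it sees the three examples and then answers a new question.

```python
# Training examples from the text: (a, b) -> a + b
examples = [((2, 3), 5), ((3, 4), 7), ((5, 3), 8)]

def predict(weights, inputs):
    """Weighted sum of the inputs (a linear model with no intercept)."""
    return sum(w * x for w, x in zip(weights, inputs))

def fit(examples, learning_rate=0.02, steps=2000):
    """Fit the weights by gradient descent on the mean squared error."""
    weights = [0.0, 0.0]
    n = len(examples)
    for _ in range(steps):
        gradient = [0.0, 0.0]
        for inputs, target in examples:
            error = predict(weights, inputs) - target
            for j, x in enumerate(inputs):
                gradient[j] += 2 * error * x / n
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

weights = fit(examples)
answer = predict(weights, (2, 5))   # an input the model has never seen
print(weights, answer)
```

Both learned weights end up very close to 1, so the model's answer to 2 + 5 is very close to 7.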
2) Difference between machine learning and statistics
Both fields use data to make predictions about the future. The subtle difference is whether the relationships between variables must be inferred in order to make these predictions.
In machine learning, the algorithm learns solely from the data as input-output combinations, while in statistics the focus is on inferring the relationship between dependent and independent variables so as to make predictions about the future.
For example, in statistics, regression may be used to identify the nature (positive or negative) and strength of the relationship between past stock prices and time. In ML, the data might instead be modelled with a hidden Markov model (HMM) that learns the trend in stock prices and other implicit variables, but not the relationship between them.
In a nutshell, statistics studies the relationships between variables while ML learns to create models of the data.
3) Goals in machine learning
As model building is at the heart of ML, the goal is always to create the best or optimal model. An optimal model is one that is simple, less time consuming and makes accurate predictions for unknown future data.
A key aspect is that the model that best learns the available sample data might not be the optimal model, as it might have learned not just the data but also the inherent noise in the data. For example, assume that 100 samples of the digit '3' are fed to a model for handwriting recognition. If one of the examples had '3' written from bottom to top, then it might be fine not to learn this special case, or even to make an error in predicting it, as most individuals write from top to bottom. The intention in ML is therefore not to fit every exception or bit of noise in the data, but to create models that generalise well.
A few techniques are available to check if the model has learned just the data and not the noise in it.
4) Tasks in machine learning
In order to ensure that ML models learn data to create generalised models that can make accurate predictions for unknown future data, a systematic approach is adopted. When data is made available for ML learning purposes, it is divided into 3 parts - training set, validation set and test set.
Training, validation and test sets are used to respectively create, refine and test the models. For e.g. Imagine 10,000 customers' data is made available by a bank to create a model that can predict mortgage defaulting for potential customers. This data may be split into say training set of 8000 customers, validation set of 1000 customers and a test set of 1000 customers.
The common approach is to create the model by learning the data in the training set. The validation set is then used to refine the model, as these observations were not seen by the model during its creation. Finally, the refined model is tested on the test set to determine prediction accuracy and qualify the model for future use.
It may be noted that the validation set and model refinement are optional.
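As a minimal sketch of the 8000/1000/1000 split described above: the function name split_dataset, the fixed seed and the use of integer customer IDs in place of real records are all assumptions made for illustration.

```python
import random

def split_dataset(records, n_train, n_validation, seed=0):
    """Shuffle a copy of the records and cut it into three disjoint parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test

customers = list(range(10_000))  # hypothetical customer IDs standing in for records
train, validation, test = split_dataset(customers, 8000, 1000)
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters: if the bank's file were sorted, say by account opening date, an unshuffled split would test the model on a different kind of customer from the ones it was trained on.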
5) Frequentist and Bayesian approach
Frequentists have the attitude 'seeing is believing' while Bayesians have a gut feeling to begin with. Imagine you ordered pizza and in a few minutes someone rings the bell. A frequentist would imagine that the pizza man is at the doorstep as an order has been made. A Bayesian, on the other hand, will use his prior knowledge of the distance between the pizza shop and the house, the traffic at that time and other attributes to lessen the chance of the pizza man being at the doorstep if the bell rings too early. Thus, a Bayesian always looks at the plausibility of the outcome given other constraints, whereas a frequentist only looks at what happened and ignores other aspects.
This penalisation of the likelihood (i.e. the pizza man at the doorstep) with prior belief (i.e. traffic, distance etc.) makes the Bayesian approach a more consistent and reliable approach to decision making under uncertainty.
Though the choice of prior is contentious, its influence becomes less pronounced as more data becomes available. It is as good as saying that partners with different viewpoints who are reasonably open-minded make a good couple, as they tend to converge in their thoughts as more things happen in their lives.
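The fading influence of the prior can be sketched numerically. The setting here is an assumption chosen for simplicity: a success rate is estimated from binomial data using a Beta prior, for which the posterior mean has a simple closed form. Two observers start from opposite priors and see the same data.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Mean of the Beta posterior after observing binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

sceptic = (2, 8)    # prior belief: success rate around 0.2
optimist = (8, 2)   # prior belief: success rate around 0.8

gaps = []
for trials in (10, 100, 1000, 10000):
    successes = trials * 6 // 10          # data with a 60% success rate
    a = posterior_mean(*sceptic, successes, trials)
    b = posterior_mean(*optimist, successes, trials)
    gaps.append(b - a)
    print(trials, round(a, 3), round(b, 3))
```

The gap between the two posterior means shrinks with every increase in the amount of data, which is the convergence the couple analogy describes.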
6) Distribution
A set of values for any attribute or variable may be considered a distribution, as the pattern in the data can reveal some underlying phenomenon. This pattern may itself be captured as a model. For example, a variable that contains the number of child diabetes cases in each of the last 10 years might reveal a pattern that is worth investigating.
Probability distribution
A mapping from each possible outcome of an event to the probability of that outcome occurring.
Joint probability distribution
A bivariate distribution that depicts the probability of the joint occurrence of two events' outcomes i.e. the likelihood of a specific combination of things occurring. In terms of set theory, it is the probability of A and B occurring i.e. p(A ∩ B) or p(A,B). For e.g. the joint probability of a student taking private tuition and scoring high marks in GCSE will be different to that of students not taking private tuition. p(student scoring high marks, takes tuition) > p(student scoring high marks, no tuition)
Conditional distribution
A set of distribution values conditioned on some knowledge. For example, the probability of a child contracting chicken pox at any point in time might be very low, say 0.05. However, if anyone in his class contracts the disease (i.e. conditioning), then his probability of contracting it increases, say to 0.5. If he contracts chicken pox, the probability of his mother, who takes care of him, contracting it might be higher still, say 0.8, i.e. p(mother contracting chicken pox|son has chicken pox) = 0.8
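Joint and conditional probabilities can both be read off a table of counts. The sketch below reuses the tuition example; the counts are invented purely for illustration and the function name is an assumption.

```python
# Hypothetical counts of students: (takes_tuition, high_marks) -> count
counts = {
    (True, True): 30, (True, False): 10,
    (False, True): 20, (False, False): 40,
}
total = sum(counts.values())

# Joint distribution: probability of each combination of outcomes.
joint = {outcome: c / total for outcome, c in counts.items()}

def conditional_high_marks(tuition):
    """p(high marks | tuition status) = joint / probability of the condition."""
    p_condition = joint[(tuition, True)] + joint[(tuition, False)]
    return joint[(tuition, True)] / p_condition

print(joint[(True, True)], joint[(False, True)])
print(conditional_high_marks(True), conditional_high_marks(False))
```

The joint probability asks how common a combination is among all students, while the conditional probability restricts attention to the students who satisfy the condition and asks how common high marks are within that group.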
Marginalisation/Marginal distribution
Represents the probability of occurrence of a variable with no reference to other variables. For example, if the probability of property prices flattening is 0.6 in the event of brexit and 0.3 in the event of no brexit, the marginal distribution is the probability of property prices flattening irrespective of brexit. In this case, brexit has been marginalised out to compute the probability of property prices flattening. Each conditional probability must be weighted by the probability of its condition before the terms are added.
Marginal probability p(price flattening) = p(price flattening|brexit) * p(brexit) + p(price flattening|no brexit) * p(no brexit)
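A worked version of the property price example, using the law of total probability: each conditional probability is weighted by the probability of its condition before the terms are summed. The two conditional probabilities come from the text; the value of p(brexit) is not given there, so 0.5 is an assumption made only to complete the arithmetic.

```python
def marginal(p_given_event, p_given_no_event, p_event):
    """Law of total probability over an event and its complement."""
    return p_given_event * p_event + p_given_no_event * (1 - p_event)

p_flat_given_brexit = 0.6      # from the text
p_flat_given_no_brexit = 0.3   # from the text
p_brexit = 0.5                 # assumption: not stated in the text

p_flat = marginal(p_flat_given_brexit, p_flat_given_no_brexit, p_brexit)
print(round(p_flat, 2))
```

With these numbers the marginal probability lies between the two conditional values, as it must, since it is a weighted average of them.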
Factorisation
In simple terms, it involves splitting an expression into its component parts. In the context of a joint distribution, it seeks to split it into a conditional distribution and a marginal distribution. For example, suppose the Labour party stands a good chance of winning the next general election if and only if it has a solid economic plan that favours the working class. The joint probability then factorises as p(Labour 2020 election win, working class favouring economic policy) = p(Labour economic policy favouring working class) * p(Labour 2020 election win|working class favouring economic policy). Because a win is assumed to be impossible without such a policy, this joint probability is also p(Labour 2020 election win). The two terms on the right side of the equation are the factors.
7) Machine learning models
A variety of machine learning models are available for the learning and prediction tasks. A broad classification is given below.
Supervised and Unsupervised models
In supervised learning, the input-output combination will be clearly available in the data.
In unsupervised learning, grouping is done based on similarity between observations as the input-output combination isn't available.
For e.g. In unsupervised learning a list of blood test samples from 500 individuals may be grouped into male and female based on the similarity in blood test attributes. In supervised learning, the gender labels would be already available. The model would learn the classification labels based on the observation attributes to classify new observations into appropriate gender.
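The contrast can be sketched with a single blood test attribute. Everything specific here is an assumption made for illustration: the readings are invented, the unsupervised part is a simple two-group version of k-means, and the supervised part is a nearest-class-mean classifier.

```python
def two_means(values, iterations=20):
    """Unsupervised: split 1-D values into two groups by similarity (2-means)."""
    centres = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        centres = [sum(g) / len(g) for g in groups]
    return groups

def fit_class_means(values, labels):
    """Supervised: learn the mean value of each labelled class."""
    means = {}
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label]
        means[label] = sum(members) / len(members)
    return means

def classify(means, value):
    """Assign a new observation to the class with the nearest mean."""
    return min(means, key=lambda label: abs(value - means[label]))

# Hypothetical haemoglobin readings (g/dL); the values are made up.
readings = [12.1, 12.8, 13.0, 13.4, 15.2, 15.6, 16.0, 16.3]
labels = ["F", "F", "F", "F", "M", "M", "M", "M"]

groups = two_means(readings)                 # no labels used
means = fit_class_means(readings, labels)    # labels used
print(groups, classify(means, 15.9))
```

Note that the unsupervised step only returns two unnamed groups; deciding which group corresponds to which gender still needs outside knowledge, whereas the supervised model returns a label directly.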
Generative and Discriminative models
In generative models, each class is modelled (as in the case of supervised learning above) to make future predictions. Because they model the data itself, they can also generate synthetic data. The primary question asked during classification is: which class's model does this observation look most like?
In discriminative models, a decision boundary is created between the classes; such models cannot generate synthetic data. The key question asked during classification is: on which side of the decision boundary does the observation fall?
The earlier example of classifying observations into male and female may be tackled with generative and discriminative models depending on the availability of class labels.
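A rough sketch of the two questions, on invented one-dimensional data. The generative side is assumed to fit a normal distribution to each class, and the discriminative side is assumed to be a single threshold chosen to minimise training errors; neither choice comes from the text, they are just the simplest instances of each idea.

```python
import math
import random

def fit_gaussian(values):
    """Generative step: model a class by its mean and standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def density(value, mean, std):
    """Normal probability density at value."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def best_threshold(class_a, class_b):
    """Discriminative step: pick the cut point with the fewest training errors."""
    points = sorted(class_a + class_b)
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]
    def errors(t):
        return sum(v >= t for v in class_a) + sum(v < t for v in class_b)
    return min(candidates, key=errors)

# Hypothetical one-dimensional measurements for two classes.
class_a = [12.1, 12.8, 13.0, 13.4]
class_b = [15.2, 15.6, 16.0, 16.3]

model_a, model_b = fit_gaussian(class_a), fit_gaussian(class_b)
threshold = best_threshold(class_a, class_b)

new_value = 15.0
# Generative question: which class model does the observation resemble more?
generative_label = "a" if density(new_value, *model_a) > density(new_value, *model_b) else "b"
# Discriminative question: which side of the boundary does it fall on?
discriminative_label = "a" if new_value < threshold else "b"
# Only the generative model can produce a new, synthetic observation.
synthetic = random.Random(0).gauss(*model_a)
print(generative_label, discriminative_label, threshold, synthetic)
```

Both approaches agree on the label here, but only the class models carry enough information to draw a synthetic observation; the threshold alone says nothing about what a typical member of either class looks like.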
Classification and Regression models
In Regression, the attempt is to predict a continuous attribute while Classification aims to predict a discrete attribute. For e.g. predicting the next day's stock price (i.e. continuous value) is a regression problem, while the prediction of whether the stock price will go up or down (i.e. discrete value) is a classification problem.
Classification and Clustering
In classification problems, observations are assigned to classes that are known in advance. In contrast, clustering is about discovering structure in the data that was not previously known.
As mentioned earlier, machine learning models must be optimised for accurate predictions. The main approaches, which can be applied depending on the model, are described below.
Error Minimisation
This involves minimising the prediction error, i.e. the squared difference between actual and predicted values, summed over the data used to fit the model. Squaring prevents positive and negative differences from cancelling each other out.
When applied to a linear regression model, this is known as ordinary least squares (OLS): a method for determining the unknown parameters by minimising the sum of squared errors.
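For a straight-line model with one input, the least squares parameters have a well-known closed form, sketched below. The data points are made up, and the function names are assumptions for illustration.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) minimising the sum of squared errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_error(xs, ys, slope, intercept):
    """Total squared difference between actual and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up data, roughly y = 2x
slope, intercept = ols_fit(xs, ys)
best_error = sum_squared_error(xs, ys, slope, intercept)
print(slope, intercept, best_error)
```

Nudging either parameter away from the fitted values, for example calling sum_squared_error with slope + 0.1, gives a larger error, which is what it means for the fit to be the least squares solution.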
Maximum Likelihood
Likelihood is the probability of observing the data given the model, i.e. p(data|model). Maximum likelihood estimation (MLE) is a statistical method that estimates the unknown parameters of a probability model as the values at which the probability of observing the data is at its maximum.
Model parameters are the attributes that define a model; for example, the mean and variance define a normal distribution. (Hyperparameters, by contrast, are settings chosen before training rather than estimated from the data.)
Gaussian mixture models use the MLE approach for optimisation. Though MLE can be computationally intensive and slow, it tends to have lower variance than other methods as it uses more of the information in the available data. It is, however, recommended mainly when large sample sizes are available.
Log likelihood is the natural logarithm of the likelihood. In the MLE described above, the log likelihood is used instead of the likelihood for the reasons below:
- When the joint probability of a model is estimated, it is a product of its factors. When the log is taken, the product reduces to a sum of the logs of the individual factors, and this makes calculations simpler. (ln(ab) = ln(a) + ln(b))
- Many models assume a Gaussian distribution, and the exponential term in the Gaussian model for noise is much easier to deal with once its log is taken. (log(a^b) = b * log(a))
- Recalling from calculus, the maximum of a function is found by taking its first derivative, setting it to zero and solving for the parameter being maximised. This is much simpler with the log of the terms than with the original function itself.
- The logarithm is a monotonically increasing function, so the log likelihood achieves its maximum at the same point as the likelihood. Therefore, it is safe to use the log likelihood instead of the likelihood.
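The last point can be checked numerically. The sketch below assumes a coin-flip (Bernoulli) model and invented data of 7 heads and 3 tails, and searches a grid of candidate probabilities for the best one under each criterion.

```python
import math

heads, tails = 7, 3   # hypothetical coin-flip data

def likelihood(p):
    """Probability of the observed flips: a product of per-flip factors."""
    return p ** heads * (1 - p) ** tails

def log_likelihood(p):
    """The same quantity on a log scale: the product becomes a sum."""
    return heads * math.log(p) + tails * math.log(1 - p)

grid = [i / 100 for i in range(1, 100)]
best_by_likelihood = max(grid, key=likelihood)
best_by_log_likelihood = max(grid, key=log_likelihood)
print(best_by_likelihood, best_by_log_likelihood)
```

Both searches pick the same candidate, the observed proportion of heads, even though the two functions take very different values.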
In a nutshell, MLE estimates the parameters of a statistical model and thereby fits the model to the data.
OLS and MLE give the same parameter estimates if the errors are assumed to be normally distributed.
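The agreement between least squares and maximum likelihood under normal errors can be seen in the simplest possible model, one that predicts a single constant mu for every observation. The data, the grid of candidates and the fixed sigma of 1 are assumptions made for illustration.

```python
import math

data = [4.8, 5.1, 5.3, 4.9, 5.4]   # made-up measurements

def sum_squared_error(mu):
    """Least squares criterion for a model that always predicts mu."""
    return sum((x - mu) ** 2 for x in data)

def gaussian_log_likelihood(mu, sigma=1.0):
    """Log likelihood of the data if errors around mu are normal."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum_squared_error(mu) / (2 * sigma ** 2))

grid = [i / 100 for i in range(400, 601)]
least_squares_mu = min(grid, key=sum_squared_error)
max_likelihood_mu = max(grid, key=gaussian_log_likelihood)
print(least_squares_mu, max_likelihood_mu)
```

The Gaussian log likelihood is a constant minus a positive multiple of the sum of squared errors, so whatever minimises one maximises the other; here both land on the sample mean.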
Expectation Maximisation (EM)
This approach is used to find the parameters of a statistical model via maximum likelihood when the equations cannot be solved analytically. It is primarily adopted when some of the variables in the model are unobserved. The algorithm iterates between an E-step and an M-step. In the beginning, the distribution parameters are arbitrarily assumed. The E-step computes the expectation of the log likelihood over the unobserved variables, given the current parameters, and the M-step finds the parameters that maximise this expected log likelihood. The updated parameters feed into the next E-step, and iterations continue until the parameters (or the log likelihood) stop changing appreciably.
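A compact sketch of EM for a mixture of two one-dimensional Gaussians, where the unobserved variable is which component each point came from. The data, the starting guesses, the fixed number of iterations and the variance floor are all assumptions made for illustration; a fuller implementation would stop when the log likelihood stops improving.

```python
import math

def normal_pdf(x, mean, var):
    """Normal probability density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    # Arbitrary starting guesses for the parameters.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point,
        # given the current parameters.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters to maximise the expected
        # log likelihood under those responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # floor to avoid a collapsed component
            weights[k] = nk / len(data)
    return means, variances, weights

# Made-up data drawn around two centres, near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data)
print(means, variances, weights)
```

On this well-separated data the estimated means settle close to the two centres and each component takes about half of the weight.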
Maximum-a-posteriori (MAP)
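MAP determines a point estimate by taking the parameter value at which the posterior distribution, i.e. the likelihood weighted by the prior, is at its maximum. It can be seen as maximum likelihood with the prior belief of the Bayesian approach added in; with a flat prior, the MAP and MLE estimates coincide.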
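Taking MAP in its usual sense, the parameter value at which the posterior (likelihood combined with prior) peaks, the difference from MLE shows up most clearly with very little data. The sketch assumes coin-toss data and a Beta prior, for which the posterior mode has a closed form; the numbers are invented.

```python
def mle_estimate(successes, trials):
    """Maximum likelihood estimate of a success probability."""
    return successes / trials

def map_estimate(successes, trials, prior_a, prior_b):
    """Mode of the Beta posterior; valid when both posterior parameters exceed 1."""
    return (prior_a + successes - 1) / (prior_a + prior_b + trials - 2)

# Hypothetical data: 3 heads in 3 tosses, with a Beta(5, 5) prior centred on 0.5.
mle = mle_estimate(3, 3)           # the data alone suggest the coin always lands heads
map_ = map_estimate(3, 3, 5, 5)    # the prior pulls the estimate back towards 0.5
flat = map_estimate(3, 3, 1, 1)    # a flat Beta(1, 1) prior reproduces the MLE
print(mle, map_, flat)
```

As the number of tosses grows, the prior's contribution to the formula becomes negligible and the two estimates move together.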
9) Objective Function
An objective function is any function that is optimised during training, for example the loss function that is minimised in linear regression.
Concave and convex optimisation
Machine learning problems may be viewed as optimisation problems wherein an objective function is optimised to find the optimal point. This may be the global minimum for error minimisation or the global maximum for maximum likelihood. Sometimes, optimisation might converge at a local optimum instead.
Convex functions have a U shape while concave functions have an inverted U shape. Identifying the global optimum is therefore straightforward for convex functions, where any minimum found is the global minimum (and likewise for the maximum of a concave function). Non-convex functions can have more than one minimum, only one of which is the global minimum. Optimisation is then difficult, as it is not easy to tell local and global minima apart and the optimiser can get stuck at a local minimum.
An analogy for the maximum likelihood point on a curve may be imagined as a drive along a long route with flat and peaky terrain. There might be several high points in the route that offer scenic beauty (like local optima), but there will be one high point which offers the best view amongst all (like global optima).
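The effect of the starting point on a non-convex problem can be sketched with plain gradient descent. The function f below is an arbitrary example with two dips, chosen for illustration; the learning rate and step count are likewise assumptions.

```python
def f(x):
    """A non-convex function with one local and one global minimum."""
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=1000):
    """Repeatedly step downhill from the starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

left = gradient_descent(-2.0)    # starts on the left slope
right = gradient_descent(2.0)    # starts on the right slope
print(left, f(left))
print(right, f(right))
```

The two runs settle in different dips, and the run that started on the right ends at the shallower one: it has found a local minimum and has no way of knowing that a lower point exists elsewhere.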
10) Bias-Variance trade off
Assume that the average salary in England is being studied with data from 5,000 individuals. If the data is averaged to find the mean salary for the entire data set, the estimate is likely to have low variance. However, the bias might be high in London and low in the North East. A low bias model can be created by averaging the data per region, say London, North East, South East etc. The average salary in each of these regions will have low bias as the data within a region will be closer to its mean. However, this approach will lead to higher variance due to the considerable differences between the regional mean values. Even in this case, there can be values with high and low bias within a region, and the data may be split to further granularity until the sample size = 1. When prediction is attempted for a new point with the last case (i.e. sample size = 1), variance will be at its maximum as the predictions obtained from different models will be very different. Thus it may be seen that as data granularity is increased, bias reduces and variance increases.
In a nutshell, variance is the difference in the predictions between different models and bias is the distance of predicted value from actual value.
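The trade-off can be sketched with a small simulation. All of the numbers are assumptions made for illustration: two regions with invented true mean salaries, a common spread, and 2000 repeated surveys. Each survey yields two candidate predictions of the London salary, one from the pooled average and one from the London-only average.

```python
import random
import statistics

rng = random.Random(0)
TRUE_MEAN = {"London": 40.0, "North East": 28.0}   # hypothetical, in thousands
SPREAD = 8.0                                       # hypothetical standard deviation
PER_REGION = 50

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def draw_sample():
    """Simulate one survey with PER_REGION people from each region."""
    return {region: [rng.gauss(centre, SPREAD) for _ in range(PER_REGION)]
            for region, centre in TRUE_MEAN.items()}

pooled, regional = [], []
for _ in range(2000):
    sample = draw_sample()
    everyone = sample["London"] + sample["North East"]
    pooled.append(mean(everyone))             # one average for the whole data set
    regional.append(mean(sample["London"]))   # London-only average

# Judge both as predictions of the London salary.
bias_pooled = mean(pooled) - TRUE_MEAN["London"]
bias_regional = mean(regional) - TRUE_MEAN["London"]
variance_pooled = statistics.pvariance(pooled)
variance_regional = statistics.pvariance(regional)
print(bias_pooled, variance_pooled)
print(bias_regional, variance_regional)
```

The pooled average is consistently off for London (high bias) but changes little from survey to survey (low variance); the London-only average is centred on the right value but, being based on fewer people, varies more between surveys.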