I recently took Andrew Ng’s Machine Learning course on Coursera, and I’m hoping to write a series of blog posts on what I learnt. In these we will look at a variety of machine learning techniques and categories, starting with linear and logistic regression.
Machine learning: field of study that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959)
Machine learning is something of a buzzword at the moment, but underneath all the hype it’s a technology that’s expected to revolutionise virtually all industries, and have a huge impact on people’s lives in the coming decades.
Machine learning problems can be split into supervised learning and unsupervised learning. Supervised learning works by giving the algorithm the “right answers” (labelled examples), which are used to train it so that it can predict the answer when given new examples. With unsupervised learning, you give the algorithm unlabelled data, which it then learns to group based on similarities.
Supervised learning can be further split into two kinds of problem: regression and classification. With regression problems we are trying to predict results with a continuous output, which is where linear regression is used, whereas with classification problems we are trying to fit data into discrete categories, which is where logistic regression is used. Let’s look first at linear regression.
One way to fit a regression is gradient descent, which involves taking the derivative of the cost function with respect to each parameter of our hypothesis in order to find the direction of steepest descent towards the minimum. The cost function essentially compares the predicted results against their known values across the data set; in other words it’s a measure of the accuracy of the hypothesis, where a lower cost means a better fit.
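As a rough sketch of what such a cost function might look like for linear regression, here is the halved mean squared error form used in the course. The function names and the toy data are made up for illustration.

```python
def predict(theta, x):
    """Linear hypothesis: theta[0] + theta[1]*x[0] + theta[2]*x[1] + ..."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def cost(theta, xs, ys):
    """Halved mean squared error between predictions and known values."""
    m = len(xs)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Toy data lying exactly on y = 1 + 2x (invented for this example).
xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 3.0, 5.0]

perfect_cost = cost([1.0, 2.0], xs, ys)  # hypothesis matches the data exactly
poor_cost = cost([0.0, 0.0], xs, ys)     # hypothesis always predicts 0
```

The hypothesis that matches the data has a cost of zero, while the one that always predicts 0 has a noticeably higher cost.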
Imagine a graph with the x and y axes representing the two parameters of the hypothesis (θ0 and θ1), and the z axis representing the cost function. Plot the cost for each combination of parameter values and you get a surface; each regression step moves down this surface towards the minimum.
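In order to perform gradient descent for a regression we must repeat the regression step, updating all of the parameters simultaneously, until we converge on the minimum. For linear regression the cost function is bowl-shaped (convex), so this is the global minimum.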
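For a hypothesis with a single feature, the regression step is usually written as follows. The notation is assumed here rather than taken from this post: m is the number of training examples, (x^(i), y^(i)) is the i-th example, h_θ(x) = θ0 + θ1·x is the hypothesis, and α is the learning rate.

```latex
\text{repeat until convergence:}\quad
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{aligned}
```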
The above formulae are repeated for each step until the regression has converged. The alpha in the formulae is the learning rate, which controls how big a step is taken on each iteration. If the learning rate is too small it will take a long time to converge, whereas if it is too large it may not converge at all, as each step may overshoot and keep hopping around the minimum without ever reaching it, or even move further away. The above formulae represent only a regression with a couple of parameters, whereas in reality we will have many more, possibly hundreds or even thousands, so we will need a generalised formula that can perform the regression step for all the parameters in one go.
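A minimal sketch of the generalised regression step, assuming the squared-error cost for linear regression. The function names, toy data, default learning rate and iteration count are all illustrative choices, not values from the course.

```python
def predict(theta, x):
    """Linear hypothesis: theta[0] + theta[1]*x[0] + theta[2]*x[1] + ..."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def gradient_step(theta, xs, ys, alpha):
    """One regression step that updates every parameter simultaneously."""
    m = len(xs)
    errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
    # Prepend a constant 1 to each example so theta[0] is handled like the rest.
    rows = [[1.0] + list(x) for x in xs]
    gradients = [
        sum(e * row[j] for e, row in zip(errors, rows)) / m
        for j in range(len(theta))
    ]
    return [t - alpha * g for t, g in zip(theta, gradients)]


def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Repeat the regression step a fixed number of times, starting from zeros."""
    theta = [0.0] * (len(xs[0]) + 1)
    for _ in range(iterations):
        theta = gradient_step(theta, xs, ys, alpha)
    return theta


# Toy data lying exactly on y = 1 + 2x (invented for this example).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(xs, ys)
```

On this toy data the parameters settle very close to [1, 2]. Re-running with a much larger learning rate, such as alpha=1.0, makes the steps overshoot and grow instead of settling.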
Another way to speed up gradient descent is to use feature scaling, which is where we scale each variable so that it is in roughly the same range as the others. For instance if we had a variable which ranged from around 0.0001 to 0.001 and another which was around 100 to 1000 we would want to scale these so that they were both around 0 to 1 by multiplying the former and dividing the latter. This prevents gradient descent from oscillating inefficiently down to the optimum when the variables are on very uneven scales.
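One common way to do this is min-max scaling, sketched below. This is an assumption on my part about the scaling method (the course also covers mean normalisation), and the function name and sample values are invented.

```python
def scale_feature(values):
    """Rescale a list of values to the range 0 to 1 (min-max scaling)."""
    lowest = min(values)
    value_range = max(values) - lowest
    if value_range == 0:
        # A constant feature cannot be rescaled; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / value_range for v in values]


tiny = [0.0001, 0.0005, 0.001]   # a variable on a very small scale
large = [100.0, 400.0, 1000.0]   # a variable on a much larger scale

scaled_tiny = scale_feature(tiny)
scaled_large = scale_feature(large)
```

After scaling, both variables run from 0 to 1, so neither dominates the size of the gradient steps.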
Linear regression is ineffective at solving classification problems because the output we want to predict is not a linear function of the inputs: a straight line can predict values well below 0 or above 1, which are meaningless as categories. The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1.
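For binary classification the hypothesis is usually the sigmoid (logistic) function applied to the linear combination of the features. The notation is assumed here: θ is the parameter vector, x is the feature vector with a leading 1, and the output is read as the probability that y = 1.

```latex
h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}, \qquad 0 < h_\theta(x) < 1
```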
The above formula is the hypothesis (hθ) and the below formula is the regression step. Notice how the regression step has the same form as for linear regression, the only difference being the hypothesis hθ that is plugged into it (and the cost function it is derived from).
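A minimal sketch of that regression step for logistic regression, assuming the sigmoid hypothesis and the usual log-loss cost. The function names, toy data, learning rate and iteration count are all invented for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """Logistic hypothesis: sigmoid of theta[0] + theta[1]*x[0] + ..."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))


def gradient_step(theta, xs, ys, alpha):
    """Same form as the linear regression step, with the logistic hypothesis."""
    m = len(xs)
    errors = [hypothesis(theta, x) - y for x, y in zip(xs, ys)]
    rows = [[1.0] + list(x) for x in xs]
    return [
        t - alpha * sum(e * row[j] for e, row in zip(errors, rows)) / m
        for j, t in enumerate(theta)
    ]


# Toy one-feature data: small values are class 0, large values are class 1.
xs = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 0, 1, 1, 1]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = gradient_step(theta, xs, ys, alpha=0.5)

# Predict class 1 when the hypothesis is at least 0.5.
predictions = [1 if hypothesis(theta, x) >= 0.5 else 0 for x in xs]
```

The only change from the linear version is the `hypothesis` function. The update itself still subtracts the learning rate times the average of error times feature.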
The problem of over-fitting
An issue you can encounter with regressions is over-fitting of the data set – where the analysis attempts to model more detail than can be supported by the data, essentially incorporating noise in the results. A simple way to improve this is to reduce the number of features, manually deciding which ones aren’t important.
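The other way to reduce over-fitting is to use regularisation. This works by keeping all the features but reducing the magnitude of their parameters (the θ values), and is especially effective when all the features are at least slightly useful, since none of them has to be discarded.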
Here the λ, or lambda, is the regularisation parameter. It determines by how much the costs of our theta parameters are inflated.
For linear regression the regularised cost function is:
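In the usual notation this can be written as follows. The symbols are assumed here: m training examples, n features, h_θ the linear hypothesis, and λ the regularisation parameter. By convention the sum in the penalty starts at j = 1, so θ0 is not penalised.

```latex
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^{2} + \lambda \sum_{j=1}^{n} \theta_j^{2} \right]
```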
And for logistic regression (classification problems) the regularised cost function is:
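In the same assumed notation (m training examples, n features, h_θ the sigmoid hypothesis, λ the regularisation parameter), this is usually written as the log-loss cost plus the same penalty on θ1 to θn:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}
```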
Regularisation in action
Regularisation “smooths out” the fit by reducing the impact of features which would otherwise distort the line of best fit to match the available data exactly.
The graph above gives an example of how regularisation can improve the fit. The blue line represents the unregularised regression, and you can see how the quadratic features have fit the current data set very well to hit the data at the right points. However, if you were to present this trained algorithm with new data it wouldn’t match that data very well, because it is over-fitting the training set. The purple line represents the regularised regression. You can see how, if presented with new data, this would be more likely to still fit that data relatively well.
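To make the effect on the parameters concrete, here is a small sketch of regularised gradient descent for linear regression with a quadratic feature. The data, the value of lambda, the learning rate and the function names are all invented, and this is not the data behind the graph.

```python
def predict(theta, x):
    """Linear hypothesis: theta[0] + theta[1]*x[0] + theta[2]*x[1] + ..."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def regularised_step(theta, xs, ys, alpha, lam):
    """One gradient descent step on the regularised squared-error cost."""
    m = len(xs)
    errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
    rows = [[1.0] + list(x) for x in xs]
    new_theta = []
    for j, t in enumerate(theta):
        gradient = sum(e * row[j] for e, row in zip(errors, rows)) / m
        if j > 0:  # by convention the intercept theta[0] is not penalised
            gradient += (lam / m) * t
        new_theta.append(t - alpha * gradient)
    return new_theta


def fit(xs, ys, lam, alpha=0.1, iterations=20000):
    """Run regularised gradient descent from zeros for a fixed number of steps."""
    theta = [0.0] * (len(xs[0]) + 1)
    for _ in range(iterations):
        theta = regularised_step(theta, xs, ys, alpha, lam)
    return theta


def penalty(theta):
    """Sum of squared parameters, excluding the intercept."""
    return sum(t ** 2 for t in theta[1:])


# Made-up noisy data, with features x and x squared.
raw_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.5, 0.3, 0.9, 0.7, 1.2]
xs = [[x, x ** 2] for x in raw_x]

unregularised = fit(xs, ys, lam=0.0)
regularised = fit(xs, ys, lam=1.0)
```

Comparing `penalty(regularised)` with `penalty(unregularised)` shows the regularised parameters are smaller in magnitude, which is what gives the smoother fit. Setting lambda too high would shrink them so far that the model under-fits instead.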
This concludes the first in my series of machine learning blog posts. The following ones will look into other kinds of machine learning such as neural networks, unsupervised learning and recommendation systems. I took Andrew Ng’s course to increase my understanding of a technology that’s set to play an increasingly big role in our lives – I hope these posts will pass on some of my fascination for the subject.