Underfitting and Overfitting of a Machine Learning Model

For supervised learning algorithms, e.g. classification and regression, there are two common cases where the generated model does not fit the data well: underfitting and overfitting.

An important property of a supervised learning algorithm is generalization: how well a model derived from the training data can predict the desired attribute of unseen data. When we say a model is underfitting or overfitting, we imply that the model does not generalize well to unseen data.

A model that fits the training data well does not necessarily generalize well to unseen data, for two reasons. 1) The training data are merely samples we collect from the real world, representing only a portion of reality. The training data may simply not be representative, so even a model that fits the training data perfectly could fit unseen data poorly. 2) The data we collect inevitably contain noise and errors. A model that fits the data perfectly would also capture this undesired noise by mistake, which eventually leads to errors in its predictions on unseen data.
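To make this concrete, here is a minimal sketch using scikit-learn (the dataset and model are illustrative choices of ours, not the examples discussed below): an unconstrained decision tree fits its training set perfectly, yet scores noticeably worse on held-out data that stands in for the unseen data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy, non-linearly separable samples standing in for real-world data.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Hold out half the samples as "unseen" data to estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree can memorize the training set, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower
```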

Before we dive into the definitions of underfitting and overfitting, here are some examples of what underfitting and overfitting models look like in a classification task.


Underfitting


An underfitting model is one that does not fit the training data well, i.e. it deviates significantly from the ground truth.

One cause of underfitting is that the model is over-simplified for the data, and therefore not capable of capturing the hidden relationships within it. As one can see from graph No. (1) above, a simple linear model (a line) cannot clearly draw the boundary between the samples of different categories, which results in significant misclassification.

As a countermeasure against this cause of underfitting, one can choose an alternative algorithm that is capable of generating a more complex model from the training data set.
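As a rough illustration of this countermeasure (again a scikit-learn sketch; the dataset and the particular models are our assumptions), a linear classifier underfits data with a curved class boundary, while an RBF-kernel SVM, a more complex model, fits it far better:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two interleaving half-moons: no single straight line separates them well.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# A linear model underfits: its decision boundary is just a line.
linear = LogisticRegression().fit(X, y)
print("linear model training accuracy:", linear.score(X, y))

# An RBF-kernel SVM can draw a curved boundary and fits the data much better.
rbf = SVC(kernel="rbf").fit(X, y)
print("RBF-kernel SVM training accuracy:", rbf.score(X, y))
```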

Overfitting


An overfitting model is one that fits the training data well, i.e. with little or no error, yet does not generalize well to unseen data.

Contrary to the underfitting case, an over-complicated model that is able to fit every bit of the data falls into the trap of capturing noise and errors. As one can see from graph No. (3) above, the model manages to make fewer misclassifications on the training data, yet it is more likely to stumble on unseen data.
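The same effect can be reproduced numerically (an illustrative sketch; the large gamma value is simply one way to force an over-flexible boundary): the over-complex classifier wins on the training data but loses on the held-out data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A very large gamma makes the RBF boundary wiggle around individual
# training points, capturing noise along with the signal.
overfit = SVC(kernel="rbf", gamma=1000).fit(X_train, y_train)

# A moderate gamma yields a smoother, more general boundary.
moderate = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("overfit  train/test:", overfit.score(X_train, y_train),
      overfit.score(X_test, y_test))
print("moderate train/test:", moderate.score(X_train, y_train),
      moderate.score(X_test, y_test))
```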

Similarly to the underfitting case, to avoid overfitting one can try another algorithm that generates a simpler model from the training data set. More often, though, one stays with the original algorithm that generated the overfitting model and adds a regularization term, i.e. a penalty on over-complicated models, so that the algorithm is steered toward generating a less complicated model while still fitting the data.
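As a sketch of the regularization idea (ridge regression is used here as one concrete example of adding an L2 penalty; it is not necessarily the algorithm behind the graphs above), the penalized fit stays smoother and typically predicts unseen points better:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# A noisy sine curve; the noise is exactly what we do NOT want to fit.
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_unseen = np.linspace(0.02, 0.98, 200).reshape(-1, 1)
y_unseen = np.sin(2 * np.pi * X_unseen).ravel()

# Degree-15 polynomial with no penalty: flexible enough to chase the noise.
plain = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)

# The same model plus an L2 penalty on the coefficients (ridge regression):
# large weights are penalized, steering the fit toward a smoother curve.
penalized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)).fit(X, y)

print("no penalty, unseen-data R^2:  ", plain.score(X_unseen, y_unseen))
print("with penalty, unseen-data R^2:", penalized.score(X_unseen, y_unseen))
```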