Underfitting & Overfitting


Error due to Bias – Accuracy and Underfitting

Bias occurs when a model has enough data but is not complex enough to capture the underlying relationships. As a result, the model consistently and systematically misrepresents the data, leading to low accuracy in prediction. This is known as underfitting.

Simply put, bias occurs when we have an inadequate model. An example might be objects that are classified by color and shape, such as Easter eggs, when our model can only partition and classify objects by color. Such a model would consistently mislabel future objects, for example labeling rainbows as Easter eggs because they are colorful.

Another example would be continuous data that is polynomial in nature, with a model that can only represent linear relationships. In this case it does not matter how much data we feed the model because it cannot represent the underlying relationship. To overcome error from bias, we need a more complex model.
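The polynomial-versus-linear case can be illustrated with a small sketch. Everything here is an assumption made for illustration: the quadratic ground truth, the noise level, and the helper name training_mse are not from the original text. The point is only that the linear fit's error stays high however many samples it sees, while a degree-2 fit gets close to the noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

def training_mse(n_samples, degree):
    """Fit a polynomial of the given degree to noisy quadratic data; return the training MSE."""
    x = rng.uniform(-3, 3, n_samples)
    y = x**2 + rng.normal(0, 0.1, n_samples)  # hypothetical quadratic ground truth
    coeffs = np.polyfit(x, y, degree)         # least-squares polynomial fit
    residuals = np.polyval(coeffs, x) - y
    return float(np.mean(residuals**2))

# More data does not help the linear model: its error reflects bias, not a lack of samples.
for n in (50, 500, 5000):
    print(f"n={n:5d}  linear MSE={training_mse(n, 1):.3f}  quadratic MSE={training_mse(n, 2):.3f}")
```

Raising the degree from 1 to 2 is the "more complex model" of the paragraph above; adding samples alone leaves the linear error roughly unchanged.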

Error due to Variance – Precision and Overfitting

When training a model, we typically use a limited number of samples from a larger population. If we repeatedly train a model on randomly selected subsets of the data, we would expect its predictions to differ depending on the specific examples given to it. Here variance is a measure of how much the predictions vary for any given test sample.

Some variance is normal, but too much variance indicates that the model is unable to generalize its predictions to the larger population. High sensitivity to the training set is also known as overfitting, and generally occurs when either the model is too complex or when we do not have enough data to support it.

We can typically reduce the variability of a model’s predictions and increase precision by training on more data. If more data is unavailable, we can also control variance by limiting our model’s complexity.
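One way to make this notion of variance concrete is to do exactly what the text describes: train the same kind of model on many random subsets and measure how much its prediction at one fixed test point moves around. The sketch below is a minimal illustration under assumed conditions (a synthetic noisy sine population, polynomial models, and a hypothetical helper named prediction_variance), not a prescribed procedure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Hypothetical population: a noisy sine curve on [0, 1].
x_pop = rng.uniform(0, 1, 10_000)
y_pop = np.sin(2 * np.pi * x_pop) + rng.normal(0, 0.5, x_pop.size)

def prediction_variance(degree, n_train, x_test=0.5, n_repeats=200):
    """Variance of the prediction at x_test over models trained on random subsets."""
    preds = []
    for _ in range(n_repeats):
        idx = rng.choice(x_pop.size, size=n_train, replace=False)
        model = Polynomial.fit(x_pop[idx], y_pop[idx], degree)
        preds.append(model(x_test))
    return float(np.var(preds))

var_simple_small = prediction_variance(degree=1, n_train=15)
var_complex_small = prediction_variance(degree=9, n_train=15)
var_complex_large = prediction_variance(degree=9, n_train=500)

print("degree 1, 15 samples: ", var_simple_small)
print("degree 9, 15 samples: ", var_complex_small)
print("degree 9, 500 samples:", var_complex_large)
```

In this setup the complex model trained on few samples should show the largest spread, and both remedies from the text (more data, or a simpler model) should shrink it.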

The main causes of overfitting are:

  • (1) using an overly complex model;

  • (2) noise in the data;

  • (3) limited training data.




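Stochastic Noise and Deterministic Noise

The noise discussed so far generally refers to random (stochastic) noise, which follows a Gaussian distribution. There is also another kind of "noise": the data generated by the unknown, complex target function f(X) mentioned earlier is, from the point of view of our hypothesis, also noise. This is deterministic noise.

For a fixed amount of data, the larger the stochastic noise, or the larger the deterministic noise (that is, the more complex the target function), the more likely overfitting becomes. In short, the factors that tend to cause overfitting are: too little data; too much stochastic noise; too much deterministic noise; and an overly complex hypothesis (excessive power).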



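The corresponding remedies are:

  • (1) Stochastic noise => data cleaning

  • (2) Overly complex hypothesis (excessive d_VC, the VC dimension) => start from a simple model, or use regularization, which also limits model complexity by adding a penalty term that penalizes complex models

  • (3) Too little data => collect more data, or "fabricate" more data according to some known regularity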
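As a rough illustration of regularization as a penalty on complexity, the sketch below fits a degree-9 polynomial with an L2 (ridge) penalty and reports how the size of the weight vector shrinks as the penalty grows. The choice of ridge regression, the synthetic data, and the helper name ridge_fit are assumptions made for this example; the text above does not name a specific regularizer.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical small data set: 15 noisy samples of a smooth curve.
x = rng.uniform(0, 1, 15)
y = np.cos(1.5 * np.pi * x) + rng.normal(0, 0.1, x.size)

# Degree-9 polynomial features: columns 1, x, x^2, ..., x^9.
X = np.vander(x, N=10, increasing=True)

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (for simplicity the intercept is penalized too)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

weight_norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam)))
                for lam in (1e-6, 1e-2, 1.0)}

for lam, norm in weight_norms.items():
    print(f"lambda={lam:g}  ||w||={norm:.3f}")
```

A larger penalty forces smaller weights, which in effect restricts the model to smoother, simpler curves even though the nominal degree stays at 9.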

Data cleaning / pruning

Correct mislabeled examples, or delete the erroneous data.

Data hinting: "fabricate" more data by adding "virtual examples".


Underfitting vs. Overfitting

This example demonstrates the problems of underfitting and overfitting, and how we can use linear regression with polynomial features to approximate nonlinear functions. The plot shows the function that we want to approximate, which is a part of the cosine function, together with samples from the real function and the approximations of different models. The models have polynomial features of different degrees.

We can see that a linear function (a polynomial of degree 1) is not sufficient to fit the training samples. This is called underfitting. A polynomial of degree 4 approximates the true function almost perfectly. For higher degrees, however, the model will overfit the training data, i.e. it learns the noise of the training data. We evaluate overfitting and underfitting quantitatively using cross-validation: we calculate the mean squared error (MSE) on the validation set, and the higher it is, the less likely the model is to generalize correctly from the training data.
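A minimal sketch of the experiment described above, using scikit-learn. The sample size, noise level, random seed, the cosine segment cos(1.5 * pi * x), the highest degree (15), and the 10-fold setting are assumptions chosen for illustration; the text only specifies a partial cosine, polynomial features of several degrees, and cross-validated MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

# Noisy samples from part of a cosine curve (assumed form of the target function).
n_samples = 30
X = np.sort(rng.rand(n_samples))
y = np.cos(1.5 * np.pi * X) + rng.randn(n_samples) * 0.1

cv_mse = {}
for degree in (1, 4, 15):
    # Linear regression on polynomial features of the given degree.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn reports negated MSE so that higher is better; flip the sign back.
    scores = cross_val_score(
        model, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )
    cv_mse[degree] = -scores.mean()

for degree, mse in cv_mse.items():
    print(f"degree {degree:2d}: cross-validated MSE = {mse:.3g}")
```

The degree with the lowest cross-validated MSE is the one that generalizes best among those tried; here that should be the middle degree rather than the simplest or the most flexible model.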

Figure: underfitting vs. fitting vs. overfitting

Learning Curves


A learning curve in machine learning is a graph that compares the performance of a model on training and testing data over a varying number of training instances.

When we look at the relationship between the amount of training data and performance, we should generally see performance improve as the number of training points increases.

By separating training and testing sets and graphing performance on each separately, we can get a better idea of how well the model can generalize to unseen data.

A learning curve allows us to verify when a model has learned as much as it can about the data. When this occurs, the performance on both the training and testing sets plateaus, and there is a consistent gap between the two error rates.


When the training and testing errors converge and are both quite high, this usually means the model is biased. No matter how much data we feed it, the model cannot represent the underlying relationship and therefore has systematically high error.


When there is a large gap between the training and testing errors, this generally means the model suffers from high variance. Unlike a biased model, a model that suffers from high variance generally improves with more data. We can also limit variance by simplifying the model to represent only the most important features of the data.
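The two diagnoses above can be sketched with scikit-learn's learning_curve function. The synthetic data, the two polynomial degrees standing in for a "biased" and a "high-variance" model, and the helper name errors_by_training_size are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(300, 1))
y = np.cos(1.5 * np.pi * X.ravel()) + rng.normal(0, 0.1, 300)

def errors_by_training_size(degree):
    """Return training sizes plus mean training and testing MSE at each size."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    sizes, train_scores, test_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=5,
        scoring="neg_mean_squared_error",
        shuffle=True,
        random_state=0,
    )
    # Scores are negated MSE, one column per fold; average the folds and flip the sign.
    return sizes, -train_scores.mean(axis=1), -test_scores.mean(axis=1)

curves = {degree: errors_by_training_size(degree) for degree in (1, 15)}

for degree, (sizes, train_mse, test_mse) in curves.items():
    print(f"degree {degree}")
    for n, tr, te in zip(sizes, train_mse, test_mse):
        print(f"  n={n:3d}  train MSE={tr:.4g}  test MSE={te:.4g}")
```

For the degree-1 model the two errors should end up close together but well above the noise level (bias); for the degree-15 model the gap should be large at small training sizes and narrow as more data is added (variance).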

Ideal Learning Curve

The ultimate goal is a model that performs well and generalizes well to unseen data. In this case, both the testing and training curves converge at similar values. The smaller the gap between training and testing performance, the better our model generalizes. The better the performance on the testing set, the better our model performs.

Model Complexity

The visual technique of graphing performance is not limited to learning curves. With most models, we can change the complexity by changing the inputs or parameters.

A model complexity graph looks at training and testing curves as the model's complexity varies. The most common trend is that as a model's complexity increases, bias falls and variance rises.

Scikit-learn provides a tool for validation curves which can be used to monitor model complexity by varying the parameters of a model. We’ll explore the specifics of how these parameters affect complexity in the next course on supervised learning.
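A minimal sketch of such a validation curve, using scikit-learn's validation_curve function. The estimator (a decision tree regressor), the parameter being varied (max_depth), and the synthetic data are assumptions picked for illustration; any model with a complexity-controlling parameter could be substituted.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.cos(1.5 * np.pi * X.ravel()) + rng.normal(0, 0.3, 200)

# Tree depth acts as the complexity knob: deeper trees can fit finer detail.
depths = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Average over the folds and convert negated MSE back to MSE.
train_mse = -train_scores.mean(axis=1)
test_mse = -test_scores.mean(axis=1)
best_depth = int(depths[np.argmin(test_mse)])

for d, tr, te in zip(depths, train_mse, test_mse):
    print(f"max_depth={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
print("depth with lowest test MSE:", best_depth)
```

Training error keeps falling as depth grows, while testing error should bottom out at an intermediate depth and then rise again, which is the usual shape of a model complexity graph.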

Model complexity
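As model complexity rises, the model's capacity to represent the data increases. But an overly complex model overfits the training data, and its ability to generalize declines.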

Learning Curves and Model Complexity

So what is the relationship between learning curves and model complexity?

If we were to take the learning curves of the same machine learning algorithm with the same fixed set of data, but create several graphs at different levels of model complexity, all the learning curve graphs would fit together into a 3D model complexity graph.

If we took the final testing and training errors for each model complexity and plotted them against the complexity of the model, we would be able to see how well the model performs as the model complexity increases.
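The idea of stacking learning curves along a complexity axis can be sketched as a small grid of testing errors, with one row per complexity level and one column per training-set size. The data-generating function, the polynomial degrees, the training sizes, and the number of repeats below are all assumptions for illustration, and only the testing error is tabulated to keep the sketch short.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def sample(n):
    """Draw n noisy samples from a hypothetical target curve."""
    x = rng.uniform(0, 1, n)
    return x, np.cos(1.5 * np.pi * x) + rng.normal(0, 0.1, n)

x_test, y_test = sample(1000)   # fixed held-out set
degrees = (1, 4, 9)             # complexity axis
train_sizes = (20, 80, 320)     # training-size axis
n_repeats = 20                  # average over several random training sets

test_mse = np.zeros((len(degrees), len(train_sizes)))
for i, degree in enumerate(degrees):
    for j, n in enumerate(train_sizes):
        errors = []
        for _ in range(n_repeats):
            x_train, y_train = sample(n)
            model = Polynomial.fit(x_train, y_train, degree)
            errors.append(np.mean((model(x_test) - y_test) ** 2))
        test_mse[i, j] = np.mean(errors)

# Each row is a learning curve; the last column is a model complexity curve.
print("rows = degrees", degrees, " columns = training sizes", train_sizes)
print(np.array2string(test_mse, precision=4))
```

Reading along a row gives the learning curve for one complexity level, and reading down the last column gives the final testing error as a function of complexity.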


Learning curve of overfitting