Fast AI is a free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.
These are the notes for Part 1 of the course which can be found here
I ended up re-reading the book after going through sections of the course and so I thought I would try to consolidate everything here.
Neural Networks are fundamentally mathematical functions - they take in some input and spit out some output.
We use an objective function - often called a loss function to determine the quality of our mode's predictions. This then allows us to fine tune the parameters of our model.
Important vocabulary to know of is
- The functional form of the model is called its architecture
- The weights are called parameters
- The predictions are calculated from the independent variables
- The results of the models are called predictions
- The measure of performance is called the loss
- Loss depends on the predictions but also the correct label
Ideally, we want to use a pretrained model when working with a new machine learning problem. This is because our model comes out of the box with certain abilities - eg. ability to recognise edges.
The Fast.AI vision learner for instance will always remove the last layer of the pretrained model and replace it with a new layer that has randomized weights. This is because the last layer of the model is the one that is most specific to the task at hand.
When working with a model, we want to make sure that we have a validation set. This is a set of data that we don't train our model on. This is important because we want to make sure that our model is able to generalise to data that it hasn't seen before. This helps to prevent overfitting.
Relevant Portions : Chapter 4
Neural networks work by starting with a series of randomized weights and then using the gradient of the predictions to adjust the weights. This is called gradient descent.
We then adjust the weights by a set scaled amount of the actual gradient. This is known as the learning rate.
Choosing the right learning rate is very important! If we choose one that's too small, we'll never be able to reach a good equilibrium. If we choose one that's too big, we'll just end up bouncing around.
We measure the quality of our prediction using a loss function. It's important here to choose the right loss function. This needs to be a reasonably smooth function without big flat sections and large jumps. This is because we want to be able to use the gradient of the loss function to adjust our weights.
Often times, we'll be working with large datasets. So we perform this optimisation in minibatches - whereby we operate on portions of our data set at a time. This is because it's much faster to calculate the gradient of a minibatch than it is to calculate the gradient of the entire dataset.
Relevant Portions : Multi-Target: Road to The Top p4
We can improve the quality of our predictions by using multi-type models. These are models that are able to calculate predictions for multiple dependent variables
| Model |
| Shared Layer |
| Final Prediction |
The Fast AI example was one whereby we output a 20 element vector where the first 10 corresponded to the disease type and the next 10 corresponded to the variety type.
Relevant Portions : Race to the Top Part 2 which shows an example of TTA.
Test Time Augmentation refers to the process of making predictions on multiple versions of our data and then averaging the results. This is because we want to make sure that our model is robust to changes in the data.
There are a variety of different issues when training models chief of which is overfitting. This happens when our model memorises the data that it has seen in its training set and is unable to generalise to new data. We can observe this sometimes when the loss in our training set decreases but we have no noticable difference in our validation set. There are a couple of ways to avoid this.
We can modify our data to increase the number of samples or make our model more robust to changes in the data.
Potential ways to do so are
- Random Resizing
- Random Crops
Always build an end to end pipline before starting on a deep dive and optimisation of a specific component. When training a new model, what's most important is to look at the data. There's no need to look towards fancy model architectures before you extract the maximum amount of alpha from your data. For instance, Jeremy recommends using the
resnet-18 model as a baseline since it's quick, fast and easy to train.
Relevant Portions: Collaborative Filtering Deep Dive
Weight Decay involves modifying our loss function to penalise large weights. This is because large weights are often a sign of overfitting. We can do so by adding the sum of the square of each of our weights. This can be expressed as
loss=loss + wd*(weights**2).sum()
wdis the weight decay hyperparameter that controls the extent of penalisation of large weights.
Relevant Portions :
Deep Learning is a useful method for dealing with problems without a clear solution. Often times, computers are able to find patterns much better than humans can.
Some examples of areas where deep learning has been applied are
- Computer vision
- Natural Language Processing
- Recomender Systems
Relevant Portions : Chapter 3
Because deep learning is such a powerful tool and can be used for so many things, it becomes particularly important that we consider the consequences of our choices. The philosophical study of ethics is the study of right and wrong, including how we can define those terms, recognize right and wrong actions, and understand the connection between actions and consequences.
Some examples of where deep learning has had unintended consequences are
- Feedback Loops - when youtube's recomendation engine helped to create curated playlists for pedophiles.
- Bias - when google's ad engine displayed ads for criminal background checks when a traditionally african american name was searched for.
Any dataset involving humans can have this kind of bias: medical data, sales data, housing data, political data, and so on. Because underlying bias is so pervasive, bias in datasets is very pervasive