My First AI Model

After reading a paper, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks by Finn et al. (2017), I was inspired to utilize some of the knowledge I gained from researching the topics in the paper to create my own AI model.

Since I was starting from the basics, I decided on the simplest architecture I could come up with: a single linear layer feeding a binary output that classifies an email as spam or legitimate.

My model works like the above diagram. It first scans a dataset of ~5000 emails labeled as spam or legitimate and compiles a list of the 6000 most used words. For each email in a training batch, the count of each of these words is fed into the corresponding one of the 6000 input nodes seen on the left.
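As a rough sketch (not the exact code from this project), the word-count step could look something like this. The function names `build_vocabulary` and `vectorize`, the whitespace tokenization, and the tiny three-email dataset are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def build_vocabulary(emails, vocab_size):
    """Return the vocab_size most frequent words across all emails."""
    counts = Counter()
    for text in emails:
        # Assumption: lowercase the text and split on whitespace
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(vocab_size)]


def vectorize(emails, vocabulary):
    """Turn each email into a row of word counts, one column per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    X = np.zeros((len(emails), len(vocabulary)))
    for row, text in enumerate(emails):
        for word in text.lower().split():
            col = index.get(word)
            if col is not None:  # words outside the vocabulary are ignored
                X[row, col] += 1
    return X


# Hypothetical toy data; the real model used 6000 words instead of 4
emails = ["win money now", "meeting at noon", "win a free prize now"]
vocab = build_vocabulary(emails, vocab_size=4)
X = vectorize(emails, vocab)
```

Each row of `X` plays the role of the input nodes for one email, with one column per vocabulary word.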

In order to train, I divided the dataset into ~80% training data, 10% validation data, and 10% test data. The training data was divided into uniform batches to be given to the model each epoch, or training cycle. I then used the validation data to evaluate the model at the end of each epoch to give me a sense of the accuracy trend over time. Finally, the test data was used at the end as a check of the final model’s capabilities.
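Here is a minimal sketch of that split and the batching, assuming the data is shuffled once before splitting. The helper names, the fixed seed, and the batch size of 8 are hypothetical choices, not values from the project.

```python
import numpy as np


def split_dataset(X, y, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test portions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    train, val, test = np.split(order, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])


def iterate_batches(X, y, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


# Stand-in data: 50 "emails" with 2 features each
X = np.arange(100).reshape(50, 2).astype(float)
y = (np.arange(50) % 2).astype(float)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_dataset(X, y)
batches = list(iterate_batches(train_X, train_y, batch_size=8))
```

The test portion is set aside here and only touched once, after training is finished.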

Now, diving into the math, there were three main equations I utilized. The first of these was the forward pass of the model.

Each input node from earlier has a weight, or parameter. Each node’s input is multiplied by its weight, the products are summed across all nodes, and a bias is added to produce the model’s unnormalized output.
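In NumPy, this weighted sum is a single matrix-vector product. The sketch below uses three input nodes instead of 6000, and the names `linear_forward`, `w`, `b`, and `z` are labels chosen for illustration.

```python
import numpy as np


def linear_forward(X, w, b):
    """Weighted sum of the inputs plus a bias: one unnormalized output per email."""
    return X @ w + b


# Two example emails with three word counts each (made-up numbers)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
w = np.array([0.5, -1.0, 0.25])  # one weight per input node
b = 0.1                          # a single shared bias
z = linear_forward(X, w, b)
```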

I chose to use the sigmoid function, shown above, to normalize the output, as it has a cool property of squishing all values into the range (0,1). This single decimal value is the model’s output, being greater than a set threshold if it detects a spam email and lower than that threshold otherwise.
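A small sketch of the sigmoid and the thresholding step. The default threshold of 0.5 is an assumption, since the threshold is adjustable, and the clipping is only there to avoid overflow warnings for extreme inputs.

```python
import numpy as np


def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    z = np.clip(z, -500, 500)  # guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))


def predict_spam(z, threshold=0.5):
    """True where the normalized output exceeds the threshold."""
    return sigmoid(z) > threshold


probabilities = sigmoid(np.array([-10.0, 0.0, 10.0]))
labels = predict_spam(np.array([-2.0, 3.0]))
```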

Up until now, I’ve been mentioning training a lot, so now let’s dive into exactly how I implemented backpropagation to optimize my model. The main idea behind backpropagation is first representing the loss as a function of the weights and bias, then using the gradient of that function to find the direction in which the loss falls fastest and nudging the parameters in that direction, toward a local minimum.

Source: Steemit

As seen above, the gradient of a function is a vector of the partial derivatives of f with respect to each of its inputs. At any point on the loss surface, that vector points in the direction of steepest increase, uphill toward the red values seen in this graph.

Source: ML-DL

By simply multiplying each component by -1, we can make the vector point downhill instead, toward the local minimums, or the blue areas. Stepping the parameters in that direction reduces the loss, or error, so the model becomes more accurate over time. In this case, I have represented it as a 3D graph, but in reality the loss function has one dimension per parameter, far too many to visualize, although the same principles still apply.
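To make the idea concrete, here is a toy example on a two-parameter, bowl-shaped surface rather than the real loss. The function, the starting point, and the learning rate of 0.1 are all made up for illustration.

```python
import numpy as np


def loss(p):
    # Toy bowl-shaped surface with its minimum at (1, -2)
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


def gradient(p):
    # Partial derivatives of the toy loss with respect to each parameter
    return np.array([2.0 * (p[0] - 1.0), 2.0 * (p[1] + 2.0)])


p = np.array([4.0, 3.0])  # arbitrary starting point
learning_rate = 0.1       # size of each nudge
for _ in range(100):
    # Step against the gradient, i.e. downhill
    p = p - learning_rate * gradient(p)
```

After enough steps, `p` settles very close to the bottom of the bowl at (1, -2).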

Going back to our unnormalized output, we leave out the summation for simplicity, apply sigmoid for normalization, and define our loss function to be binary cross-entropy loss, seen below.

While this equation may look complex, it boils down to a rather simple concept. Here, y represents the intended model output, either a 1 for spam or 0 for a legitimate email. Because of this, only one of the two terms is nonzero for any given email, which fits the name binary: there are only two classes. The natural log becomes more and more negative as the predicted probability of the correct class approaches 0, so the model is punished more heavily the further it is from the goal. Finally, the losses for all cases in the batch are averaged and multiplied by -1 to provide a positive loss.
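A minimal NumPy version of this loss, assuming the standard form of binary cross-entropy. The `eps` clipping is an added safeguard against taking log(0), and the example predictions are invented.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_example = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return -np.mean(per_example)  # average, then flip the sign


y_true = np.array([1.0, 0.0, 1.0])
# Predictions close to the targets give a small loss...
confident = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.9]))
# ...while predictions far from the targets give a much larger one
wrong = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.1]))
```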

Going back to the gradient, we then want to calculate the partial derivative of the loss with respect to the weight in order to find the value to update the weight by. This can be accomplished by using the chain rule to break it into partial derivatives that are each easy to compute.
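Written out for a single email, the chain rule split looks like the following. The symbols here are my own labels and may differ from the original figures: z is the unnormalized output, ŷ is the sigmoid of z, and the loss is assumed to be the standard binary cross-entropy.

```latex
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial z}
    \cdot \frac{\partial z}{\partial w},
\qquad
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}},
\quad
\frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y}),
\quad
\frac{\partial z}{\partial w} = x
```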

Solving for all three partial derivatives using our current equations and combining them results in a formula for dL/dw that I apply to each weight every training batch, with x representing the input of the node corresponding to each weight.
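For sigmoid combined with binary cross-entropy, the three factors are known to collapse to (prediction minus target) times x. The sketch below assumes that standard result, averages it over a batch, and compares it against a finite-difference estimate as a sanity check. The random data and all names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, b, X, y):
    """Mean binary cross-entropy of the single-layer model."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def gradients(w, b, X, y):
    """Batch-averaged dL/dw and dL/db: (prediction - target) times the input."""
    error = sigmoid(X @ w + b) - y
    return X.T @ error / len(y), np.mean(error)


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
w = rng.normal(size=3)
b = 0.2

grad_w, grad_b = gradients(w, b, X, y)

# Finite-difference check: nudge each weight and measure the change in loss
h = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = h
    numeric[i] = (loss(w + step, b, X, y) - loss(w - step, b, X, y)) / (2 * h)
```

If the formula is right, `grad_w` and `numeric` agree to several decimal places.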

The final part of my model is the training loop, which runs through the entire training data each epoch and, at the end of each epoch, reports the loss and accuracy on the held-out validation data, which is not used to update parameters.
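Putting the pieces together, a training loop along these lines might look like the sketch below. It is an illustration under assumptions, not the project's actual code: the learning rate, batch size, zero initialization, 0.5 threshold, and synthetic data are all placeholders.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def bce(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def train(train_X, train_y, val_X, val_y, epochs=100, batch_size=16, lr=0.1):
    w = np.zeros(train_X.shape[1])
    b = 0.0
    history = []
    for _ in range(epochs):
        # One pass over the training data, one parameter update per batch
        for start in range(0, len(train_X), batch_size):
            xb = train_X[start:start + batch_size]
            yb = train_y[start:start + batch_size]
            error = sigmoid(xb @ w + b) - yb
            w -= lr * (xb.T @ error) / len(yb)
            b -= lr * np.mean(error)
        # Validation data is only evaluated, never used for updates
        val_prob = sigmoid(val_X @ w + b)
        val_loss = bce(val_y, val_prob)
        val_acc = np.mean((val_prob > 0.5) == val_y)
        history.append((val_loss, val_acc))
    return w, b, history


# Synthetic, linearly separable stand-in for the email data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b, history = train(X[:160], y[:160], X[160:], y[160:])
```

Plotting the pairs in `history` is one way to see the accuracy trend over the epochs.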

The final results were quite good, especially for this relatively simple and straightforward approach. After training for 100 epochs, my model achieved 96.6% accuracy on the test dataset, but it scored low in other areas such as its 78.7% recall (100% – false negative rate), so I decided to continue improving it.
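To show how accuracy and recall can tell different stories, here is a small sketch that computes both, plus precision, from true and predicted labels. The function name and the ten example labels are invented, and the code does not guard against division by zero when a class is never predicted.

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall, treating 1 (spam) as the positive class."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_pred & y_true)    # spam correctly flagged
    tn = np.sum(~y_pred & ~y_true)  # legitimate mail correctly passed
    fp = np.sum(y_pred & ~y_true)   # legitimate mail wrongly flagged
    fn = np.sum(~y_pred & y_true)   # spam that slipped through
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),  # of the flagged emails, how many were spam
        "recall": tp / (tp + fn),     # of the spam emails, how many were flagged
    }


# Made-up labels: 4 spam and 6 legitimate emails
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

When most emails are legitimate, a model can miss a fair share of spam and still post a high accuracy, which is why recall is worth checking separately.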

The main difference between my model and the more common architecture was the lack of hidden layers: extra rows of neurons, each with its own weights and bias, sitting between the input and output. So in my next version, I added two hidden layers, turning the structure of my model into something like this.

Source: ResearchGate

Computing the partial derivatives for the parameter updates follows the exact same thread as before, except with each x being replaced with an A, representing the output of the previous layer, and with the chain rule carried back one layer at a time.
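Here is a hedged sketch of how the forward and backward passes could look with two hidden layers. It assumes sigmoid activations in every layer and made-up layer sizes of 5 and 3, since neither the hidden activation nor the sizes are stated above. All names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layers(sizes, rng):
    """One (W, b) pair per layer; sizes lists the node count of every layer."""
    return [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(X, layers):
    """Return the activations A of every layer, starting with the input X."""
    activations = [X]
    for W, b in layers:
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations


def backward(activations, layers, y):
    """Gradients of the mean binary cross-entropy for each (W, b) pair."""
    delta = (activations[-1] - y.reshape(-1, 1)) / len(y)  # dL/dz at the output
    grads = []
    for i in reversed(range(len(layers))):
        A_prev = activations[i]  # plays the role x had in the single-layer model
        grads.append((A_prev.T @ delta, delta.sum(axis=0)))
        if i > 0:
            W, _ = layers[i]
            # Pass the error back through W and the previous layer's sigmoid
            delta = (delta @ W.T) * A_prev * (1.0 - A_prev)
    return grads[::-1]  # same order as layers


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
layers = init_layers([4, 5, 3, 1], rng)  # input, two hidden layers, output
activations = forward(X, layers)
grads = backward(activations, layers, y)
```

Each entry of `grads` lines up with the matching `(W, b)` pair in `layers`, so the update step is the same subtraction as in the single-layer model.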

With this new update, my model performed significantly better, with 98.7% accuracy, 97.1% precision (the share of emails flagged as spam that really were spam), and 90.7% recall. With that, I decided my project was performing well enough to be called complete.

The entire model was coded and trained in VS Code using only NumPy, with no external AI libraries. While it was a bit tedious at times, I found that it really helped me get a feel for how the underlying math and data storage worked. From here, I plan to start working with PyTorch to speed up my design process, now that I have a good grasp of the basics.

Look forward to seeing more updates on my AI journey, right here on this website. Thanks for reading!

