I’m starting a new blog series designed to highlight and explore the security challenges in the AI world.
Before diving in, I wanted to start with a basic tutorial: a high-level overview of how machine learning works.
Before doing anything, you need to understand the data: where it comes from and what it represents.
Machine learning is a powerful tool, but at the end of the process you have to look at the result and decide whether the project succeeded.
This means answering the following questions:
- Does this result make any sense?
- Is this result correct?
- Can we do better?
Answering any of these questions requires some familiarity with the data.
Garbage in, garbage out: that’s how it works.
The success of the model depends on the quality of the data.
Make sure the data you have is in a format that is suitable to work with.
When we train a model, we assume the data we have represents the real world. That’s not always the case: sensor data can contain false readings from a broken sensor, and medical data can be misread by the OCR software. Mistakes happen, and you need to be ready for them.
We don’t always have all the data we need. Assume we are working with data from medical charts. Attributes such as name and height are present, but the weight attribute is missing. We need to deal with this, because some algorithms don’t know how to handle missing attributes. Do we put 0 for the weight attribute, or do we take an average?
If we decide on the latter, the average of what? The average weight of all patients? The average weight of patients of the same sex, or of the same age group? Maybe both? Different choices may give different results; the correct one depends on what you’re trying to achieve and the algorithm you’re using.
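To make this concrete, here is a minimal sketch of the “average of the same sex” choice. The records, names, and numbers are made up for illustration:

```python
from statistics import mean

# Hypothetical patient records; None marks a missing weight value.
patients = [
    {"name": "Ann",  "sex": "F", "weight": 62.0},
    {"name": "Beth", "sex": "F", "weight": None},
    {"name": "Carl", "sex": "M", "weight": 81.0},
    {"name": "Dave", "sex": "M", "weight": 75.0},
]

def impute_weight_by_sex(records):
    """Replace missing weights with the mean weight of patients of the same sex."""
    group_mean = {}
    for sex in {r["sex"] for r in records}:
        known = [r["weight"] for r in records
                 if r["sex"] == sex and r["weight"] is not None]
        group_mean[sex] = mean(known)
    # Build new records so the original data stays untouched.
    return [dict(r, weight=group_mean[r["sex"]] if r["weight"] is None else r["weight"])
            for r in records]
```

Swapping the grouping key (age group, sex and age, or nothing at all) changes the imputed values, which is exactly why the choice matters.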
Another problem we might face is inconsistent data.
- Age = “42”, Birthday = “03/07/2016”
- Was rating “1, 2, 3”, now rating “A, B, C”
Attributes with very different value ranges can confuse some machine learning algorithms.
The easiest way to deal with this is to rescale the attributes to a common range, such as [0, 1] or [−1, 1].
This step is optional and depends on your data and the model you’re going to use.
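Min-max rescaling fits in a few lines; this is a sketch with a made-up attribute, not tied to any particular library:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: nothing meaningful to rescale
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

weights_kg = [55, 70, 85]
scaled_01 = min_max_scale(weights_kg)          # into [0, 1]
scaled_11 = min_max_scale(weights_kg, -1, 1)   # into [-1, 1]
```

Note that the minimum and maximum must come from the training data only; reusing them on new data keeps the scaling consistent.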
When talking about machine learning algorithms, we usually divide them into two groups: supervised and unsupervised learning. In supervised learning, we have the correct answers and give them to the learning algorithm. In unsupervised learning, we don’t have (or don’t give) the answers; instead, we feed in the data and analyse the results.
The data goes through the algorithm and the end result is a model.
Continuing the medical chart example, let’s assume we want to build a model that can detect cancer.
A supervised learning algorithm would get the data along with labels such as cancer / not cancer, whereas an unsupervised learning algorithm would get the data and a simple goal, such as: divide this data into 2 categories.
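To illustrate the difference, here is a toy sketch with made-up measurements: a 1-nearest-neighbour classifier stands in for supervised learning, and a bare-bones 2-means clustering stands in for unsupervised learning. Real models would be far more involved.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Supervised: every example comes with the correct answer (its label).
labeled = [((1.0, 1.2), "not cancer"), ((0.9, 1.1), "not cancer"),
           ((4.0, 4.2), "cancer"),     ((4.1, 3.9), "cancer")]

def nn_predict(train, point):
    """Predict the label of the closest labeled example (1-nearest neighbour)."""
    return min(train, key=lambda item: dist2(item[0], point))[1]

# Unsupervised: no labels, only the goal "divide this data into 2 groups".
def two_means(points, iterations=10):
    """Assign each point to the nearer of two moving centroids."""
    centroids = [points[0], points[-1]]  # naive initialisation
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearer = 0 if dist2(p, centroids[0]) <= dist2(p, centroids[1]) else 1
            groups[nearer].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return groups
```

The supervised version can name its answer (“cancer”); the unsupervised version can only say which group a point belongs to.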
Another decision that has to be made is the trade-off between complexity and performance. Complex models tend to give better results, but they are slower. We can wait a few minutes for the result of a cancer detection model; we can’t do the same with malware detection, where we need to scan every file accessed by the OS.
The success of the model depends on a few things: knowledge of the training algorithm, the quality of the data, and the amount of data.
The feature selection process deals with removing noise and unnecessary attributes from the data.
By removing unnecessary attributes we get better performance, more robust models, and results that are easier to interpret.
Remove attributes that don’t give us any information. This includes things such as unique identifiers, auto-generated IDs, and timestamps.
Attributes that are related to each other can also lead the algorithm in the wrong direction, so highly correlated attributes should be removed.
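One way to act on this, sketched here with an invented data set and an arbitrary threshold: compute pairwise Pearson correlation and keep only one attribute out of each highly correlated pair.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sized columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def drop_correlated(columns, threshold=0.95):
    """Keep an attribute only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

data = {
    "height_cm": [170, 180, 160],
    "height_in": [66.9, 70.9, 63.0],  # same information as height_cm
    "weight_kg": [70, 95, 80],
}
```

Here `height_in` is dropped because it carries the same information as `height_cm`; `weight_kg` survives because it adds something new.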
There are three general classes of feature selection algorithms: filter methods, wrapper methods and embedded methods.
Filter feature selection methods apply a statistical measure to assign a scoring to each feature. Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. Embedded methods learn which features best contribute to the accuracy of the model while the model is being created.
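As a sketch of the filter approach, score each feature independently against the target and keep the top k. The scoring measure used here, absolute correlation with the target, is only one of many possible choices, and the data is invented:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sized columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def filter_select(features, target, k=1):
    """Filter method: score each feature independently, keep the k highest scoring."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

features = {
    "tumour_size": [1.0, 2.0, 3.0, 4.0],
    "patient_id":  [7.0, 3.0, 9.0, 1.0],  # an identifier; should carry no signal
}
target = [0, 0, 1, 1]
```

A wrapper method would instead train and evaluate a model on many feature subsets, which is far more expensive but accounts for interactions between features.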
An Introduction to Feature Selection by Machine Learning Mastery is a good article on this topic.
The technique used for testing depends on the size of the data, the available time, and resources.
Hold-out is good for big data sets, k-fold for medium-sized data sets, and leave-one-out for small data sets.
Hold-out: split the data into 2 sets, one for training and another for testing. The sets don’t have to be equal in size.
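A minimal hold-out split, sketched in plain Python (the 70/30 ratio and fixed seed are just example choices):

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle the data, then hold out a fraction of it for testing."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Shuffling first matters: if the data is sorted (say, by date or by class), a straight cut would give the test set a different distribution than the training set.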
K-fold: divide the data into k sets. One of them is used as the test set; the other k − 1 are used for training. Repeat the process k times, so each set serves once as the test set.
The standard choice is k = 10.
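The k-fold procedure can be sketched as a generator of (train, test) pairs; the round-robin fold assignment is one simple way to build the folds:

```python
def k_fold_splits(data, k=10):
    """Yield k (train, test) pairs; each fold serves exactly once as the test set."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, folds[i]
```

Every item ends up in a test set exactly once, so the k test results together cover the whole data set.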
Leave-one-out: take one item out of the data set and use it for testing; the rest of the data set is used for training.
Do the training and testing, then return the item to the data set.
Repeat for every item in the data set.
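Leave-one-out is just k-fold with k equal to the number of items; as a sketch:

```python
def leave_one_out(data):
    """Each item is the test item exactly once; the remaining items form the training set."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], data[i]
```

With n items this means n training runs, which is why it only makes sense for small data sets.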
When it comes to supervised classification algorithms, an easy and practical evaluation method is to build a confusion matrix.
Below is an example of such a matrix for a classifier with 3 classes.
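A confusion matrix can be built in a few lines. In this sketch, with invented classes and predictions, rows are the actual classes and columns the predicted ones, so the diagonal counts the correct answers:

```python
def confusion_matrix(actual, predicted, classes):
    """Count predictions per (actual, predicted) pair; rows = actual, columns = predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

classes   = ["cat", "dog", "bird"]
actual    = ["cat", "cat", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "bird", "cat"]
```

Off-diagonal cells show exactly which classes the model confuses with which, which is far more informative than a single accuracy number.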
Another trick I’ve found useful when evaluating different models is to assign a misclassification cost to every class combination and use those numbers when comparing models.
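As a sketch of that trick, with entirely made-up cost values: a model’s score becomes the total cost of its mistakes, so errors that matter more weigh more.

```python
def total_cost(actual, predicted, costs):
    """Sum the cost assigned to each (actual, predicted) class combination."""
    return sum(costs.get((a, p), 0) for a, p in zip(actual, predicted))

# Made-up costs: missing a cancer case is far worse than a false alarm.
costs = {("cancer", "not cancer"): 100,
         ("not cancer", "cancer"): 1}
```

Two models with the same accuracy can then get very different scores, depending on which kind of mistake each one makes.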
Evaluating unsupervised learning algorithms depends on what you’re trying to achieve. If we decide to go with a clustering algorithm, we would look at the cluster structure and try to answer questions such as:
- Do clusters have items with similar properties?
- Is the data properly divided?
- How big are the clusters?
- How similar are the clusters?
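Some of these questions can be answered numerically. As a sketch, here is the size and tightness (mean squared distance to the centroid) of each cluster; the clusters themselves are invented:

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_stats(clusters):
    """Report each cluster's size and spread (mean squared distance to its centroid)."""
    stats = []
    for points in clusters:
        centroid = tuple(sum(coord) / len(points) for coord in zip(*points))
        spread = sum(dist2(p, centroid) for p in points) / len(points)
        stats.append({"size": len(points), "spread": spread})
    return stats
```

A small spread means the cluster’s items have similar properties; comparing centroid distances between clusters answers how similar the clusters are to each other.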
The goal of this post is to give you a high-level overview of how machine learning works. The inner workings of the different algorithms are intentionally left out; explaining them in this context would only add confusion. They will be covered in future blog posts, where I will show how a selected algorithm works and then the security issues associated with it.
The main things I want you to take away from this blog post are:
- The success depends on the quality of the data.
- Quality data means carefully selected samples which represent the real world.
- Feature selection and data preparation are the two most important steps.
- Machine learning is a trial-and-error process.
- Algorithms are not a black box; knowing about their complexity and performance leads to better decisions when trade-offs have to be made.