The applicability of deep learning models has grown rapidly in recent years, especially in the domain of image recognition. If you are interested in the basics of deep learning, see for example Nahua Kang’s excellent article.
In this post, we aim to clarify how such a deep learning model can be created when time and resources are limited. We also show how a deep learning model can be analyzed to check whether it is actually doing what you intend, and we hope to give you a glimpse of how you can apply deep learning techniques to an image recognition challenge of your own. For those who want to skip ahead: code to create a deep learning model can be found here, and code to create Lime images here.
For context, this guide is an extension of the project we at Data Science Lab did for the Big Data Expo 2018. For this exposition, the question surfaced whether we could create an interactive personalised game within 4 weeks. The result of our brainstorming session was to re-create the game ‘Guess Who’, familiar to at least most Dutch people from childhood.
As a quick refresher: the goal of the ‘Guess Who’ game is to guess who your opponent’s character is, by asking yes-or-no questions (e.g. is your character a male?). The requirements for this challenge were that the models would be finished in 4 weeks (our time limit before the expo) and that there would be enough labels to play the game (i.e. gender should be recognized and therefore be a potential question). We settled on a minimum of 6 labels, since the game usually ends in simply guessing who your opponent’s character is after 4 to 5 questions.
Figure 1: The Original Guess Who game
To tackle this challenge, we divided it into two parts: (A) creating a working game and (B) being able to play the game with your own custom character. The crux would be to get a computer to recognize someone’s facial characteristics, so that actual questions like “is your character a male?” could be asked. In this post we hope to clarify how such models can be created and show you how transfer learning can be used to save a lot of time and resources. Part B is discussed here; part A will be covered in a different post.
As with all machine learning tasks, the first step of our challenge was to define the problem and research how we could solve it. We played around with different in-game questions and, taking user-friendliness and achievability into consideration, came up with the following potential features to predict. These features were roughly based on questions we ourselves would be likely to ask in the original game:
Glancing at these features, we note that each can easily be defined as a classification problem: i.e. is your character a male, or is your character wearing a hat?
For simplicity, we chose to create a classification model for each individual feature. We chose this option because creating one overall multi-output model, which would predict all of these features in one go, would require a quite complex dataset (or smart workarounds that would cost extra time). With only two or three classes per model, it remains relatively simple to balance dataset distributions (and simplify learning).
Since earlier research pointed towards deep learning models, our focus quickly moved towards utilizing such techniques. In addition, to save time, we aimed to use transfer learning: extending a previous well-performing solution to work for your own challenge. We quickly found a guide which modified a Tensorflow tutorial (provided by the Tensorflow team); the detailed guide can be found here. The idea is to take the pre-trained Inception-V3 network and adapt it to solve our own image classification problem. Note that the guide linked above describes a multi-label classifier (labelling classes which may be unrelated). We used this framework and edited it back into an exclusive classifier (i.e. you either have glasses or not, nothing in between). We chose to continue with that repository because its simple structure allowed us to easily and rapidly create custom models. More on this later.
We therefore replaced
final_tensor = tf.nn.sigmoid(logits, name=final_tensor_name)
with
final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
and similarly for the cross entropy we replaced
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=ground_truth_input)
with
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=ground_truth_input)
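The practical difference between the two: a sigmoid scores each label independently (an image could score high on both classes at once), while a softmax forces the class scores to compete and sum to one. A small numpy sketch of what the final tensor computes (an illustration of the math, not the retrain script itself):

```python
import numpy as np

def sigmoid(logits):
    # Independent per-class probabilities (multi-label): each can be near 1.
    return 1.0 / (1.0 + np.exp(-logits))

def softmax(logits):
    # Mutually exclusive probabilities: they always sum to 1.
    shifted = logits - np.max(logits)  # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0])  # e.g. scores for [glasses, no glasses]
print(sigmoid(logits))         # both values can be high at the same time
print(softmax(logits))         # sums to 1: a single exclusive choice
```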
The strength of this solution lies in utilizing a pre-trained model, in this case the Inception-V3 network, to get a head start in training a model to recognize something from an image. Note that this does not always lead to the best possible solution, but it is in general a good way to start. The idea is to take the network depicted below, which is trained to recognize 1000 different objects, and instead of retraining it entirely, only retrain a Softmax classifier on top of its final layer (a 2048-dimensional vector). Don’t be scared by these terms: it only means that instead of retraining millions of parameters, we only retrain n + n*2048 of them (on the order of thousands, with n the number of classes). As you might guess, that saves a lot of computational time! Others have shared this approach, aiming for example to recognize products.
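As a quick back-of-the-envelope calculation (assuming the standard 2048-dimensional Inception-V3 bottleneck):

```python
# Parameters of the retrained layer: one weight per bottleneck value per class,
# plus one bias per class.
bottleneck = 2048   # size of Inception-V3's penultimate layer
n_classes = 2       # e.g. glasses vs no glasses
retrained = n_classes * bottleneck + n_classes
print(retrained)    # 4098 -- versus roughly 24 million for the full network
```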
A (very) simplified explanation is that this model (Inception-V3) is already good at recognizing basic things like objects and shapes (circles, facial boundaries, color differences etc.), so the subsequent step towards higher-order features (combinations of such shapes) becomes easier. Even simpler: before we humans learn to walk, we need to learn how to stand and keep our balance; already knowing those things helps us in the process of learning to walk!
Figure 2: The Original Inception-V3 network. As you can imagine, coming up with such a network (laying out and connecting all the individual pieces and deciding on the structure in general) requires quite a lot of time and testing!
A different benefit of using transfer learning is that thinking about network structures is no longer necessary. As a result, the only remaining requirements are gathering data and tuning hyperparameters. Potential parameters that could be modified included:
We could also perform image augmentation (modifying input images to increase variation and make the models more robust):
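To illustrate (a numpy sketch, not the retrain script’s own augmentation flags): horizontal flips and random crops are two of the simplest augmentations, since faces are roughly symmetric and a crop shifts the subject within the frame.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def augment(image, crop_fraction=0.9):
    """Randomly flip and crop an (H, W, C) image to create a new training sample."""
    h, w, _ = image.shape
    # Random horizontal flip: faces are roughly symmetric, so this is usually safe.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop covering `crop_fraction` of each dimension, shifting the subject.
    ch, cw = int(h * crop_fraction), int(w * crop_fraction)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw, :]

sample = rng.random((100, 100, 3))
print(augment(sample).shape)  # (90, 90, 3)
```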
We will share our experience with these hyperparameters later on. First, and most crucial in our process, we show how we gathered our data.
The greatest challenge in creating these models was gathering data. Since we were limited in colleagues, and therefore in potential combinations of features (short hair, long hair, hair colours etc.), we resorted to gathering images from the internet. A downside here is that there are great differences between images found on the internet. For example, images of people with glasses differ systematically from images of people without glasses. Speaking from experience: one group (people with glasses) could contain more people looking sideways, busier backgrounds, a bigger distance between the individual and the camera and so on, compared to the contrasting group (people without glasses). Limiting these differences between groups was key, because they could introduce great bias into our models! They could cause our models to focus on completely the wrong parts of an image (i.e. looking sideways would give a very high chance of you not wearing glasses, or having short hair!). In a nutshell, here are some tips that helped us a lot:
We had models starting to show an ‘ok’ performance (70–80%) at 100 images per category (!), and in our later runs we used approximately 1000 images per category (e.g. 1000 men vs 1000 women when predicting gender). Overall results depended on what we aimed to predict; the results are shown in Figure 4 if you want to skip ahead.
After the data gathering step was completed, we could train our predictive models based on the strategy explained above. Aiming to optimize the time spent, we only tweaked the code slightly, with the biggest adjustment making it easier to label images. After these tweaks, the only steps required to move from training to predicting were:
- image_files (used for labelling)
  - hat: all the images of people with a hat
  - none: images of people without a hat
- image_labels_dir: contains all the labels per above image
- images (used for training)
  - multi-label: contains all images in one folder
- images-cropped (optional)
  - multi-label: contains cropped images in one folder for training
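Our actual labelling script lives in the repository; a minimal sketch of the idea, assuming the image_files layout above (the function and directory names here are illustrative), could look like this:

```python
import os

def write_labels(image_files_dir, labels_dir):
    """For every image under image_files/<label>/, write a .txt file
    containing that label into the labels directory."""
    os.makedirs(labels_dir, exist_ok=True)
    for label in os.listdir(image_files_dir):
        class_dir = os.path.join(image_files_dir, label)
        if not os.path.isdir(class_dir):
            continue
        for image_name in os.listdir(class_dir):
            base, _ = os.path.splitext(image_name)
            with open(os.path.join(labels_dir, base + ".txt"), "w") as f:
                f.write(label + "\n")
```

Run once per feature (hat, glasses, etc.) before training, so that every image ends up with a matching label file.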
For a slightly more elaborate explanation, we refer the interested reader to our Github.
Most of the time here obviously lies in step 1, where the data has to be gathered. Labelling (step 2) was done by a script, and training (step 3) did not take long either (our last models finished in ± 10 minutes on a MacBook Pro with a 2.3 GHz Intel Core i5 and 8 GB of memory!).
In terms of hyperparameter tuning and results, here are our conclusions from this project:
In order to make sure that our models were representative and would show accurate results at the expo, we created a testing set. For those who are unaware: a testing set is a set of photos which our models have not seen before, and which can therefore be used to rate how well the models perform. This set included our colleagues with various attributes (fake beards, hats, glasses etc.). An impression of such photos is provided below in Figure 3; the results of our models are highlighted afterwards in Figure 4.
Looking at our 8 models and their corresponding accuracies, we can group them in three categories:
Ones that performed very well (85%+):
Ones that performed ‘ok’ (75–85%):
Ones that were most of the time correct (65–75%):
Figure 3: Test images that we created to check performance of our models.
Figure 4: Results at the Expo. Blue refers to our last test-set scores and green to how well we performed at the expo.
To our joy, all models performed more or less in line with expectations, given the relatively small number of samples (n=37). Since our models map input pixels to a predicted output (i.e. male or female), and given the great amount of noise in our data (photos from the internet), we aimed to find out how our models made their predictions. Did they actually learn to recognize glasses, or did they simply find hidden correlations (i.e. looking sideways as an indicator for having glasses)?
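A side note on that small sample size: with only 37 test photos, a single wrong prediction shifts the accuracy by almost 3 percentage points, so the scores should be read with wide error bars. A quick sketch using the normal-approximation 95% confidence interval (the counts here are illustrative):

```python
import math

def accuracy_interval(correct, n, z=1.96):
    """Normal-approximation 95% confidence interval for an observed accuracy."""
    p = correct / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

low, high = accuracy_interval(correct=31, n=37)  # e.g. a model scoring ~84%
print(f"{low:.2f} - {high:.2f}")                 # 0.72 - 0.96
```

In other words, an observed 84% on this test set is statistically compatible with anything from a low-70s to a mid-90s true accuracy.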
Deep learning has long been seen as a black-box technique. Information goes in, in our case in the form of an image, and a predictive score comes out. Whatever goes on inside the network is often a mystery: pixel values are analyzed, information is added or removed, and some feature space is learned. The network will find ways to link an input (our image’s pixels) to an output (someone is or is not wearing glasses). Opening up specific nodes within a network has led to some interesting areas of research, like style transfer.
A different approach, which we used, is taken by the creators of the Lime project. In this project, the authors propose to analyse how a change in pixel values changes the output of our predictive function. Patches of pixels are randomly perturbed (see Figure 5) to analyse which ones have the most influence on a prediction. Afterwards, the most important ones can be plotted in the form of a ‘mask’; for mask images see Figures 6 and 7.
Figure 5: Lime randomly flipping pixels to analyze how this affects a prediction.
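To give a feel for the mechanics, here is a toy numpy sketch of the perturbation idea (a simplification, not the actual lime library API): divide the image into segments, switch segments off at random, and fit a linear model relating which segments were kept to the classifier’s output. Segments with large weights are the ‘important’ ones.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def segment_importance(image, predict, n_segments, n_samples=1000):
    """Estimate per-segment importance via random on/off perturbations
    plus a least-squares linear fit (the core idea behind Lime)."""
    h = image.shape[0] // n_segments  # toy segmentation: horizontal bands
    masks = rng.integers(0, 2, size=(n_samples, n_segments))  # which bands stay on
    scores = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = image.copy()
        for seg in range(n_segments):
            if mask[seg] == 0:
                perturbed[seg * h:(seg + 1) * h] = 0  # switch this band off
        scores[i] = predict(perturbed)
    # Linear fit: score ≈ masks @ weights; large weights mark influential bands.
    weights, *_ = np.linalg.lstsq(masks.astype(float), scores, rcond=None)
    return weights

# Toy classifier that only looks at the top quarter of the image (think: 'hat').
predict = lambda img: img[:4].mean()
image = rng.random((16, 16))
print(segment_importance(image, predict, n_segments=4).round(2))
# The first band (the top rows) gets nearly all the weight; the rest stay near 0.
```

The real lime library does the same with superpixels instead of bands, a weighted local linear model, and thousands of samples per image.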
Interestingly, this technique can show us where our models focused when making a prediction. We can check whether our models actually learned to predict glasses by looking at the area around people’s eyes, or whether they simply found hidden connections and cheated us mere humans. For this demo we took two people who were not in our training sets: two Dutch legends, our king Willem-Alexander and Olympic athlete Dafne Schippers (sorry Máxima, but you were already in our training set). As for parameter settings, we created 10k samples and visualised only the top 10 features with a minimum weight of 0.03.
The code provided by the creators of Lime was modified slightly to make it work with raw image files (which our model requires, in contrast to raw pixel values). The code can also be found on our Github.
Interestingly, we can compare this to our own (human) expectations. How we classify whether someone has glasses, a hat or a tie is easy to explain. Classifying whether someone has curly or straight hair, or light or dark hair for that matter, is more difficult. Explaining how we classify whether somebody is a man or a woman is actually quite difficult! Do we look at face proportions, hair, facial hair, a jawline? Facial hair might be seen as a male trait, but men without it exist, and so do women with typically male facial characteristics (a strong jawline, short hair etc.). To say the least, these images give an interesting insight into how a model learns to map pixels to a classification.
Figure 7: Lime images indicating important areas when predicting specific features for king Willem Alexander.
Given that these are only two images, we can only take tentative notes on the results. We can see that the hat, tie and glasses models focus not only on the correct areas but also on other, less intuitive regions. Without being bound to human biases or expectations, the models are keen to find hidden relations. I myself, for example, am unaware of any relation between a throat and wearing a hat, a forehead and facial hair, or the shape of someone’s jaw and them having straight hair. Whether these relations are actually good predictors or simply based on bias in our photos is something that could be researched next. These relations do showcase how such techniques can give nice insight into whether our models are biased!
With currently available libraries it has become a lot easier to create a deep learning model. We have shown how, in only a couple of weeks and even with a noisy self-constructed dataset, quite nice accuracies can be achieved. In addition, we have shown how relatively new techniques like the Lime project can be used to visualise why models make certain predictions. These insights can help you track bias in your predictive models and further enhance results and generalisability. It is therefore not surprising that this area of research was one of the bigger trends at the 2018 AI Summit.
I hope you learned something from this post; if you have any feedback, suggestions or questions, don’t hesitate to reach out to me. Also keep an eye out for the post on part A, in which we will provide an overview of how to create a web application in Dash that can host and make use of the models created!
Writer: Ruben Sikkes, Data Scientist @ Data Science Lab.