Written by DSL
Published on August 11, 2021
In recent years, the automotive industry has generated a great deal of data in many forms, from simple inventories of specific cars to sensor data collected inside the vehicle. More and more car manufacturers are using this data to develop smarter and more efficient cars. This translates not only into the futuristic self-driving car, but also into other vehicle features: consider, for example, sensors that can indicate in time when certain parts are due for inspection or replacement.

In addition to car manufacturers, other organizations in this industry can harness the power of data as well, such as leasing companies, car dealers and importers. They are all at different stages in terms of data collection and use. In short: big leaps are being made with data and Artificial Intelligence (AI).
In this blog, we cover four data science use cases that can be relevant and valuable to organizations within the automotive industry.

Calculating residual value using Tweets

The residual value of a leased car is important information for leasing companies because this value says a lot about the lease rates the company charges its customers.
An accurate calculation of the residual value of a new car can help the leasing company choose an appropriate lease rate to maximize profits.
There is a vast amount of information available on the Internet these days.
Consider general sentiments, news articles or posts on Internet forums.
These can give an indication of the popularity of certain cars or car brands.
This large amount of text data can be combined with car data the leasing company already has, such as the make, engine size and number of doors, and fed into machine learning models to predict the residual value as accurately as possible.
The text data must first be converted to numbers, because raw text with words and letters cannot be used directly as input to a machine learning model.
For NLP purposes, word embeddings often work well as text representation.
The most well-known word embedding models are word2vec, fastText and GloVe.
As for the type of machine learning model, neural networks are often a good choice for NLP problems.
The Long Short-Term Memory (LSTM) network is a well-known option, but a Convolutional Neural Network (CNN) also performs well on text-based prediction.
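As an illustration, below is a minimal sketch of this approach in Keras. The lists `texts` and `residual_values` are hypothetical training data; the embedding layer is learned from scratch here, but could also be initialized with pretrained word2vec, fastText or GloVe vectors.

```python
# Minimal sketch: predicting residual value from text with an LSTM (Keras)
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Hypothetical training data: scraped text snippets and known residual values
texts = ["the new C-Class holds its value well", "diesel resale prices keep dropping"]
residual_values = np.array([21500.0, 14300.0])  # in euros

# Convert raw text to integer sequences, then pad to a fixed length
tokenizer = Tokenizer(num_words=10_000)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)

# The Embedding layer learns word vectors during training; it could instead
# be initialized with pretrained word2vec, fastText or GloVe weights
model = Sequential([
    Embedding(input_dim=10_000, output_dim=100),
    LSTM(64),
    Dense(1),  # single numeric output: the predicted residual value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, residual_values, epochs=10, verbose=0)
```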
Finally, choosing a good time window for the text data is important.
To calculate the residual value of a car on September 1, 2021, we could include all social media posts, car forum threads and news stories from the past month in our model.
However, we don't know whether one month is enough; we may need at least two, three or four months of data.

Canonicalization

A well-known problem within organizations is that data is not standardized.
This means that the same value within a field is not written the same way everywhere.
If “Mercedes-Benz C Class” and “Mercedes Classe C” appear in a dataset, we can easily see that they refer to the same type of car, whereas to the database, they are usually two different objects.
For modeling the data and creating insights, this causes inconsistencies.
So, to build robust prediction models and generate correct insights, we want all data to be unambiguous: we want to convert all “non-clean” data to a standardized name.
To do this, we must first know how we ideally want to describe the data, and thus create a so-called standardized list.
This standardized list is a list of all brand and model names, written as you would ideally want them to be.
If “Mercedes-Benz C Class” is a default name in that list, then we want to convert all data points called, for example, “Mercedes C-Class,” “Mercedes C220,” “Benz Class C,” “Mercedes-Benz C220d Automatic,” to the default name.
We want to minimize manual data cleaning, so we are looking for a method to automate the standardization process.
We could approach this process in several ways.
One simple but effective possible method is to use fuzzy string matching.
This allows you to match the input data, in our case the brand and model names you would like to convert, with the names from your standardized list.
Fuzzy string matching is a method that matches two texts that are not exactly, but approximately or partially similar.
One measure of the similarity between two pieces of text is the Levenshtein distance: the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one text into the other.
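As a small illustration, assuming the rapidfuzz library is available:

```python
# Tiny illustration of the Levenshtein distance using rapidfuzz
from rapidfuzz.distance import Levenshtein

# "C Class" -> "C-Class" needs one substitution (space -> hyphen)
print(Levenshtein.distance("C Class", "C-Class"))  # prints 1
```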
An advantage of fuzzy string matching is that it is relatively simple to implement.
Unlike classical machine learning models, fuzzy matching requires no training.
Pieces of text can be matched directly with each other.
This also means that values that rarely appear in the data can still be matched successfully, whereas machine learning models often need a lot of data before certain labels can be classified.
A disadvantage of fuzzy matching is that texts do need to be somewhat similar before they can be matched.
For examples such as “C-Class,” “C Classe” and “Class C,” we can be fairly confident that a successful match can be found.
However, when we use examples like “C220” and “C-Class,” it quickly becomes very difficult to find a match even though they are the same model name of a car brand.
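A minimal sketch of this matching step, again using rapidfuzz; the standardized list and the similarity threshold of 80 are illustrative choices:

```python
# Match raw brand/model names against a standardized list with fuzzy matching
from rapidfuzz import process, fuzz

standardized = ["Mercedes-Benz C Class", "Mercedes-Benz E Class", "BMW 3 Series"]
raw_names = ["Mercedes C-Class", "Benz Class C", "Mercedes Classe C", "C220"]

for name in raw_names:
    match, score, _ = process.extractOne(name, standardized, scorer=fuzz.token_sort_ratio)
    # Only accept matches above a similarity threshold (score runs from 0 to 100)
    if score >= 80:
        print(f"{name!r} -> {match!r} (score {score:.0f})")
    else:
        print(f"{name!r} -> no confident match")  # e.g. "C220" ends up here
```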
An example like the previous one is well suited to classical machine learning models.
Models such as a Random Forest are often very successful at finding links that are not obvious at first glance: with enough training data, a Random Forest could quickly learn that a “C220” is a “C-Class.”
In addition to tree-based models like the Random Forest or XGBoost, neural networks like LSTMs are also often successful on these types of problems.
In addition to choosing the type of prediction model, we must consider how to represent the words in a text.
One way to do this is a TF-IDF matrix.
In this matrix, each word in the dataset is assigned a numeric weight that represents how “important” the word is in a piece of text, relative to all texts in the entire dataset.
As mentioned earlier, an advantage of these prediction models is that they are often very accurate, provided there is enough data (a requirement that can also be considered their main disadvantage).
They are less good at predicting outcomes that are infrequent in the data or completely new, so periodic retraining of the model is often necessary.
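Below is a minimal sketch of this approach with scikit-learn; the tiny training set is purely illustrative, and a real application would need far more labeled examples:

```python
# TF-IDF features plus a Random Forest for learning standardized names
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Raw variants paired with the standardized name the model should learn
raw_names = [
    "Mercedes C220", "Mercedes-Benz C220d Automatic", "Benz Class C",
    "BMW 320i", "BMW 3 Series Sedan",
]
labels = [
    "Mercedes-Benz C Class", "Mercedes-Benz C Class", "Mercedes-Benz C Class",
    "BMW 3 Series", "BMW 3 Series",
]

# TF-IDF turns each name into a numeric vector; the Random Forest then
# learns which vectors map to which standardized name
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(raw_names, labels)

print(model.predict(["Mercedes C220 Estate"]))  # expect the C Class label
```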

Object detection: damage recognition

Accidents can happen when you least expect them, as we have all heard often enough.
Driving a car carries a real risk of accidents, and for the vehicle itself that means high damage costs.
For leased cars, liability for damage often rests with the lease driver.
Checking damage at the end of the lease term is a process that often still contains many manual steps and has great potential for automation.
Object detection algorithms such as R-CNN and YOLO can be used to detect new damage using machine learning and cameras.
A new lease car can be inspected with cameras before it is delivered to the lease driver, to record its original condition.
At the end of the lease term, the car can be inspected again, this time to compare it against that original condition.
This makes it possible to discover quickly, and with as little manual intervention as possible, whether a leased car has been damaged during the lease contract.
Based on this, an immediate indication of the potential repair costs of the damage can be given.
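As an illustration, the sketch below runs a pretrained Faster R-CNN from torchvision on a hypothetical inspection photo. An actual damage-recognition system would be fine-tuned on labeled images of dents and scratches; the pretrained model only knows generic object classes.

```python
# Run a pretrained Faster R-CNN object detector on an inspection photo
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("lease_car_return.jpg").convert("RGB")  # hypothetical photo
with torch.no_grad():
    predictions = model([to_tensor(image)])

# Each prediction contains bounding boxes, class labels and confidence scores
for box, score in zip(predictions[0]["boxes"], predictions[0]["scores"]):
    if score > 0.8:
        print(box.tolist(), float(score))
```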

Detecting fuel card fraud

Many employers offer their employees the option of a fuel card, which can be used to easily pay for the employee’s business mileage.
This eliminates the need for the employee to keep track of and declare business mileage each time afterwards.
Unfortunately, this does come with a risk of fraud.
A fuel card can be skimmed by criminals, who can then pay for fuel at the employer’s expense.
But fraud can also be committed by employees themselves, for example by periodically lending the fuel card to family members or friends.
In order to detect this type of fraud in time to avoid high costs, we can use machine learning methods that predict whether a particular transaction is fraudulent or legitimate.
Within the data science field of work, fraud detection is a common issue.
Often these fraud detection solutions focus on credit cards, but of course the underlying techniques can be used for any type of transaction provided enough data is available.
Using algorithms such as XGBoost, we can detect potential fraud when enough labeled data is available.
Even when no labels are available, we can use unsupervised methods such as the Isolation Forest and Random Cut Forest to detect anomalies (fraud) in the transaction data.
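A minimal sketch of the unsupervised route with scikit-learn’s IsolationForest; the transaction features below are illustrative:

```python
# Unsupervised anomaly detection on fuel-card transactions
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features per transaction: amount (euros), litres,
# hour of day, distance (km) from the previous transaction
transactions = np.array([
    [65.0, 45.0, 8, 12.0],
    [70.0, 48.0, 18, 30.0],
    [68.0, 46.0, 12, 8.0],
    [210.0, 140.0, 3, 400.0],  # suspiciously large, at night, far away
])

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(transactions)

# predict() returns -1 for anomalies (potential fraud) and 1 for normal points
print(model.predict(transactions))
```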

Curious about the possibilities?

Do you want to get started with a data science project or are you curious which data-driven solutions are right for you?
Feel free to contact us to explore how we can realize your data-driven goals.
Together we create the future.
