Data for Good

AI & IPA

Written by

DSL

Published on

August 14, 2024

[vc_row type=”in_container” full_screen_row_position=”middle” scene_position=”center” text_color=”dark” text_align=”left” top_padding=”45″ overlay_strength=”0.3″ shape_divider_position=”bottom” bg_image_animation=”none” shape_type=””]

[vc_row_inner column_margin=”default” text_align=”left”]

[vc_custom_heading text=”THE TECHNOLOGY BEHIND THE NEW ENGLAND IPA.” font_container=”tag:h3|text_align:left” use_theme_fonts=”yes”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

Making a beer with AI: nice idea!
But how do we make it a reality?
Dozens of possibilities and ideas immediately came to mind.
But which one is the most feasible?
That depends on the user who will work with the final output of the model.
In this blog, we elaborate on the technology and models applied during the process and answer the question: how did this data-driven beer technically come about?
The previous blog on web scraping discusses how the data was collected; that data is the input to this process.
It consists of online reviews taken from various online platforms, with features such as ratings, comments, beer style and alcohol content.
To arrive at a data-driven beer, we need the characteristics of the different beers in the data set, supplemented by a target variable to predict.
In our case, the target is a weighting of a beer's average review score with its number of reviews.
The goal is to train a model with the characteristics and flavors of the beer as features, and to use it to find the combination that maximizes this weighted score.
In other words, we are looking for the optimal features for a beer of a specific beer style.
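The post does not give the exact formula for weighting the score with the number of reviews. As a minimal sketch, one common choice is a Bayesian-style average that pulls beers with few reviews toward the overall mean. The function name, the `min_reviews` parameter and its default are assumptions for illustration, not the authors' method.

```python
def weighted_score(avg_score, n_reviews, global_avg, min_reviews=10):
    """Blend a beer's own average with the global average.

    Assumption: a Bayesian-style average; the original post does not
    state which weighting was actually used.
    """
    # The more reviews a beer has, the more its own average counts.
    weight = n_reviews / (n_reviews + min_reviews)
    return weight * avg_score + (1 - weight) * global_avg
```

With this choice, a beer without reviews gets the global average, and a beer with exactly `min_reviews` reviews lands halfway between its own average and the global one.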
Some information is readily available as features: think of the alcohol percentage, the rating or the beer style.
Other things, such as the flavors people taste, are not.
Which combination of flavors does well for a specific beer style?
A crisp New England IPA with passion fruit, or rather a NEIPA that brings out sweetness and mango?
To distill features from these many comments, we set to work with Natural Language Processing (NLP).
The goal is to obtain a set of X flavor features for each beer from Brewery het Uiltje, such as Bird of Prey and Dikke Lul 3 Bier.

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5737″ alignment=”” animation=”Fade In” border_radius=”none” box_shadow=”none” max_width=”75%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

We want to extract specific information from this large amount of unstructured data: the beer characteristics that are mentioned in the reviews.
This means that much of the text in a comment is of no use to us.
Take the review in which someone gives Uiltje’s beer GoodFeathers an 8/10: “Uiltje’s Brut IPA is somewhat sweet (taste something of apple), dry, more champagne-like than beer. Just the way a brut IPA should be.”
This comment contains words like “Uiltje,” “IPA,” “beer,” “brut” and “taste.”
We are not interested in these words, because in this case they say nothing about the consumer’s experience.
Words like “sweet,” “apple,” “dry” and “champagne,” however, are interesting.
Based on the rating in the review, values can be attached to these words.
To extract just the desired information from a comment, a number of operations must be performed on the data.
For this we use several Python packages: nltk, scikit-learn, pandas and numpy.
First, the comments, which are stored as strings, must be cleaned by filtering out all irrelevant words, also called stop words.
NLTK provides standard stop word lists for this in different languages.
We use the Dutch and English lists and supplement them with some case-specific stop words such as “beer,” “Uiltje,” “IPA” and “Craft.”
In Python, this looks like this:
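A minimal sketch of this step could look as follows. The function names and the tokenizing regular expression are assumptions for illustration. NLTK's stop word corpus has to be downloaded once with `nltk.download("stopwords")`; the small fallback list is only there so the sketch still runs without it.

```python
import re

# Case-specific stop words named in the text; extend as needed.
CUSTOM_STOP_WORDS = {"beer", "uiltje", "ipa", "craft"}


def load_stop_words(extra=CUSTOM_STOP_WORDS):
    """Combine NLTK's Dutch and English stop words with custom ones."""
    try:
        from nltk.corpus import stopwords
        base = set(stopwords.words("dutch")) | set(stopwords.words("english"))
    except (ImportError, LookupError):
        # NLTK or its stop word corpus is missing: use a tiny
        # hand-picked list (an assumption, far from complete).
        base = {"the", "a", "an", "is", "of", "than", "de", "het", "een", "en"}
    return base | {word.lower() for word in extra}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into words and drop the stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Single letters (such as the "s" left over from "Uiltje's") are dropped too.
    return [t for t in tokens if len(t) > 1 and t not in stop_words]
```

Calling `remove_stop_words(comment, load_stop_words())` on a review returns the remaining words as a list.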

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5733″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”50%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

This gives us a list of words with the stop words filtered out.
However, these are not yet features.
For example, some words will still be very similar, think of ‘hop’ and ‘hoppy’.
For those with NLP experience, the solution will be familiar: stemming.
Stemming is the reduction of words to their root, for example turning ‘playing’ into ‘play’ or ‘sweetish’ into ‘sweet’.
In addition, some words contain capital letters and others do not, some pieces of text contain characters other than letters, and so on.
So more preprocessing needs to be done on the comment texts.
We do this with the following code, using the previously defined function:
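As a hedged sketch, the preprocessing could combine lowercasing, removal of non-letter characters, stop word filtering and stemming in one function. The choice of NLTK's `SnowballStemmer` and the function signature are assumptions; the stop word set is passed in so the example stands on its own. Note that real stemmers do not collapse every pair of related words, so the result is worth inspecting.

```python
import re

from nltk.stem.snowball import SnowballStemmer


def preprocess(text, stop_words=frozenset(), language="english"):
    """Turn a raw comment into a list of stemmed, lowercase words."""
    # For many comments, create the stemmer once and reuse it.
    stemmer = SnowballStemmer(language)
    # Lowercase and keep runs of letters only (drops digits and punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop stop words, then reduce each remaining word to its stem.
    return [stemmer.stem(t) for t in tokens if t not in stop_words]
```

For Dutch comments, `language="dutch"` selects the Dutch Snowball stemmer.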

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5741″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”50%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

The review comments have now been filtered.
What remains is to identify the most frequently mentioned features for each beer.
In other words, the words need to be ranked, so that each word gets a relative measure of how “important” it is.
To do this, we use TF-IDF (Term Frequency – Inverse Document Frequency).
Quite a mouthful, but what does it mean?
Term Frequency (TF) is the number of times a specific word occurs in a review relative to the total number of words in that review.
Inverse Document Frequency (IDF) is the logarithm of, in this case, the number of reviews divided by the number of reviews in which a specific word is used.
TF-IDF is then the product of the two.
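To make the definition concrete, here is a small worked sketch in plain Python that follows the description above literally. The toy reviews and function names are made up for illustration, and the natural logarithm is an assumption. scikit-learn's implementation uses a smoothed and normalized variant by default, so its numbers will differ.

```python
import math


def term_frequency(word, review):
    """Share of the words in one review that equal `word`."""
    return review.count(word) / len(review)


def inverse_document_frequency(word, reviews):
    """Log of (number of reviews / number of reviews containing `word`).

    The word must occur in at least one review, otherwise this divides by zero.
    """
    containing = sum(1 for review in reviews if word in review)
    return math.log(len(reviews) / containing)


def tf_idf(word, review, reviews):
    return term_frequency(word, review) * inverse_document_frequency(word, reviews)


# Toy, already preprocessed reviews (made up for illustration).
reviews = [
    ["sweet", "apple", "dry"],
    ["sweet", "hoppy"],
    ["bitter", "hoppy", "hoppy"],
]
```

Here “hoppy” makes up two of the three words in the last review and occurs in two of the three reviews, so its score is 2/3 · log(3/2).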
With the theory covered, we can move on toward the data-driven beer.
Fortunately, scikit-learn has a number of TF-IDF tools available, and we take advantage of them.
For each beer, we want to determine a list of key words from its comments.
To do so, we create three functions: one that creates a ‘Bag of Words’ and converts it into a TF-IDF representation, one that determines the top words for a single piece of text, and one that determines the top words for a group of texts.

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5743″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”100%”][/vc_column_inner]

[image_with_animation image_url=”5745″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”100%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

These functions can be applied once the comments have been passed through the ‘preprocess’ function defined earlier.
Given a pandas data frame with reviews, this returns a list of data frames with the top TF-IDF words per beer.
For example, for the beer ‘Dr. Raptor’ from Uiltje, we find the following words as the highest-scoring TF-IDF words:

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5750″ alignment=”” animation=”Fade In” border_radius=”none” box_shadow=”none” max_width=”75%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

In code, this looks like this:

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5753″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”50%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

With this, we are there!
At least, we have managed to extract the most meaningful characteristics per beer from the comments.
The optimal beer is not there yet.
These flavor words, along with already known features like alcohol content, beer style, IBU and rating, can now serve as features in the remainder of the process.
We include the TF-IDF words of each beer as features, with the TF-IDF score as the value.
If a beer has no significant TF-IDF score for a particular word, that feature is given the value 0 for that beer.
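As an illustration of this layout, the sketch below pivots a long table of top words into one row per beer, filling missing words with 0, and then joins a known feature. The beers, scores and alcohol percentages are made-up placeholders.

```python
import pandas as pd

# Long format: one row per (beer, word) pair with its TF-IDF score.
top_words = pd.DataFrame({
    "beer": ["Beer A", "Beer A", "Beer B", "Beer B"],
    "word": ["citrus", "bitter", "mango", "citrus"],
    "tfidf": [0.72, 0.27, 0.81, 0.35],
})

# Wide format: one column per word; words a beer lacks get the value 0.
flavors = top_words.pivot_table(
    index="beer", columns="word", values="tfidf", fill_value=0.0
)

# Already known features (here only alcohol by volume, made-up values).
known = pd.DataFrame({"beer": ["Beer A", "Beer B"], "abv": [6.0, 5.0]}).set_index("beer")

features = flavors.join(known)
```

The resulting `features` frame has the columns `bitter`, `citrus`, `mango` and `abv`, with `bitter` equal to 0 for Beer B.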

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5754″ alignment=”” animation=”Fade In” border_radius=”none” box_shadow=”none” max_width=”75%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left” css=”.vc_custom_1621955489530{margin-top: 10% !important;}”]

So now what?
We start modeling!
We need to determine the flavor combination that the optimal beer should have.
We do this by feeding the data to a supervised regression model, namely a random forest.
Again we use scikit-learn, this time its RandomForestRegressor class.
We wrap it in a function of our own that determines and plots the optimal flavor ratios.
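The post does not spell out how the ratios are derived from the model. One plausible reading, sketched below under that assumption, is to fit the forest and normalize the feature importances of the flavor columns so they sum to 1. The function name, parameters and synthetic data are hypothetical, and plotting is left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def flavor_proportions(features, target, flavor_columns, random_state=0):
    """Fit a random forest and return normalized flavor importances.

    Assumption: feature importances stand in for the "flavor ratios".
    They show how strongly a flavor drives the prediction, not in
    which direction.
    """
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(features, target)
    importances = pd.Series(model.feature_importances_, index=features.columns)
    flavors = importances[flavor_columns]
    return (flavors / flavors.sum()).sort_values(ascending=False)


# Synthetic data in which "mango" drives the score most strongly.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 3)), columns=["mango", "sweet", "abv"])
y = 3 * X["mango"] + X["sweet"] + rng.normal(0, 0.1, 200)

proportions = flavor_proportions(X, y, ["mango", "sweet"])
```

A bar chart of `proportions` (for example with `proportions.plot.bar()`) would give a picture comparable to the one described here.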

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5755″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”50%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

The model output represents the proportions of the best possible flavor combination.
What matters here is that these are proportions: the highest-scoring characteristic should not be overdone.
It is the combination and ratio of flavors that makes the beer optimal.
This ratio is passed on to Brewery het Uiltje as input for the recipe.
The flavors for a New England IPA are shown below.

[/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

[image_with_animation image_url=”5756″ alignment=”” animation=”Fade In” border_radius=”10px” box_shadow=”small_depth” max_width=”100%”][/vc_column_inner][/vc_row_inner][vc_row_inner column_margin=”default” text_align=”left”]

That completes the technical part!
The data-driven beer is now in the hands of the brewers.
In the coming period they will work to make the beer match the results of the data as closely as possible.
Are you curious about this New England IPA?
Keep an eye on us for updates on the production process!

[/vc_column_inner][/vc_row_inner]

[vc_row type=”full_width_background” full_screen_row_position=”middle” equal_height=”yes” content_placement=”middle” scene_position=”center” text_color=”dark” text_align=”left” top_padding=”5%” overlay_strength=”0.3″ shape_divider_position=”bottom” bg_image_animation=”none” shape_type=””]

[vc_row_inner column_margin=”default” text_align=”left” css=”.vc_custom_1615891372117{margin-left: 5% !important;}”]

[nectar_highlighted_text highlight_color=”#00c588″ style=”half_text”]

Questions?

[/nectar_highlighted_text][/vc_column_inner][/vc_row_inner]

[vc_row_inner column_margin=”default” text_align=”left”]

[vc_custom_heading text=”Boudewijn Gresnigt” font_container=”tag:h4|text_align:left” use_theme_fonts=”yes”][vc_custom_heading text=”boudewijn.gresnigt@datasciencelab.nl” font_container=”tag:p|text_align:left” google_fonts=”font_family:Raleway%3A100%2C200%2C300%2Cregular%2C500%2C600%2C700%2C800%2C900|font_style:400%20regular%3A400%3Anormal” link=”url:mailto%3Aboudewijn.gresnigt%40datasciencelab.nl|||” css=”.vc_custom_1615891099905{margin-top: -2% !important;}”][vc_custom_heading text=”+316 28 47 67 67″ font_container=”tag:p|text_align:left” google_fonts=”font_family:Raleway%3A100%2C200%2C300%2Cregular%2C500%2C600%2C700%2C800%2C900|font_style:400%20regular%3A400%3Anormal” link=”url:tel%3A%2B31628476767|||” css=”.vc_custom_1615891040783{margin-top: -7% !important;}”][/vc_column_inner][/vc_row_inner]
