Introduction Skippa

Written by DSL

Published on December 30, 2021

Summary

Most data scientists are familiar with pandas and scikit-learn.
The usual workflow starts with data cleaning in pandas, continues with further preprocessing in pandas or with scikit-learn transformers such as StandardScaler and OneHotEncoder, and ends with fitting a machine learning algorithm from scikit-learn.
This workflow has some problems:

1. The development phase is quite complex and requires a lot of code.
2. It is difficult to reproduce the workflow for prediction once the model is deployed.
3. Existing solutions to reduce these problems are not good enough (yet).

Skippa is a package designed to:

- drastically simplify development
- pack all data cleaning and pre-processing, along with the algorithm, into a single pipeline file
- reuse the interfaces of pandas and scikit-learn, which you are already familiar with
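To make the first two problems concrete, here is a minimal sketch of the conventional workflow done by hand. The toy data, the column names (`age`, `city`, `label`) and the helper `predict_new` are all made up for illustration; nothing here uses Skippa.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, invented purely for illustration.
train = pd.DataFrame({
    "age": [25.0, 32.0, None, 47.0, 51.0, 38.0],
    "city": ["A", "B", "A", None, "B", "A"],
    "label": [0, 0, 1, 1, 1, 0],
})

# Development: each step learns something from the training data
# that has to be remembered for prediction time.
age_median = train["age"].median()
city_mode = train["city"].mode()[0]

X_train = train[["age", "city"]].copy()
X_train["age"] = X_train["age"].fillna(age_median)
X_train["city"] = X_train["city"].fillna(city_mode)

scaler = StandardScaler()
X_train["age"] = scaler.fit_transform(X_train[["age"]]).ravel()

X_train = pd.get_dummies(X_train, columns=["city"], dtype=float)
dummy_columns = list(X_train.columns)

model = LogisticRegression().fit(X_train, train["label"])


def predict_new(new_rows: pd.DataFrame):
    """Repeat the training-time preprocessing by hand, then predict."""
    X_new = new_rows[["age", "city"]].copy()
    X_new["age"] = X_new["age"].fillna(age_median)
    X_new["city"] = X_new["city"].fillna(city_mode)
    X_new["age"] = scaler.transform(X_new[["age"]]).ravel()
    X_new = pd.get_dummies(X_new, columns=["city"], dtype=float)
    # Without this, a category absent from the new rows would drop a column.
    X_new = X_new.reindex(columns=dummy_columns, fill_value=0.0)
    return model.predict_proba(X_new)


new_rows = pd.DataFrame({"age": [29.0, None], "city": ["B", None]})
probabilities = predict_new(new_rows)
```

Note how `age_median`, `city_mode`, `scaler`, `dummy_columns` and `model` are five separate pieces of state that must all be carried over to wherever predictions are made, and how every preprocessing step is written twice.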

Skippa helps you easily define data cleaning and pre-processing transformations.
It works roughly as follows:

from skippa import Skippa, columns
from sklearn.linear_model import LogisticRegression

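# get_training_data is a placeholder for your own data-loading code;
# X holds the features as a pandas DataFrame, y the matching labels.
X, y = get_training_data(...)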

pipeline = (
    Skippa()
    .impute(columns(dtype_include='object'), strategy='most_frequent')
    .impute(columns(dtype_include='number'), strategy='median')
    .scale(columns(dtype_include='number'), type='standard')
    .onehot(columns(['category1', 'category2']))
    .model(LogisticRegression())
)
pipeline.fit(X, y)

predictions = pipeline.predict_proba(X)
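For comparison, roughly the same steps can be written in plain scikit-learn and stored in a single file. This is only a sketch under assumptions: it is not Skippa's implementation, the toy data is invented, the column name `amount` is hypothetical, and `pickle` is used here simply as one standard way to persist a fitted pipeline.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented purely for illustration.
X = pd.DataFrame({
    "amount": [10.0, 12.5, np.nan, 40.0, 38.0, 11.0],
    "category1": ["a", "b", "a", np.nan, "b", "a"],
    "category2": ["x", "x", "y", "y", "x", "x"],
})
y = [0, 0, 0, 1, 1, 0]

# Numeric columns: median imputation followed by standard scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation followed by one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, make_column_selector(dtype_include="number")),
    ("cat", categorical_steps, ["category1", "category2"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Persist preprocessing and model together, then reload for prediction.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pipeline.pkl"
    path.write_bytes(pickle.dumps(pipeline))
    restored = pickle.loads(path.read_bytes())

new_rows = pd.DataFrame({
    "amount": [np.nan, 39.0],
    "category1": ["b", "c"],  # "c" was never seen during training
    "category2": ["y", "x"],
})
probabilities = restored.predict_proba(new_rows)
```

The nested `Pipeline` and `ColumnTransformer` objects do the job, but they take noticeably more code than the chained calls above. Only unpickle files from sources you trust.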

☝️ Skippa does not claim to solve every problem, does not cover all the functionality you might ever need, and is not a highly scalable solution, but it should greatly simplify more than 80% of regular pandas/scikit-learn-based machine learning projects.

Links

You can read the rest of the blog here: Introduction Skippa [ENG]