Introduction Skippa

Written by DSL
Published on December 30, 2021

Summary

Any data scientist is probably familiar with pandas and scikit-learn.
The usual workflow starts with data cleaning in pandas, continues with further preprocessing using pandas or scikit-learn transformers such as StandardScaler and OneHotEncoder, and ends with fitting a machine learning algorithm from scikit-learn.
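
For comparison, here is a minimal sketch of that usual workflow written with plain pandas and scikit-learn. The toy data and the column names (age, income, category1) are made up for illustration and do not come from a real project; the point is only to show how much wiring the standard approach needs for a few common steps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a real training set.
X = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 38.0],
    "income": [30.0, 45.0, 52.0, np.nan, 80.0, 61.0],
    # Stored explicitly as object dtype, with one missing value.
    "category1": pd.Series(["a", "b", np.nan, "a", "b", "a"], dtype=object),
})
y = pd.Series([0, 0, 1, 1, 1, 0])

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own sub-pipeline.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age", "income"]),
    ("cat", categorical_steps, ["category1"]),
])

# Chain preprocessing and the model so both are fitted together.
baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
baseline.fit(X, y)

probabilities = baseline.predict_proba(X)
```

Even for three columns this takes two sub-pipelines, a ColumnTransformer and an outer Pipeline, and any cleaning done in pandas beforehand would still live outside of it.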
Now there are some problems with this workflow:

1. The development phase of your workflow is quite complex and requires a lot of code.
2. It is difficult to reproduce the workflow for making predictions in the implementation phase.
3. Existing solutions to reduce these problems are not good enough (yet).

Skippa is a package designed to:

  • drastically simplify development
  • pack all data cleaning and pre-processing along with the algorithm into a single pipeline file
  • reuse the interface from pandas and scikit-learn, which you are already familiar with

Skippa helps you easily define data cleaning and pre-processing transformations.
It works roughly as follows:

from skippa import Skippa, columns
from sklearn.linear_model import LogisticRegression

X, y = get_training_data(...)

pipeline = (
    Skippa()
    .impute(columns(dtype_include='object'), strategy='most_frequent')
    .impute(columns(dtype_include='number'), strategy='median')
    .scale(columns(dtype_include='number'), type='standard')
    .onehot(columns(['category1', 'category2']))
    .model(LogisticRegression())
)
pipeline.fit(X, y)

predictions = pipeline.predict_proba(X)

☝️ Skippa does not presume to solve all problems, does not cover all the functionality you might ever need, and is not a highly scalable solution, but it should provide a huge simplification for more than 80% of regular pandas/sklearn-based machine learning projects.
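
One of the goals mentioned above is to pack preprocessing and the algorithm into a single pipeline file. This summary does not show how Skippa itself does that, so the sketch below illustrates the general idea with a plain scikit-learn Pipeline and the standard-library pickle module. The toy data and the file name pipeline.pkl are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features, binary target.
X_train = np.array([
    [1.0, 20.0], [2.0, 30.0], [3.0, 10.0],
    [4.0, 40.0], [5.0, 25.0], [6.0, 35.0],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Preprocessing and model live in one object, so they are fitted together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.pkl"  # hypothetical file name

    # Development phase: write the fitted pipeline to a single file.
    model_path.write_bytes(pickle.dumps(pipeline))

    # Implementation phase: load the file and predict on new, raw data.
    restored = pickle.loads(model_path.read_bytes())

X_new = np.array([[2.5, 22.0], [5.5, 33.0]])
same_predictions = np.allclose(
    pipeline.predict_proba(X_new),
    restored.predict_proba(X_new),
)
```

Because the scaler is stored together with the model, the restored object applies exactly the same preprocessing to new data. Pickle files should only be loaded from trusted sources, preferably with the same library versions that created them.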

Links

You can read the rest of the blog here > Introduction Skippa [ENG]
