Data Engineering

PySpark basics: “upserting” data on Databricks

Written by
DSL
Published on
June 10, 2025

In this blog, Tim (Data Engineer) explains in an accessible way what the transition from pandas to PySpark looks like for data engineers and data scientists. Although the two libraries have similarities in terms of syntax, there are important conceptual differences, especially when it comes to processing large amounts of data via distributed computing.
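
As a rough illustration of that syntactic overlap, here is a minimal sketch comparing the same aggregation in pandas and PySpark. The file and column names below are made up for illustration and are not taken from the article:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    # pandas: runs eagerly, entirely in the memory of a single machine
    pdf = pd.read_csv("sales.csv")
    pandas_result = pdf.groupby("region")["amount"].sum()

    # PySpark: the same logic, but distributed over the cluster's executors
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
    spark_result = sdf.groupBy("region").agg(F.sum("amount").alias("total_amount"))
    spark_result.show()  # only at this point does Spark actually read and process the data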


Tim describes how Apache Spark, the engine behind PySpark, works and how it uses a driver and executors to process data in parallel. In PySpark you work with DataFrames that are divided into partitions, which is essential for scalability. Furthermore, Tim discusses the difference between transformations and actions, the importance of lazy evaluation, and why certain operations, such as wide transformations, are much heavier than others. Finally, he gives a practical example of a common task in data pipelines: upserting data in a source table using PySpark within Databricks (a sketch of such an upsert follows after the key insights below).
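
To make the transformation/action distinction concrete, here is a small sketch; the table and column names are assumptions for illustration only:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.table("sales")  # hypothetical source table

    # Transformations only build up a logical plan; nothing is executed yet
    filtered = df.filter(F.col("amount") > 0)         # narrow: handled within each partition
    per_region = filtered.groupBy("region").count()   # wide: needs a shuffle across executors

    # An action triggers the whole plan; Spark can now optimize and execute it end to end
    per_region.show()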


Key insights:

  • PySpark is the Python interface for Apache Spark.
  • Spark is optimized for big data and uses distributed computing.
  • Lazy evaluation allows Spark to determine the most efficient execution strategy.
  • PySpark is ideal when your data no longer fits on one machine.
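
The practical example mentioned above, upserting data on Databricks, is typically done with Delta Lake's merge API. Below is a minimal sketch under that assumption; the table names "customers" and "staging_customers" and the key column "customer_id" are illustrative, not taken from the article:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    updates_df = spark.read.table("staging_customers")   # new and changed records (assumed staging table)

    target = DeltaTable.forName(spark, "customers")       # assumed Delta target table
    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()      # existing keys: overwrite with the new values
        .whenNotMatchedInsertAll()   # new keys: insert as new rows
        .execute()
    )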


👉 Read the full article on Medium: From pandas to PySpark – Tim Winter

Questions? Please contact us

