Data Engineering

PySpark basics: “upserting” data on Databricks

Written by DSL
Published on June 10, 2025

In this blog, Tim (Data Engineer) explains in an accessible way what the transition from pandas to PySpark looks like for data engineers and data scientists. Although the two libraries have similar syntax, there are important conceptual differences, especially when it comes to processing large amounts of data via distributed computing.


The article describes how Apache Spark, the engine behind PySpark, works and how it uses a driver and executors to process data in parallel. In PySpark, you work with DataFrames that are divided into partitions, which is essential for scalability. Tim also discusses the difference between transformations and actions, the importance of lazy evaluation, and why certain operations, such as wide transformations, are much heavier than others: they require data to be shuffled between executors. Finally, he gives a practical example of a common task in data pipelines: upserting data (updating existing rows and inserting new ones) in a source table using PySpark within Databricks.
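The full article contains Tim's own implementation. As a rough illustration of what an upsert can look like, here is a minimal sketch that assumes the target is a Delta table and uses the Delta Lake `DeltaTable` merge API. The function names `merge_condition` and `upsert`, the aliases `t` and `s`, and the table and column names in the usage note are hypothetical and not taken from the article.

```python
def merge_condition(key_cols, target_alias="t", source_alias="s"):
    """Build the condition that decides whether a source row matches a target row."""
    if not key_cols:
        raise ValueError("at least one key column is required")
    return " AND ".join(
        f"{target_alias}.{col} = {source_alias}.{col}" for col in key_cols
    )


def upsert(spark, updates_df, target_table, key_cols):
    """Update matching rows in target_table and insert the rest (assumes a Delta table)."""
    # Imported here so merge_condition stays usable without Delta Lake installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, target_table)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), merge_condition(key_cols))
        .whenMatchedUpdateAll()      # key already exists: overwrite the row
        .whenNotMatchedInsertAll()   # key is new: insert the row
        .execute()
    )
```

In a Databricks notebook, where `spark` is already defined, a call could look like `upsert(spark, updates_df, "sales.customers", ["customer_id"])`. This assumes `updates_df` has the same columns as the target table, because `whenMatchedUpdateAll` and `whenNotMatchedInsertAll` copy every column by name.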


Key insights:

  • PySpark is the Python interface for Apache Spark.
  • Spark is optimized for big data and uses distributed computing.
  • Lazy evaluation allows Spark to determine the most efficient execution strategy.
  • PySpark is ideal when your data no longer fits on one machine.
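
To make the lazy evaluation point concrete, the sketch below separates building a query plan (transformations only) from running it (an action). It is an illustration under assumptions, not code from the article: the function names `build_plan` and `run_demo` and the columns `country` and `amount` are made up for this example.

```python
def build_plan(df):
    """Chain transformations only; Spark does not process any data here."""
    # Narrow transformations: each partition can be handled on its own.
    filtered = df.filter("amount > 100").select("country", "amount")
    # Wide transformation: grouping needs rows with the same key together,
    # so Spark has to shuffle data between executors.
    # Note: count() on grouped data is a transformation, unlike DataFrame.count().
    return filtered.groupBy("country").count()


def run_demo(spark):
    """Build a tiny DataFrame, inspect the plan, then trigger execution."""
    df = spark.createDataFrame(
        [("NL", 150.0), ("NL", 80.0), ("BE", 300.0)],
        ["country", "amount"],
    )
    plan = build_plan(df)   # still lazy: only a plan exists
    plan.explain()          # prints the physical plan without running the job
    return plan.collect()   # action: Spark now executes the whole plan
```

On Databricks you can call `run_demo(spark)` directly in a notebook. Outside Databricks, a session can be created first with `SparkSession.builder.getOrCreate()` from `pyspark.sql`.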


👉 Read the full article on Medium: From pandas to PySpark – Tim Winter
