Data Engineering

PySpark basics: “upserting” data on Databricks

Written by DSL
Published on June 10, 2025

In this blog, Tim (Data Engineer) explains in an accessible way what the transition from pandas to PySpark looks like for data engineers and data scientists. Although the two libraries have similar syntax, there are important conceptual differences, especially when it comes to processing large amounts of data via distributed computing.


Tim describes how Apache Spark, the engine behind PySpark, works and how it uses a driver and executors to process data in parallel. In PySpark, you work with DataFrames that are divided into partitions, which is essential for scalability. He also discusses the difference between transformations and actions, the importance of lazy evaluation, and why certain operations, such as wide transformations that shuffle data between partitions, are much heavier than others. Finally, he gives a practical example of a common task in data pipelines: upserting data (updating existing rows and inserting new ones) in a source table using PySpark within Databricks.
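
To make the upsert idea concrete, here is a minimal sketch of its semantics in plain Python. It is not the code from Tim's article and it does not use Spark: the `upsert` helper, the `id` key and the sample rows are all hypothetical, and the rows are ordinary lists of dicts rather than DataFrames. It only shows what "update if the key exists, insert otherwise" means.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def upsert(target: Iterable[Row], updates: Iterable[Row], key: str) -> List[Row]:
    """Return a new list: rows with a matching key are replaced, new keys are appended.

    Hypothetical helper for illustration only; not a PySpark or Databricks API.
    """
    # Index the existing rows by key (copies, so the input is left untouched).
    merged: Dict[object, Row] = {row[key]: dict(row) for row in target}
    for row in updates:
        # Existing key -> the whole row is overwritten; unseen key -> the row is added.
        merged[row[key]] = dict(row)
    return list(merged.values())


# Made-up sample data.
target = [
    {"id": 1, "name": "Alice", "city": "Utrecht"},
    {"id": 2, "name": "Bob", "city": "Leiden"},
]
updates = [
    {"id": 2, "name": "Bob", "city": "Delft"},    # existing key: update
    {"id": 3, "name": "Carol", "city": "Breda"},  # new key: insert
]

result = upsert(target, updates, key="id")
for row in result:
    print(row)
```

On Databricks the same logic is usually expressed as a merge on a Delta table rather than hand-written like this; the full article covers the PySpark version.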


Key insights:

  • PySpark is the Python interface for Apache Spark.
  • Spark is optimized for big data and uses distributed computing.
  • Lazy evaluation allows Spark to determine the most efficient execution strategy.
  • PySpark is ideal when your data no longer fits on one machine.
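
The lazy-evaluation insight above can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not the PySpark API: the `LazyFrame` class and its methods are invented for this example. Transformations only record a step in a plan, and nothing runs until an action such as `collect()` is called. Real Spark goes further and optimizes the recorded plan before executing it; this toy only defers the work.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, Callable]


class LazyFrame:
    """Toy model of lazy evaluation (hypothetical; NOT a Spark class)."""

    def __init__(self, rows: List[Row], plan: Tuple[Step, ...] = ()):
        self._rows = rows
        self._plan = plan

    # Transformations: return a new LazyFrame with a longer plan, do no work.
    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name: str, fn: Callable[[Row], object]) -> "LazyFrame":
        def add(row: Row) -> Row:
            return {**row, name: fn(row)}
        return LazyFrame(self._rows, self._plan + (("map", add),))

    # Actions: execute the recorded plan and return a concrete result.
    def collect(self) -> List[Row]:
        rows = self._rows
        for kind, fn in self._plan:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def count(self) -> int:
        return len(self.collect())


calls: List[int] = []


def is_large(row: Row) -> bool:
    calls.append(row["amount"])  # record every time the predicate really runs
    return row["amount"] > 100


orders = LazyFrame([
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 150},
    {"id": 3, "amount": 300},
])

# Two transformations: only the plan grows, the predicate has not run yet.
planned = orders.filter(is_large).with_column("double_amount", lambda r: r["amount"] * 2)
calls_before_action = len(calls)

# The action triggers the actual computation.
result = planned.collect()
print(calls_before_action, len(calls), result)
```

Comparing `calls_before_action` with `len(calls)` after `collect()` shows that the predicate runs only once an action is requested.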


👉 Read the full article on Medium: From pandas to PySpark – Tim Winter
