Data Engineering

PySpark basics: upserting data on Databricks

Written by DSL
Published on June 10, 2025

In this blog, Tim (Data Engineer) explains in an accessible way what the transition from pandas to PySpark looks like for data engineers and data scientists. Although the two libraries have similarities in terms of syntax, there are important conceptual differences, especially when it comes to processing large amounts of data via distributed computing.


Tim describes how Apache Spark, the engine behind PySpark, works: a driver coordinates the work and executors process the data in parallel. In PySpark you work with DataFrames divided into partitions, which is essential for scalability. He also discusses the difference between transformations and actions, the importance of lazy evaluation, and why certain operations, such as wide transformations, are much heavier than others. Finally, he gives a practical example of a common task in data pipelines: upserting data into a table using PySpark within Databricks (a minimal sketch of such a merge follows after the key insights below).
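
To make the difference between transformations and actions concrete, here is a minimal sketch (not taken from Tim's article). It assumes a Databricks environment, where the spark session is already available, and a hypothetical table called sales:

from pyspark.sql import SparkSession, functions as F

# On Databricks the `spark` session already exists; elsewhere you create one like this
spark = SparkSession.builder.getOrCreate()

# Reading a (hypothetical) table only registers the source; no data is loaded yet
df = spark.read.table("sales")

# Transformations: these only extend the logical plan (lazy evaluation)
filtered = df.filter(F.col("amount") > 0)
per_customer = filtered.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# An action: only now does Spark optimize the plan and run it on the executors
per_customer.show(5)

# explain() prints the physical plan Spark chose, including any shuffles
# caused by wide transformations such as groupBy
per_customer.explain()

The groupBy above is a wide transformation: it forces Spark to shuffle data between executors, which is why such operations are much heavier than narrow ones like filter.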


Key insights:

  • PySpark is the Python interface for Apache Spark.
  • Spark is optimized for big data and uses distributed computing.
  • Lazy evaluation allows Spark to determine the most efficient execution strategy.
  • PySpark is ideal when your data no longer fits on one machine.

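As a sketch of the practical task the article closes with, below is one common way to upsert data on Databricks: a MERGE on a Delta table via the Delta Lake Python API. The table name customers, the key column customer_id and the incoming DataFrame updates_df are hypothetical and only serve as illustration:

from delta.tables import DeltaTable

# Target Delta table (assumed to exist) and a DataFrame `updates_df` with new and changed rows
target = DeltaTable.forName(spark, "customers")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # rows that already exist are updated
    .whenNotMatchedInsertAll()   # new rows are inserted
    .execute()
)

The same upsert can also be expressed as a MERGE INTO statement in Spark SQL; the Python API shown here keeps the logic inside the PySpark pipeline.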

👉 Read the full article on Medium: From pandas to PySpark – Tim Winter

Questions? Please contact us
