Run production Spark jobs and train ML models on Databricks Job Clusters.

Photo by Myriam Jessier on Unsplash

In this post, I'd like to describe my experience running production Spark and ML jobs on Databricks job clusters.
So far, I'm a big fan of Databricks: I find its solutions much better than the alternatives I've used, and no, I'm not a Databricks employee.

In my day job, I develop an automated AI-based predictive analytics platform that simplifies and speeds up the process of building and deploying predictive models.
I've spent the last two years building data and ML pipelines that include data cleaning, structuring, feature engineering, training, evaluation, prediction, and monitoring jobs.

Let's begin with the most important point: using Spark's caching feature correctly is essential. If we don't cache in the right places (or don't cache at all), we can cause severe performance issues.

Sometimes, even when we truly understand a concept, it's hard to keep it in mind with every line of code we write. This is especially true for code that is evaluated lazily, as Spark transformations are.

It’s vital for us to “develop” the instinct of caching in the right places. …

Nofar Mishraki

ML Engineer at Pecan | Ex-Unit 8200 | M.Sc. Computer Science
