In this post I would like to describe my experience executing production Spark and ML jobs on Databricks job clusters.
So far, I’m a big fan of Databricks solutions, since I find them so much better than the alternatives I’ve used, and no, I’m not a Databricks employee.
In my day job, I develop an automated AI-based predictive analytics platform that simplifies and speeds up the process of building and deploying predictive models.
I’ve spent the last two years building data and ML pipelines that include data cleaning, structuring, feature engineering, training, evaluation, prediction, and monitoring jobs.
Let’s begin with the most important point: using Spark’s caching feature correctly is critical. If we don’t cache in the right places (or don’t cache at all), we can cause severe performance issues.
Sometimes, even if we really understand a concept, it’s challenging to be aware of it with every line of code we write. This is especially true when we are talking about code that is executed in lazy evaluation mode.
It’s vital for us to “develop” the instinct of caching in the right places. …
ML Engineer at Pecan | Ex-Unit 8200 | M.Sc. Computer Science