Introduced in 2017, Azure Databricks provides a powerful platform for operating big data pipelines. If you’ve invested in a data lake built on Azure Data Lake Store (ADLS) Gen1 or Gen2, Azure Databricks is the solution of choice for processing that data. It combines the power of Apache Spark with a friendly polyglot notebook interface, Git integration, and managed clusters. In addition, it natively integrates with Azure Active Directory and Azure storage services.
Unlike some other Azure offerings such as Logic Apps, Azure Databricks is code-based. This reflects Spark’s primary user base: data engineers and data scientists. Being code-based enables incredible flexibility; however, it also comes with the risk of generating tech debt, especially as the platform continues to add features.
Are you implementing Azure Databricks on top of Azure Data Lake Store? Here are some best practices collected from our experience.
Parameterize notebooks as much as possible
Parameters are crucial for maximizing abstraction and encapsulation. They enable you to write less code to do more. Databricks notebooks can take in parameters using widgets. These parameters allow you to drive a data pipeline using common patterns. In addition, you can pass in parameter values from Azure Data Factory.
Some examples of use cases include:
- Creating a single set of notebooks for all facts in a dimensional model, passing in each fact’s grain and measures as parameters
- Using a single set of notebooks to load development, test, and production environments
- Switching between incremental and full extracts for large data sources
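As a sketch of the pattern, a notebook might declare widgets for the environment and target table. The widget names and path layout here are hypothetical, and the `try`/`except` fallback simply lets the snippet run outside a Databricks runtime, where `dbutils` is not defined:

```python
# Read notebook parameters via Databricks widgets.
# Widget names ("env", "table_name") and the path layout are hypothetical.
try:
    dbutils.widgets.text("env", "dev")
    dbutils.widgets.text("table_name", "fact_sales")
    env = dbutils.widgets.get("env")
    table_name = dbutils.widgets.get("table_name")
except NameError:
    # Outside a Databricks runtime, fall back to the same defaults.
    env, table_name = "dev", "fact_sales"

# One notebook can now serve every environment and every table.
source_path = f"/{env}/curated/{table_name}"
```

From Azure Data Factory, the same values can be supplied through the Databricks Notebook activity’s base parameters, so the orchestrator decides which environment and table each run targets.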
Use Azure Key Vault-backed secrets for credentials
When using Azure Databricks, you’ll need to authenticate to external data sources. When working in a private notebook, you might do this by hard-coding a service principal’s credentials into a cell in the notebook. In production, however, you’ll want to authenticate securely, without exposing credential information.
Databricks offers secrets, which let you manage sensitive information such as credentials. Additionally, Azure Databricks can read secrets stored in an Azure Key Vault. This means that if you’re already using an Azure Key Vault to store sensitive information, you can use that same resource for the credentials Azure Databricks needs to access external data sources.
For more information, see https://docs.azuredatabricks.net/user-guide/secrets/index.html.
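A minimal sketch of reading a credential from a Key Vault-backed secret scope — the scope and key names are hypothetical, and the environment-variable fallback exists only so the snippet runs outside a Databricks runtime:

```python
import os

def get_secret(scope: str, key: str) -> str:
    """Fetch a credential from a Key Vault-backed secret scope.

    Scope/key names ("kv-scope", "sp-client-secret") are hypothetical.
    """
    try:
        return dbutils.secrets.get(scope=scope, key=key)  # Databricks runtime
    except NameError:
        # Local fallback for testing: read an environment variable instead.
        return os.environ.get(key.replace("-", "_").upper(), "")
```

The secret value itself is redacted in notebook output, so it never appears in plain text in results or logs.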
Take advantage of file partitioning
Partitioning is a key strategy for storing big data. A common approach is to partition files by date. If your larger data sets are partitioned appropriately, take advantage of this fact to optimize your processing.
For example, you can keep a file or table that tracks watermarks indicating the most recent date extracted from the data lake. Using this watermark, you can eliminate unnecessary reads, which reduces your costs and decreases your pipeline’s run time.
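A small sketch of this idea, assuming a hypothetical one-folder-per-day layout: compute only the date partitions newer than the stored watermark, then read just those paths.

```python
from datetime import date, timedelta

# Hypothetical layout: one folder per day, e.g. sales/date=2019-06-02.
def partitions_to_read(watermark: date, today: date) -> list:
    """Return the date partitions added since the last recorded watermark."""
    days = (today - watermark).days
    return [
        f"sales/date={watermark + timedelta(days=i):%Y-%m-%d}"
        for i in range(1, days + 1)
    ]
```

The resulting paths can be handed to a Spark reader so only new data is scanned; after a successful load, advance the watermark to today’s date.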
Default to direct access for Azure storage
The two main ways to access files and folders in ADLS are direct access and mounting. Mounting creates a pointer in the Databricks File System (DBFS) to the specified container or directory.
To limit unnecessary exposure of files and folders in the data lake, use direct access as the default mode of access from Azure Databricks.
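As an illustration of direct access, a Spark session can be configured with a service principal’s credentials (ideally retrieved from secrets, per the earlier practice) using the ABFS OAuth settings for ADLS Gen2. The storage account, application, and tenant values below are hypothetical:

```python
def adls_oauth_conf(account: str, client_id: str,
                    client_secret: str, tenant_id: str) -> dict:
    """Build the Spark settings for direct OAuth access to an ADLS Gen2 account.

    All argument values are supplied by the caller; the account name here
    is hypothetical.
    """
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
```

Apply each key/value pair with `spark.conf.set`, then read and write using full `abfss://<container>@<account>.dfs.core.windows.net/<path>` URIs — no mount point required.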
Only create a mount point if:
- You want all users in the workspace to have access to the mounted container or directory (this is an unavoidable consequence of mounting)
- You need to call filesystem methods on the directory that aren’t available through direct access (for example, when a library only accepts local file paths)
Some examples of use cases for mounting a data lake directory include:
- Making it convenient to access a shared log directory with many subdirectories by mounting it at the beginning of an ETL pipeline
- Generating non-partitioned files from Azure Databricks and writing to disk (e.g. creating a single JSON document)
Use the shared folder in your workspace
When a user creates a new notebook, it’s placed in their personal folder by default. Not everyone is aware that a “Shared” folder exists as well. Use this folder for:
- Code that will operate in production, such as notebooks that are part of your ETL pipeline
- Collaborative notebooks
Deliberately moving a notebook into the Shared folder makes it clear that the notebook is not simply a personal experiment.
Databricks is a powerful platform that adds collaborative workspaces and managed compute resources to Apache Spark. Azure Databricks adds integration with Azure security and storage on top of that. However, power and flexibility come with a certain degree of risk – especially the risk of tech debt.
In this post, I described five best practices when using Azure Databricks with Azure Data Lake Store as a data source:
- Parameterize notebooks as much as possible
- Use Azure Key Vault-backed secrets for credentials
- Take advantage of file partitioning
- Default to direct access for Azure storage
- Use the shared folder in your workspace
These are recommendations that can help ensure that your Azure Databricks project translates well to a production solution. If you’d like help implementing a significant data pipeline with Azure Databricks, please reach out: firstname.lastname@example.org