This article gives an overview of using Azure Databricks to scale out ETL operations, connecting to Azure Data Factory, and writing data to database systems.
It reviews the capabilities that make Databricks such a scalable transformation tool in the data ingestion and analytics landscape.
(Please also see a previous article on using Azure Data Lake Storage and Databricks.)
As organizations grow, so does their data. Inevitably, extract-transform-load (ETL) operations become more necessary, and they can grow cumbersome if not scaled with the correct tools.
Azure Data Factory paired with Azure Databricks is one possible solution to this expanding landscape.
Why Azure Databricks?
Databricks’ integration with Azure allows it to use and connect to cloud resources entirely within the Azure platform:
- Azure Data Factory activity
- Apache Spark clusters
- Language support for Scala, Spark SQL, Python, and R
- Connections to Azure Blob Storage, Azure Data Lake, Azure SQL Server, Azure SQL Data Warehouse, Snowflake, and others
What This All Means
Databricks provides a scalable layer in Azure ETL architecture with flexible language support for developers and analytics.
It provides the opportunity to move data between multiple technologies within the architecture using one tool.
Scale With Azure Data Factory
What does this mean?
Databricks can be used to scale out Data Factory operations for dynamic task completion, running whatever code is necessary within a Databricks notebook.
These tasks complete more quickly when run in parallel.
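As a rough sketch of this dynamic task completion (the function and parameter names here are hypothetical, not part of the Data Factory API), a notebook triggered by a Data Factory Notebook activity can branch on a parameter it receives; in a real Databricks notebook the parameter would arrive via `dbutils.widgets.get`:

```python
# Hypothetical sketch of a parameter-driven notebook task.
# In a real Databricks notebook the parameter would be read with
# dbutils.widgets.get("task"); here it is passed as a plain argument.

def run_task(params: dict) -> str:
    """Dispatch a transformation step based on a pipeline parameter."""
    handlers = {
        "clean": lambda: "cleaned dataset",
        "enrich": lambda: "enriched dataset",
    }
    # Fall back to a no-op when the parameter is unrecognized.
    return handlers.get(params.get("task"), lambda: "no-op")()

# Each Data Factory activity instance would pass its own parameters:
print(run_task({"task": "clean"}))
print(run_task({"task": "enrich"}))
```

Each activity instance in the pipeline can pass different parameters, so one notebook serves many tasks.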
Apache Spark is the computing engine that Databricks runs on. In the background, notebooks compute against a Spark cluster, and clusters scale to the amount of computing power needed.
Spark excels at distributed computing and big data processing.
Combining this with Azure Data Factory makes it possible to run parallel big data operations in Azure without the infrastructure traditionally required to achieve the same data ingestion goals.
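The fan-out pattern described above can be illustrated in plain Python. To be clear about assumptions: on a real cluster, Spark itself distributes the work, and Data Factory triggers the notebook runs; this standard-library sketch with a hypothetical `transform` function only shows the shape of running independent tasks in parallel:

```python
# Illustrative fan-out: run several independent ETL tasks concurrently,
# the way Data Factory can trigger multiple notebook runs in parallel.
# (On a real cluster, Spark distributes the work; this sketch only
# shows the pattern using Python's standard library.)
from concurrent.futures import ThreadPoolExecutor

def transform(source: str) -> str:
    """Hypothetical per-source transformation step."""
    return f"{source}: transformed"

sources = ["sales", "inventory", "web_logs"]

# map() preserves input order even though tasks run concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transform, sources))

print(results)
```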
Files can be stored in the Databricks File System for access during notebook operations.
Data can also be stored natively within a Databricks cluster as tables registered in the Hive metastore. This native storage is useful for passing information between notebooks and cells when necessary.
Developers and data scientists grow accustomed to using specific languages for computing, data mining, and analytics. Databricks has flexibility here by interpreting Scala, Python, R, and Spark SQL in its notebooks.
Each language has its own benefits, performances, library interactions, and functionalities.
Source and Destination Support
Each language supports connections to different sources and destinations for data movement, giving Databricks openness to many connections within a single notebook.
As an Azure service, then, Databricks is even more extensible, since it can connect to many Azure data endpoints using these languages.
Putting It All Together
The combination of Azure Data Factory and Azure Databricks’ extensibility, language support, scalability, and parameterization gives way to a powerful set of tools to perform data movement and ETL on a massive scale.
Because of these capabilities, a recommendation would be to use Databricks in a data transformation capacity within an ETL platform.
Data Factory Data Flow
Azure Data Factory’s upcoming Data Flow capability is in fact built on Databricks. Data flows are typically used to orchestrate transformation rules in an ETL pipeline.
Data Flows are currently in limited preview. Microsoft’s choice of Databricks as the underlying technology shows that building on Databricks is a robust approach.
Business Use Case Examples
- Big data ingestion and output
- Complex data transformations
- Cloud ETL provisioning
- Parallel script executions
- ETL operation consolidation
Real World Example
Let’s presume your company has several terabytes of data in different formats that must be ingested, transformed, and landed on several data platforms.
Azure Data Factory and Databricks can be leveraged to:
- Construct a framework to ingest multiple data sources through Data Factory pipelines
- Transform data sets and send them to appropriate destinations using Databricks notebooks running simultaneously
- Read metadata to instruct the framework and Databricks notebooks on what commands and pathways to follow
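The metadata-driven pattern in the last bullet can be sketched as follows. The metadata shape, paths, and sink names here are hypothetical; in practice the records might live in an Azure SQL table read by a Data Factory Lookup activity:

```python
# Hypothetical metadata-driven routing: each record tells the framework
# where a dataset comes from, how to transform it, and where it goes.
metadata = [
    {"source": "blob://raw/sales.csv",   "transform": "clean",  "sink": "sqldw"},
    {"source": "adls://raw/clicks.json", "transform": "enrich", "sink": "snowflake"},
]

def plan(record: dict) -> str:
    """Turn one metadata record into a human-readable execution step."""
    return f"{record['transform']} {record['source']} -> {record['sink']}"

# Data Factory would iterate these records with a ForEach activity,
# passing each one to a Databricks notebook as parameters.
for record in metadata:
    print(plan(record))
```

Adding a new data source then becomes a metadata change rather than a pipeline change.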
Azure Databricks is an excellent tool to consider when implementing a data ingestion and transformation strategy on a large scale.
Combined with Azure Data Factory and a solid workflow architecture, it can be a solution for a burgeoning data movement need in the industry.
As with any tool or approach, this is not magic or a “silver bullet.” Work is still required to understand and master the technologies, and security is paramount when implementing such tools, based on company requirements.
Likewise, these technologies may not be the right answer to a company’s question of how best to move and transform data.
At some point, however, this solution provides substantial benefits and return on investment. The decision to make the switch comes from knowing the option is available and determining the point at which it becomes useful.