Scale ETL Ops with Azure Databricks

Turner Kunkel   •   03.04.2019

Introduction

This article gives an overview of using Azure Databricks to scale out ETL operations, connecting to Azure Data Factory as well as writing data to database systems.

It reviews the capabilities that allow Databricks to serve as such a scalable transformation tool in the data ingestion and analytics landscape.

For a review of Databricks in general, see the Databricks website; for an overview of the resource in Azure, see Microsoft's Azure Databricks documentation.

(Please also see a previous article on using Azure Data Lake Storage and Databricks.)

The Problem

As organizations grow, so does their data. Extract-transform-load (ETL) operations inevitably become more demanding, and they grow cumbersome if not scaled properly with the correct tools.

Azure Data Factory using Azure Databricks is one possible solution to this expanding landscape.

 

Why Azure Databricks?

Databricks is tightly integrated with Azure, so other cloud resources can be utilized and connected to without leaving the Azure platform.

What This All Means

Databricks provides a scalable transformation layer in an Azure ETL architecture, with flexible language support for both development and analytics.

It also makes it possible to move data between multiple technologies within the architecture using a single tool.

 

Scale With Azure Data Factory

Azure Data Factory, as documented in the linked article above, includes an activity that will run a Databricks notebook.

The notebook can accept parameters from Data Factory and can be placed inside a ForEach activity, which can run up to 50 iterations simultaneously.

What does this mean?

Databricks can therefore be used to scale out Data Factory operations for dynamic task completion, running whatever code is necessary within the Databricks notebook.

These tasks complete more quickly when run in parallel.
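
As a minimal sketch of what such a notebook might look like (the parameter names source_path and run_date, and the load_date column, are hypothetical), a Databricks notebook can read the values passed by the Data Factory Notebook activity through widgets and hand a result back to the pipeline:

    # Databricks notebook cell; dbutils and spark are provided by the notebook environment.
    # Declaring widgets supplies defaults for standalone runs; Data Factory overrides them
    # through the Notebook activity's baseParameters.
    dbutils.widgets.text("source_path", "dbfs:/mnt/raw/default/")
    dbutils.widgets.text("run_date", "2019-03-04")

    source_path = dbutils.widgets.get("source_path")
    run_date = dbutils.widgets.get("run_date")

    # Use the parameters to drive this notebook's slice of the work.
    df = spark.read.parquet(source_path)
    result = df.filter(df["load_date"] == run_date)  # hypothetical column

    # Return a value to Data Factory; it appears in the activity's run output.
    dbutils.notebook.exit(str(result.count()))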

 

Spark Platform

Apache Spark is the computing engine that Databricks runs on. Behind the scenes, notebooks execute against a Spark cluster, and clusters can be scaled up or down to match the computing power needed.

Spark is built for distributed computing and excels at handling big data workloads.

Combining this capability with Azure Data Factory makes it possible to run parallel, high-throughput big data operations in Azure, without standing up the infrastructure traditionally required to achieve the same data ingestion goals.
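
For illustration (the file paths and column names below are assumed), a transformation like the following is automatically distributed across the cluster's workers by Spark:

    # PySpark inside a Databricks notebook, where spark is already defined.
    from pyspark.sql import functions as F

    # Read a large folder of raw files; Spark splits the read across the cluster.
    orders = spark.read.option("header", "true").csv("dbfs:/mnt/raw/orders/")

    # Aggregate in parallel across the executors.
    daily_totals = (
        orders
        .withColumn("amount", F.col("amount").cast("double"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )

    daily_totals.write.mode("overwrite").parquet("dbfs:/mnt/curated/daily_totals/")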

Data Storage

Files can be stored in the Databricks File System (DBFS) for access during notebook operations.

Data can also be stored natively as tables registered in the Hive metastore. This native storage is useful for passing information between notebooks and cells when necessary.
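
As a hedged sketch (the staging database and table names are invented), one notebook can register a result as a metastore table and another notebook, or a later cell, can read it back by name:

    # Notebook A: persist a DataFrame as a managed table in the Hive metastore.
    df = spark.createDataFrame([("2019-03-01", 120.0)], ["order_date", "total_amount"])
    spark.sql("CREATE DATABASE IF NOT EXISTS staging")
    df.write.mode("overwrite").saveAsTable("staging.daily_totals")

    # Notebook B, or a later cell: read the same table by name, with no file paths needed.
    totals = spark.table("staging.daily_totals")
    totals.show()

    # Files landed in the Databricks File System are also directly accessible.
    display(dbutils.fs.ls("/FileStore/tables/"))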

 

Multi-Language Support

Developers and data scientists grow accustomed to specific languages for computing, data mining, and analytics. Databricks offers flexibility here: its notebooks support Scala, Python, R, and Spark SQL, even side by side in the same notebook.

Each language brings its own benefits, performance characteristics, library ecosystem, and functionality.
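
As a small illustration (the view and its columns are made up), data can be shared across languages through a temporary view, and Spark SQL can be issued either from Python or from a separate %sql cell:

    # Python cell: register a DataFrame as a temporary view.
    people = spark.createDataFrame([("Ana", 34), ("Raj", 41)], ["name", "age"])
    people.createOrReplaceTempView("people")

    # Spark SQL issued directly from Python...
    spark.sql("SELECT name FROM people WHERE age >= 18").show()

    # ...or the same view queried from another language's cell, for example:
    # %sql
    # SELECT name FROM people WHERE age >= 18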

Source and Destination Support

Each language supports connections to different sources and destinations for data movement, which opens up many connection options within a single notebook.

As an Azure service, then, Databricks is even more extensible, since these languages can connect it to many Azure data services.
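
As one example of such a destination (the server, database, table, and secret scope names are placeholders), a notebook can write a DataFrame to an Azure SQL Database over Spark's JDBC connector, pulling credentials from a Databricks secret scope rather than hard-coding them:

    # Reuse the metastore table from the earlier sketch as the data to ship out.
    daily_totals = spark.table("staging.daily_totals")

    # Placeholder connection details for an Azure SQL Database.
    jdbc_url = (
        "jdbc:sqlserver://myserver.database.windows.net:1433;"
        "database=mydatabase;encrypt=true;loginTimeout=30"
    )
    user = dbutils.secrets.get(scope="etl-scope", key="sql-user")
    password = dbutils.secrets.get(scope="etl-scope", key="sql-password")

    # Append the results to a table in the target database.
    (daily_totals.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.DailyTotals")
        .option("user", user)
        .option("password", password)
        .mode("append")
        .save())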

 

Putting It All Together

The combination of Azure Data Factory and Azure Databricks’ extensibility, language support, scalability, and parameterization yields a powerful set of tools for performing data movement and ETL on a massive scale.

Because of these capabilities, the recommendation is to use Databricks as the data transformation layer within an ETL platform.

Data Factory Data Flow

Azure Data Factory’s future Data Flow capability is in fact built on Databricks.  Data flows are typically used to orchestrate transformation rules in an ETL pipeline.

Currently, Data Flows are in limited preview. Microsoft's choice of Databricks as the underlying technology is further evidence that Databricks is a robust approach to take.

 

Business Use Case Examples

  • Big data ingestion and output
  • Complex data transformations
  • Cloud ETL provisioning
  • Parallel script executions
  • ETL operation consolidation

Real World Example

Let’s presume your company has several terabytes of data in different formats that must be ingested, transformed, and delivered to several data platforms.

Azure Data Factory and Databricks can be leveraged to:

  • Construct a framework to ingest multiple data sources through Data Factory pipelines
  • Transform data sets and send them to appropriate destinations using Databricks notebooks running simultaneously
  • Read metadata to instruct the framework and Databricks notebooks on what commands and pathways to follow

Once the framework is constructed, maintenance is reduced to metadata upkeep and operations monitoring. Luckily, Azure provides active monitoring and analytics on its resources.
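
A rough sketch of that metadata-driven pattern (the etl_meta.datasets table and its columns are entirely invented): each parallel notebook run receives only an identifier from the Data Factory ForEach activity, looks up its instructions, and routes the data accordingly.

    # Hypothetical metadata-driven notebook: Data Factory passes only a dataset id;
    # source, format, and destination are looked up from a metadata table.
    dbutils.widgets.text("dataset_id", "1")
    dataset_id = int(dbutils.widgets.get("dataset_id"))

    meta = (spark.table("etl_meta.datasets")  # invented metadata table
                 .filter(f"dataset_id = {dataset_id}")
                 .first())

    # Read the source in whatever format the metadata specifies.
    df = (spark.read
              .option("header", "true")
              .format(meta["source_format"])
              .load(meta["source_path"]))

    # Route the output based on metadata.
    if meta["destination_type"] == "table":
        df.write.mode("overwrite").saveAsTable(meta["destination_name"])
    else:
        df.write.mode("overwrite").parquet(meta["destination_name"])

    dbutils.notebook.exit("succeeded")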

Summary

Azure Databricks is an excellent tool to consider when implementing a data ingestion and transformation strategy on a large scale.

Combined with Azure Data Factory and a solid workflow architecture, it can be the solution to a burgeoning data movement need.

Notes

As with any tool or approach, this is not magic or a “silver bullet.” Work is still required to understand and master the technologies. Security is also paramount when implementing such tools, based on company requirements.

Likewise, these technologies may not be the right answer to a company’s question of how best to move and transform data.

However, there is a point at which this solution provides substantial benefits and return on investment. The decision to make the switch comes from knowing the option is available and determining the point at which it becomes useful.


Turner Kunkel

Senior Consultant
