Spark Streaming in Databricks – Why do it?

Turner Kunkel   •   07.25.2019

Overview

This article provides a brief overview of Spark Streaming, its availability and use in Databricks, and some reasons to consider it.

 

What is “Spark Streaming”?

To be precise, the technology is named “Spark Structured Streaming.”

Apache's overview sums it up: “Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.”

Loading processes are typically done in batch; in other words, many items are loaded or calculated at one time.  Batch processes are usually run on schedules.

Streaming allows for a continuous stream of information to be processed instead.

Spark Structured Streaming eases the transition from batch processing to stream processing by letting you invoke streams with much of the same coding semantics used in batch processing.
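To make that concrete, here is a minimal PySpark sketch.  The directory path and column names are hypothetical, and the spark variable is the SparkSession that Databricks notebooks provide automatically; the point is that the batch and streaming reads look almost identical:

    # Hypothetical input: a directory of CSV files with a known schema.
    from pyspark.sql.types import StructType, StringType, TimestampType

    schema = (StructType()
              .add("event_id", StringType())
              .add("event_time", TimestampType()))

    # Batch: read everything that exists in the directory right now.
    batch_df = spark.read.schema(schema).csv("/data/events")

    # Streaming: the same API shape, but the result is an unbounded table
    # that grows as new files land in the directory.
    stream_df = spark.readStream.schema(schema).csv("/data/events")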

A continuous stream of information, instead of processing in bulk, provides the following benefits:

  • Less computation needed per unit or transaction of data
  • Near real-time possibilities as new data is available

Less Computation

Why is there less computation involved?  The same number of records is being processed, after all.

It’s not necessarily that fewer inputs and outputs are processed overall; it’s that the computation per timeframe changes.

Say we have 100 widgets that need to be fed into a machine that turns them into something else.  The machine is powerful enough to process 50 widgets at a time, but it internally has to shuffle the widgets around to produce the correct output.

The streaming model would instead allow the machine to process one widget at a time, continually feeding the output.  No shuffle internal to the machine is needed, so there is less strain on it per widget.

The widgets in this analogy, of course, are data points, and the machine would be an algorithm to process the data written in Spark.

In batch, data is pushed through the algorithm all at once, while streaming feeds it in as atomic units.
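As a sketch of feeding the machine one widget at a time, Spark's file stream source accepts a maxFilesPerTrigger option that caps how much data each micro-batch pulls in.  This continues the hypothetical schema and path from the earlier sketch:

    # Cap each micro-batch at one input file, so only a small unit of
    # data is shuffled and computed at any given time.
    stream_df = (spark.readStream
                 .schema(schema)
                 .option("maxFilesPerTrigger", 1)
                 .csv("/data/events"))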


Near Real-Time Possibilities

With streaming, the input is processed as it is seen.  This is opposed to a batch schedule or trigger, which picks up a chunk of historical data on every new run.

Streaming then allows the output to arrive close to the time the input data was seen.  Of course, this depends on the complexity of the code processing the data and the power of the Spark cluster running the stream.  However, it does make near real-time reporting possible.
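As a sketch of what that looks like in code, the query below (continuing stream_df from the earlier example) polls for new input every five seconds and writes results to the console.  The five-second trigger is an assumption for illustration; real latency is roughly the trigger interval plus the computation time:

    # Start the stream: check for new data every 5 seconds and append
    # newly processed rows to the console output.
    query = (stream_df.writeStream
             .outputMode("append")
             .format("console")
             .trigger(processingTime="5 seconds")
             .start())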

 

Streaming in Azure Databricks

Azure Databricks runs on the Spark engine and is a robust ETL option.

So, how can we utilize Spark Structured Streaming in Azure Databricks?  And, possibly more importantly, why utilize it there?

Streaming in Databricks

The Databricks documentation provides the literature and demos needed to get started with Spark Streaming.

With streaming, there are a few steps in the code to achieve a continual stream of output (sketched in code after this list):

  • Read from a large data set, in most cases several input files
  • Spark Streaming takes care of staging these files for the stream
  • Define a calculation on the data, which is applied to each unit of data passed through the stream
  • Output a transformed or aggregated set of data, which updates as data is fed through the calculation
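A minimal sketch of those steps in PySpark follows.  The JSON input path, the action column, and the event_counts table name are assumptions for illustration, not the documentation's exact demo:

    from pyspark.sql.types import StructType, StringType, TimestampType

    # 1-2. Read a stream over an input directory; Spark stages new files
    # for the stream as they arrive.
    schema = (StructType()
              .add("action", StringType())
              .add("time", TimestampType()))
    events = spark.readStream.schema(schema).json("/data/events")

    # 3. Define the calculation: a running count per action type.
    counts = events.groupBy("action").count()

    # 4. Output the aggregate to an in-memory table that updates as data
    # flows through; "complete" mode rewrites the full result each trigger.
    query = (counts.writeStream
             .format("memory")
             .queryName("event_counts")
             .outputMode("complete")
             .start())

    # In another cell: spark.sql("SELECT * FROM event_counts").show()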

In an example lifted directly from that documentation, some files are streamed into Databricks and we want to see aggregates while the stream is processing.

You can see the initial output: [screenshot of the streaming aggregate]

And the output five seconds later: [screenshot of the updated aggregate]

Reasons to use in Databricks

  • Databricks in Azure allows for connections across the Azure platform, which opens up streaming possibilities using the Spark engine within Azure.  It also offers an avenue for streaming other than Azure Stream Analytics, if you are looking for different solutions within the platform.
  • As mentioned before, streaming doesn’t rely on any batch scheduling process; it merely looks for new data as it becomes available.  While there is still a refresh rate when visualizing the data, we gain the ability to pipe data in near real time instead of waiting on historical data.
  • Spark can be very powerful when utilized properly.  Because streaming acts on atomic units during computation, this technology can potentially handle very large workloads quickly while straining memory during computation less than a batch process might.

 

Review

Spark Structured Streaming can be a powerful tool and is available within Azure Databricks.

This makes near real-time processing of data possible through the powerful Spark engine.

It is worth considering when you are looking for an alternative to batch processing in your data workflow.

 

If you have any questions, please feel free to email turner.kunkel@talavant.com.  Thanks for reading!


Turner Kunkel

Senior Consultant


