This article gives a brief overview of Spark Structured Streaming, how it is available and used in Azure Databricks, and some reasons to consider it.
What is “Spark Streaming”?
Strictly speaking, the technology is named “Spark Structured Streaming.”
The Apache Spark overview sums it up this way: “Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.”
Data loading is typically done in batch; in other words, many items are loaded or calculated at one time. Batch processes usually run on a schedule.
Streaming allows for a continuous stream of information to be processed instead.
Spark Structured Streaming makes the transition from batch processing to stream processing easier by letting you define streams with much of the same code semantics used for batch processing.
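As a rough illustration of that similarity, the sketch below defines one aggregation and applies it to both a batch read and a streaming read. The `spark` session, the input path, the function names, and the `time`/`action` columns are all assumptions made for this example.

```python
# Assumed layout: JSON files with a timestamp column `time` and a string column `action`.
EVENT_SCHEMA = "time TIMESTAMP, action STRING"


def count_by_action(events):
    """Transformation shared by both modes: count events per action."""
    return events.groupBy("action").count()


def batch_counts(spark, input_path):
    """Batch: read every file that exists right now, then aggregate once."""
    events = spark.read.schema(EVENT_SCHEMA).json(input_path)
    return count_by_action(events)


def streaming_counts(spark, input_path):
    """Streaming: same transformation, but `readStream` keeps picking up new files."""
    events = spark.readStream.schema(EVENT_SCHEMA).json(input_path)
    return count_by_action(events)
```

The only difference between the two functions is `read` versus `readStream`. The streaming result is just a definition at this point; no data flows until it is started with `writeStream`.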
Processing a continuous stream of information, instead of processing in bulk, provides the following benefits:
- Less computation needed per unit or transaction of data
- Near real-time possibilities as new data is available
Why is there less computation involved? The same number of records is being processed, after all.
It is not necessarily that fewer inputs and outputs are processed overall; it is that the computation per unit of time changes.
Say we have 100 widgets that need to be fed into a machine that turns them into something else. The machine is powerful enough to process 50 widgets at a time, but it has to shuffle the widgets around internally to produce the correct output.
The streaming model would allow the machine to process one widget at a time instead, continually feeding the output. No internal shuffle is needed, so there is less strain on the machine per widget.
The widgets in this analogy, of course, are data points, and the machine would be an algorithm to process the data written in Spark.
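In batch, data is pushed through the algorithm all at once, while streaming feeds it in small increments. By default, Spark Structured Streaming processes these increments as a series of small micro-batches rather than literally one record at a time.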
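The widget analogy can be sketched in plain Python, without Spark. This is only an illustration of the idea: the function names and widget labels are made up, and the sort merely stands in for the machine's internal shuffle.

```python
from collections import Counter


def process_batch(widgets):
    """Batch: hold every widget in memory, regroup them, then emit one result."""
    held = list(widgets)                 # the whole load sits in memory at once
    held.sort()                          # stands in for the internal "shuffle"
    return Counter(held), len(held)      # (final counts, widgets held at once)


def process_stream(widgets):
    """Streaming: take one widget at a time and keep only running totals."""
    counts = Counter()
    for widget in widgets:               # works with any iterator or generator
        counts[widget] += 1
        yield dict(counts)               # an updated result after every widget


widgets = ["bolt", "nut", "bolt", "gear"] * 25   # 100 widgets, as in the analogy

batch_result, held_at_once = process_batch(widgets)

stream_result = {}
for stream_result in process_stream(iter(widgets)):
    pass                                 # each iteration is a usable partial result

print(held_at_once)                          # 100
print(stream_result == dict(batch_result))   # True
```

Both approaches end with the same totals. The batch version has to hold all 100 widgets before it produces anything, while the streaming version only keeps the running counts and has a partial answer after every widget.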
Near Real-Time Possibilities
With streaming, the input is processed as it is seen. This contrasts with a batch schedule or trigger, which picks up a chunk of historical data on every new run.
Streaming therefore allows the output to be produced close to the time the input data was seen. Of course, this depends on the complexity of the code processing the data and the power of the Spark cluster running the stream. However, it does make near real-time reporting possible.
Streaming in Azure Databricks
Azure Databricks runs on the Spark engine and is a robust option for ETL.
So, how can we use Spark Structured Streaming in Azure Databricks? And, possibly more importantly, why use it in Azure Databricks?
Streaming in Databricks
The Databricks documentation provides the literature and demos to get started with Spark Streaming within Databricks.
With streaming, there are a few steps in the code to achieve a continual stream of output.
- Read from a large data set, in most cases several input files
- Spark Streaming takes care of staging these files for the stream
- Define a calculation that is applied to the data as it passes through the stream
- Output a transformed or aggregated set of data, which is updated as new data is fed through the calculation
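The steps above might look roughly like the following in PySpark. This is a sketch under assumptions, not the code from the Databricks demo: the `spark` session, the input path, the `time`/`action` columns, the function name, and the `action_counts` query name are all hypothetical.

```python
def start_action_counts(spark, input_path, query_name="action_counts"):
    """Start a streaming query that keeps per-action counts up to date."""
    # Read: treat the input files as a stream. Spark tracks which files it
    # has already processed; `maxFilesPerTrigger` limits how many new files
    # are picked up per micro-batch.
    events = (
        spark.readStream
        .schema("time TIMESTAMP, action STRING")   # assumed file layout
        .option("maxFilesPerTrigger", 1)
        .json(input_path)
    )

    # Define: the calculation applied to the data as it passes through.
    counts = events.groupBy("action").count()

    # Output: keep the running aggregate in an in-memory table that can be
    # queried by name while the stream is running.
    return (
        counts.writeStream
        .format("memory")
        .queryName(query_name)
        .outputMode("complete")
        .start()
    )
```

The in-memory sink is convenient for exploring a stream in a notebook. A production job would more likely write to a durable sink.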
In an example taken directly from that documentation, some files are streamed into Databricks and we want to see some aggregates while the stream is processing.
[Figure: the initial output]
[Figure: the output five seconds later]
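One way to capture two snapshots like these yourself is to read the stream's result table twice with a pause in between. The sketch below assumes a running streaming query that writes per-action counts to a table. The table name `action_counts`, the `action` column, and the function name are hypothetical.

```python
import time


def snapshot_twice(spark, table_name="action_counts", delay_seconds=5):
    """Return two snapshots of a streaming result table, `delay_seconds` apart."""
    first = spark.table(table_name).orderBy("action").collect()
    time.sleep(delay_seconds)            # give the stream time to process more input
    second = spark.table(table_name).orderBy("action").collect()
    return first, second
```

If the stream has processed new input during the pause, the second snapshot should show updated aggregates.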
Reasons to use in Databricks
- Databricks in Azure allows for connections across the Azure platform. This opens up streaming possibilities using the Spark engine within Azure. It also opens up a different avenue for streaming than Azure Stream Analytics, if you are looking for different solutions within the platform.
- As mentioned before, streaming doesn’t rely on any batch scheduling process. It simply looks for new data as it becomes available. While there is still a refresh rate when visualizing the data, we get the ability to pipe data in near real time instead of waiting for a scheduled run to pick up accumulated data.
- Spark can be very powerful if used properly. Because streaming acts on small units of data during computation, this technology can potentially handle very large workloads quickly while putting less pressure on memory than a batch process might.
Spark Structured Streaming can be a powerful tool and is available within Azure Databricks.
This makes near real-time processing of data possible through the powerful Spark engine.
It is worth considering when you are looking for an alternative to batch processing in your data workflow.