Microsoft has recently incorporated the well-known Databricks platform into Azure as a first-class tool, meaning it is natively supported by other tools such as Azure Data Factory. Azure Databricks is a big step forward in the world of big data and data science. It is a unified Apache Spark platform that allows collaboration between data scientists and data engineers through notebooks integrated directly into the application, and it lets you code in SQL, Scala, R, or Python. In this tutorial, we will learn to configure and run interactive queries on ADLS using Azure Databricks.
At Talavant, Azure Databricks currently plays two roles: adding interactive querying capability to Azure Data Lake, and providing a data science platform for companies getting started on their business intelligence and data science journey.
The above image displays a conceptual architecture for one way of processing and modeling data within Azure Data Lake using U-SQL scripts. Though U-SQL is an excellent big data batch-processing solution, it does not currently have interactive capabilities. Interactive capabilities allow us to do data exploration and assist in the development of our batch-processing model, whether that model is Data Vault, Dimensional, or Big Entities.
We will now walk through setting up Azure Databricks to work with files in Azure Data Lake. We will first spin up the resource within Azure, configure the service principal and app registration, then move on to mounting the Azure Data Lake Store, and finally walk through some queries using Scala and SQL.
Configure ADLS Security
Create Azure Data Lake App Registration
- Navigate to Azure Active Directory > App Registrations
- Select “New application registration”
- Name: Give your ADLS App Registration a Name
- Tip: use naming conventions when creating objects in Azure. For App Registrations I like: APP-<resourceType>-ResourceName
- Example: APP-ADLS-mydatalakename
- Keep the default Application Type of Web app /API
- Sign-on URL: again, define a naming convention.
- Example: http://talavant.appsregs.APP-ADLS-SeansDataLake.com
- On your App Registration, select Settings>Keys
- Create a Name for your Key, select a Duration, and click Save. Copy the Key Value that is given; this is the only time you will be able to capture this value. Paste it into a notepad for now.
- Now that you have created your app registration, you should see a screen with information about it. Copy the following fields:
- Application ID
- Home Page (this is your URL you created)
Add Azure Data Lake App Registration to your Data Lake Store
- Navigate to your Azure Data Lake Store and then on your blade, click on Access Control (IAM)
- Click +Add
- For Role, select Contributor
- For Select, search for the App Registration you created earlier and select it.
- Hit Save
- Navigate to your Azure Data Lake Store Data Explorer, and the root directory of your conceptual Data Lake (or wherever the App Registration will need access to).
- Click Access
- Select +Add
- Search for your App Registration
- Give it Read, Write, and Execute permissions. For Add To, select "This folder and all children", and for Add As, select "An access permission entry and a default permission entry" (this ensures the access entry is added to any new files created in the data lake). Note: if you already have a substantial number of files in your data lake, this can take a while. This is one small reason why a big data strategy is important.
- From the main App Registrations page, select Endpoints and copy the OAUTH 2.0 TOKEN ENDPOINT value.
- We need to grab a few more items from the Azure Portal before we spin up Databricks. They map to the OAuth settings Databricks expects as follows:
- Key Value = credential
- Token Endpoint = refresh.url
- Client ID = client.id
For more information: https://hadoop.apache.org/docs/r2.8.0/hadoop-azure-datalake/index.html
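The mapping above corresponds to the OAuth2 properties described in the Hadoop documentation linked here. As a hedged sketch, setting them in a Databricks Scala notebook cell might look like the following (every ID below is a placeholder you must replace with the values you copied from the portal):

```scala
// Placeholder values -- substitute the IDs captured from the Azure Portal.
val clientId   = "<your Application ID>"                                  // client.id
val credential = "<your Key Value>"                                       // credential
val refreshUrl = "https://login.microsoftonline.com/<tenant-id>/oauth2/token" // refresh.url

// Property names come from the hadoop-azure-datalake OAuth2 configuration.
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", clientId)
spark.conf.set("dfs.adls.oauth2.credential", credential)
spark.conf.set("dfs.adls.oauth2.refresh.url", refreshUrl)
```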
Setup Databricks Environment
Now that we have our Data Lake Store security configured, let's set up our Azure Databricks resource. Creating the Databricks resource itself is as simple as creating any other resource in Azure, so we will not go through those steps. We will, however, need to configure the cluster our queries will run on, and create a notebook, which is where we will write the queries themselves.
- Below is our Welcome screen when we open Databricks. Click on Clusters on the left panel
- Select +Create Cluster to create a new cluster.
- Give your Cluster a Name
- Select the defaults for Databricks Runtime Version, Python Version, Driver Type, Worker Type, and Min and Max Workers.
- Lower the "Terminate after" setting (the default is 120 minutes) if you want to minimize Azure spend. Spinning a cluster back up is fairly quick, usually just a couple of minutes.
Create a Notebook
Now that we have a cluster configured and running, it's time to set up a notebook. A notebook is how we interact with Databricks: it is where we write our code and analyze results, both textual and graphical. For a SQL Server professional, it is similar to a query tab in SSMS. From Azure Data Factory, we can also run a Databricks notebook and pass in parameters.
- Navigate to the Azure Databricks Main Menu and select New Notebook under Common Tasks.
- Give it a name such as NB_DataLake_Interactive and select Scala, Python, R, or SQL. For this tutorial we will use Scala; note that we can still write SQL in a Scala notebook by putting %sql at the top of each SQL cell.
- Open your notebook and make sure it is attached to a running cluster; if it isn't, start your cluster.
Connect and Mount Azure Data Lake Store
- At the top of our notebook, we need to mount our ADLS. If you didn't already write down the Azure IDs, go back to the beginning of this post, which walks through where to find them.
- We can hit Shift+Enter to run a statement. Go ahead and do so; if your values are correct, you will have mounted your Data Lake Store in Databricks.
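As a sketch, a mount cell might look like the following. The mount point name and the Data Lake Store name are assumptions for illustration; substitute your own IDs:

```scala
// OAuth configs for the mount -- all placeholder values are assumptions.
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id"   -> "<your Application ID>",
  "dfs.adls.oauth2.credential"  -> "<your Key Value>",
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

// Mount the store so it appears under /mnt/datalake (hypothetical mount point).
dbutils.fs.mount(
  source = "adl://mydatalakename.azuredatalakestore.net/",
  mountPoint = "/mnt/datalake",
  extraConfigs = configs)
```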
The example dataset we are working with is the Google GDELT dataset, which can be accessed publicly from its website. https://www.gdeltproject.org/
We will first pull a file from our Data Lake STD (Standard) zone.
- We can write SQL by writing %sql at the top of our statement
Create a table from a .csv file:
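A minimal sketch of such a cell, assuming the hypothetical mount point /mnt/datalake, a made-up file path, and a table name of gdelt_events (GDELT files are tab-delimited without a header row):

```scala
// Create a table over a file in the mounted Data Lake.
// Path and table name are illustrative assumptions.
spark.sql("""
  CREATE TABLE IF NOT EXISTS gdelt_events
  USING csv
  OPTIONS (
    path "/mnt/datalake/STD/gdelt/events.csv",
    delimiter "\t",
    header "false",
    inferSchema "true")
""")
```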
Run a Select statement:
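For example, against the hypothetical gdelt_events table created above; the commented %sql form is equivalent in its own cell:

```scala
// Query the table and render the result grid in the notebook.
// Equivalent SQL cell:
//   %sql
//   SELECT * FROM gdelt_events LIMIT 10
display(spark.sql("SELECT * FROM gdelt_events LIMIT 10"))
```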
Display Existing Tables
We can run the query below to verify our table was created, or we can navigate to the storage explorer on the left blade.
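A sketch of the verification cell:

```scala
// List tables in the current database; the isTemporary column
// distinguishes temp views from tables stored in the Hive metastore.
display(spark.sql("SHOW TABLES"))
```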
There are two types of tables:
- Global – Can be accessed across all clusters. These are stored in the Hive metastore.
- Temp – Can only be accessed locally, within the session that created them.
Profile the data using Scala:
First, we pull the table into a dataframe. This is for tutorial purposes only; we could just as well read the data into a dataframe directly from a file. The second image shows display(<table>.describe()), which gives us some nice common data-profiling metrics on our dataset, and it does so fairly quickly.
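Sketched as a cell, again assuming the hypothetical gdelt_events table:

```scala
// Pull the table into a DataFrame.
val df = spark.table("gdelt_events")

// describe() computes count, mean, stddev, min, and max per column --
// quick, common profiling metrics for a first look at the data.
display(df.describe())
```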
Visualize your results interactively:
To visualize our data, we can simply select the graph type we want on the second tab at the bottom of our result.
We've learned how to run interactive queries on ADLS. This is an excellent solution for developing batch processes for Azure Data Lake, where we might not want to wait for each job to spin up while developing U-SQL scripts. Azure Data Lake can benefit from having both a batch and an interactive solution.