U-SQL: Dynamic File Sets – Outputters

04.04.2018

Dynamically split data into multiple file sets using U-SQL’s new functionality of FileSet Outputters.

U-SQL(Universal SQL) allows you to work across any data type, it is a hybrid of C# and T-SQL, and allows for scalability in performance. U-SQL is part of the Azure Data Lake Analytics service.

Before dynamic Fileset Outputters were released you still had the capability of dynamic Fileset Inputters. This allowed you to work across multiple source files through the use of Virtual Columns, but you could not use this behavior to dynamically create new files. Your options before would be to create multiple Output statements or create a Powershell script that iterates over a U-SQL Stored Procedure. Neither of which were effective long-term solutions.

Now Microsoft recently introduced the capability of Dynamic Outputters. This allows you to Output files based on columns from the dataset. This can be extremely beneficial when we need to part out a large file after ingestion or conform it to our data lake standard.

*Notice – This feature is still in preview and requires the feature to be enabled.
Do so with this code: @@FeaturePreviews = “DataPartitionedOutput:on”;

 

 

Original Functionality Example:

Problem: We need to dynamically combine and process multiple files to create a single dataset.
Solution: U-SQL Dynamic Input Filesets

 

 

 

  • SOURCE: Multiple files; one for each month
  • VARIABLES: We declare two variables @IN and @OUT to hold our input and output file paths. You can virtualize a column by using  {<nameOfVirtualColumn>}. We then pull this virtual column into our EXTRACT statement and we can now associate the combined data with it’s originating file set.
  • EXTRACT Statement: This is where the schema-on-read occurs. We define our schema using C# data types and we can define our column names to whatever we would like. The other part to this is the USING construct and Extractors, we need to define the source file format. We can either create a custom extractor or use one of the native options(ie. text, csv, tsv).
  • QUERY Statement: Once our dataset is defined, we then can begin the transformation phase. You can query off of your Extract statement and even augment it against U-SQL tables.
  • OUTPUT Statement: Our end result, we need to tell U-SQL where to land our file set and how(format). The result of this script would combine our source files separated by month in a file grouped by year. U-SQL works best on fewer larger files vs. numerous smaller files.

 

 

 

 

 

New Functionality Example:

Problem: We need to dynamically create multiple output files from 1 or more input files.
Solution: U-SQL Dynamic Output Filesets

 

 

 

  • SOURCE: One large file spanning an entire year’s worth of data
  • VARIABLES: The difference here is that we are also virtualizing our file path in the @OUT variable using our Virtual Date Column and parsing out the year and month to be used in our output file set. We see how to do this in the next steps
  • EXTRACT Statement: We bring in the fileName column in from our @IN variable. In this case that would be “File_2018”.
  • QUERY Statement: Querying off of our Extracted dataset, we want to create separate files based off the CreateDate column. We do so with: DateTime.Parse(CreateDate) AS Date.  Back in our @OUT variable we then can use {Date:yyyy} to get the year and {Date:MM} to get the month from our newly parsed date field. With this new functionality, U-SQL will then use this dynamic output path to determine where to land a new file set based on the virtual columns
  • OUTPUT Statement: Nothing changes here, this is the statement doing the actual output of files.

 

 

 

 

 

U-SQL in Azure Data Lake Analytics is a powerful tool for working and analyzing file sets at massive scale. It is a true compliment to the Enterprise Data Warehouse and strategic asset for modern data initiatives.

 

 

 

 


At Talavant, our goal is to help you discover, ask and answer the right questions – so you get more value from your data to drive your business forward. You’ll gain a competitive edge with faster, more relevant analytics. It’s all part of our focused, holistic approach that goes beyond tools.

Ready to get started?

Contact Us