Using Patterns to Improve Data Engineering

Ashley Day   •   08.09.2019

Throughout my career I’ve had plenty of “Aha!” moments, but perhaps one of the most significant and impactful was the moment I finally understood what a colleague meant when they told me about the importance of “thinking in patterns.” This blog will explore that realization and how thinking in patterns can help to improve all aspects of the development process.

 


Thinking in Patterns

When thinking of patterns within the world of data, it’s natural to think only of those within a data set – but equally important are the patterns that can be identified and leveraged when engineering solutions to get access to that data. Whether building ETL processes or creating the database objects that will hold data, patterns can be used to not only expedite development, but also to reduce the chance for error. A recent project helped me understand firsthand just how valuable pattern recognition can be, and I’ll use aspects of that project to demonstrate that value in this blog. I will provide a few real-world examples from my own work and challenge you to think about how similar solutions might benefit you.

Recognizing Patterns

The most effective way to identify patterns when developing a solution is to ask yourself the question: What process(es) am I repeating? Your answer to this question could vary greatly depending on which phase of development you’re in, but it’s likely you’ll be able to think of at least a few areas where you’re repeating the same steps with only slight variations – those repetitions are patterns. Once you’ve identified those processes, you can use their similarities to speed up development, as is shown in the following examples.

Creating Database Objects

When building out a database or data warehouse, it’s likely you’ll encounter the need to create multiple tables or other database objects that are similar in structure. To meet the needs of a recent project, I was tasked with creating 20+ tables, each representing different file formats with different fields. In addition to the data from the files, each table also had to have metadata fields used for logging and auditing purposes. I was provided with file specifications for each of the file formats that listed the field names and expected data types. I used these specifications, along with the required metadata fields, to generate table Data Definition Language (DDL) scripts in Excel. An example of the process I used is depicted below.

 

patterns

In this example, I had to create two tables – one named Employee, and one named Sales. These tables contained different fields for the data they would store from files, but they each required the same metadata fields and scripting syntax. To generate the DDL, I used formulas within adjacent cells to reference the fields I needed. The fields that were similar to all tables were stored in cell A1 and were referenced in the formulas that contained the “CREATE TABLE” syntax. All body fields contained a formula to combine the field name and data type values followed by a comma, with the last field of the table varying slightly to end with a closing parentheses and semi colon.

If you need only a handful of tables, it might be overkill to use a pattern like this. But when you have to create a large number of data base objects, it can be a great time saver to identify the similarities between them and automate their creation.

ETL Development

One of the most common areas you’ll see patterns in obvious use is within ETL processes. Regardless of the tool being used, any process that is being repeated can and should be put into a format that allows for quick re-use. For example, when creating SSIS packages I often implement Event Handlers to log package executions. These Event Handlers are always configured the same way, so one of the first things I do is create a “Template” package that is comprised of only those Event Handlers and can be copied/pasted any time I need to create an entirely new package. By doing this, I don’t have to repeat the same process of creating them from scratch every time. That example is specific to SSIS, but most ETL tools have similar functionality to allow the same process.

Another great way to take advantage of patterns in your ETL process is if you’re able to work with the language building out that process behind the scenes. Examples of this would be using something like Business Intelligence Mark-Up Language (BIML) in SSIS to build out packages or manipulating the XML used to create Pipelines in Azure Data Factory. While working in the UI is great, having the ability to work with those scripts allows you to really take advantage of patterns and quickly make changes to  your work with a lot less time spent clicking and configuring different tasks.

Building Queries

An additional way to speed the development process is to recognize relational patterns within the data you’re working with. By this, I mean looking at the way the data interacts and identifying similarities where those relationships are repeated. The best example here is to think of hierarchies. You might have two tables with completely different types of  data, but both of those tables have a hierarchical structure. If you have to create a query, view or stored procedure that looks at those tables, you can repeat your logic but substitute in the appropriate table names, as shown in the example below.

 

patterns

In this example, there are two tables with very different types of data. One contains information on record types, and one has data on salespeople. The similarity between these tables is that they both have an ID field to indicate a hierarchy. In the case of Record Type, the highest level of the hierarchy would be a parent record, and in the case of salespeople, it would be the supervisor. If I wanted to create a query to pull the parent record or supervisor into the same line as the child record or salesperson, my query would follow the same pattern but with reference to different field names and tables. Each query would certainly require validation to ensure the results received were as expected, but having established the pattern for one table meant I could develop the query for the other in a fraction of the time.

Recognizing Patterns Elsewhere

The examples I provided show only a few instances where patterns can be utilized to improve the development process, but there are no defined rules for when and how they should be used. The overall goal of this post is to encourage you to start understanding the different ways they can be used, and to start “thinking in patterns” when working with data. They are the foundation to any automation process and understanding how to take advantage of their presence will only help improve development.


Ashley Day

Senior Consultant

Share This Article


At Talavant, our goal is to help you discover, ask and answer the right questions – so you get more value from your data to drive your business forward. You’ll gain a competitive edge with faster, more relevant analytics. It’s all part of our focused, holistic approach that goes beyond tools.

Ready to get started?

Contact Us