Regardless of what type of industry you work for there are common components that are essential while designing a data warehouse. The purpose of this blog is to create a list of concepts that are often missed and commonly treated as an afterthought. It is much cheaper to incorporate the answers up front and make the outcome part of your design. This is to be used as a very high level check list to get your wheels spinning; we will delve deeper into each concept in future blogs.
Know Your Sources: Source system analysis is a key component to the first phase of a data warehouse. It is important to understand the source tables and how the data moves within each of the source tables.
- Retention period of the operational system – It is important to know how long the data is available in the operational system. Some tables may only store data for 60 days; while others may store data forever. This will become important for initial loads, incremental processing, deletes, etc. Remember all sources should not be created equal.
- How long can users update records? It is important to know how far back the end user can make physical changes to records; this will help determine how to process the data and will help with future support of the data warehouse.
- Are deletes allowed? It is important to know if the operational system allows the user to delete records and what type of deletes (physical or soft)? Is there referential integrity set on the operational tables so you don’t have orphan records? (For example, there is a header and detail account tables, the header account record is physically removed from the system, but the detail records for these accounts still exist resulting in orphan records.)
- Understand the physical table structures: Primary keys, foreign keys, data types, nullability, etc.
- Understand the data movement: What values are allowed in a field? Is there a series of events the field will obtain? (For example, many times a status field will have phases: Open, in-process, complete. Does the business want to capture each phase of the events or just when record is completed records?)
- Understand the lookup tables that can be used from operational system to help build the dimension tables; Always keep an enterprise vision in mind.
Know your Targets: Don’t just save the targets for the end. As you start to understand the business needs and source systems you can start to visualize your targets forming. It is import to understand how the business wants to see the data displayed, how much data to retain, when you’re pulling from multiple sources to populate one table making sure the data types accommodate accordingly.
- Type of facts to build: transactional, summary, snapshots, etc.
- Daily processing of fact table: incremental, truncate/load, how many days to reprocess, how will you handle deletes. Etc.
- Initial load of the facts: Will I be using the same source tables as the daily increments? How far do we want to go back in history? Do I want to have separate programs to load history vs daily increments?
- Type of dimension table you need to build: This will most likely be questions you will want to ask the business and the processing will be different based on the type of dimension you will be * Daily processing of dim tables: incremental, handle deletes, inferred members, default records, initial load, business rules and enterprise dimensions.
Know Your Batch Processing: How the data warehouse will come together and work as a whole with all the different needs from each of the sources and business partners.
- Retention period in the EDW – how long do you want to keep your data for? Maybe you want the warehouse to be 36 months and the mart to be 24 months? Do we need to build a purge process?
- What is the best time to process the data?
- What is the service level agreement (SLA) of the data warehouse? Is there someone I should alert if there are batch issues? Is there someone to alert if there are data issues?
- Setting up data alerts to insure the data integrity stays intact.
I can’t stress enough that this list only contains nuggets that I have found throughout the years that are commonly missed. It is not all-inclusive, but should help set a solid data warehouse foundation.