Making Excel Reproducible

Reproducibility

Sometimes, you just can't get the answers you need from your BI system. Maybe not all the data is available yet; maybe your self-service tool doesn’t offer the flexibility required to answer a specific question without exporting the data; or maybe nobody's trained you to use your BI system. For lots of analysts, Excel is the tool of choice for working with data since you can use pretty much any data you want.

Excel is great: thanks to its grid-system, it's powerful and extremely flexible. However, it's easy to fall into a trap of depending on one person's manual Excel process to produce mission-critical metrics on a recurring basis. You've probably seen it before: every month, one person extracts data from multiple systems and combines them using Excel formulas in order to generate a metric or dataset. The logic embedded in the workbook is a labyrinth, with calculated columns referencing other calculated columns that use nested formulas.

The problem with results generated from manual processes is that they're not very reproducible – or weakly reproducible, at best. When I say that a result is reproducible, I'm borrowing from the scientific principle of reproducibility, which states that an independent researcher must be able to duplicate your experiment and arrive at a similar result.

In this context, I define the reproducibility of a product (a report, metric, analysis, dataset, etc.) as the ability to reliably generate the same product independently – that is, without requiring the original creator to babysit.

The key word is "reliably"; even if you document every step in your process, each click and keystroke is a potential point of failure, whether it's your coworker following your instructions or yourself, a month from now. Maybe they fat-finger the keyboard; maybe they type an HLOOKUP instead of a VLOOKUP; maybe they simply miss a step. When humans do things, they make mistakes.

Manual work is often unavoidable when working on a new data product, but you should take steps to support automation of your process because 1) you may be asked to produce the same product with updated data, and 2) human error means that someone else (or even you) could follow your instructions and arrive at different results.

Applying the Principle to Excel

Ideally, you implement a BI solution to address these scenarios: it updates regularly, provides an audit trail, and performs the same calculations the same way, every time. But as both BI professionals and non-IT know, BI projects can take time – and while the BI team is working on that, someone needs answers before lunch. Sometimes Excel is just the best (or only) solution.

If you need to manipulate data in Excel, here are some tips for future-proofing your process:

1. Avoid cell references.

I'm talking about formulas like this: =IF(OR(ISERROR(C2), ISERROR(D2)), "N/A", C2 + D2). Ironically, this is what many people like most about Excel: its ability to reference things that you can see directly on the screen. However, this can lead to a chain of references that resemble your tax return (add line 4, 5, and 6 and enter the total on line 7; then subtract line 7 from line 11 and enter that number on line 13). The problem with this chain of references is that if any link in the chain is broken, the whole thing breaks.

Plus, cell coordinates have a nasty habit of changing invisibly; just take a look at the Google results for "excel avoid cell references changing". Microsoft has built in some intelligence so that your formulas don't get messed up, so using cell references isn't necessarily a recipe for disaster; however, I find that it's much cleaner (and typically easier to interpret) if you format ranges as tables, and reference columns by their header name.

2. Get your data and perform basic manipulations using Power Query.

Power Query lets you import data directly from a variety of sources (flat file, XML, SQL, Excel, and more), as opposed to copying and pasting one Excel sheet into another. Once you import your data, you can perform many of the same operations that you would in regular Excel. These include:

  • Adding calculated/derived columns (e.g. extracting LastName from EmployeeName)
  • Deleting columns (e.g. removing an old column that you don't need, or a duplicate column)
  • Renaming columns (e.g. changing obscure Oracle-style column names into human-friendly names)
  • Filtering down to rows that meet a certain condition (e.g. excluding rows with nulls)

Of course, you can do all of these things in vanilla Excel. The real benefit to Power Query is that it's self-documenting, and can repeat your steps with an updated data set. As you perform each operation, Power Query will add it to a running list of applied steps. Once you load the prepared dataset into your workbook, you can always refresh your dataset. Provided your connection is still valid when you refresh, Power Query will go back to the source, apply the same steps you took, and return the updated dataset.

3. Keep complex calculations and logic in Power Pivot

Excel 2013 and 2016 come with Power Pivot, which lets you employ DAX to perform OLAP-like calculations in a single line. The library of time intelligence functions is especially rich; since there's a natural date hierarchy (year/quarter/month), there are pre-built functions for many kinds of date logic. You can also create hierarchies for any other tables you import, which lets you handle rollups smoothly. Plus, PivotTables created from an Excel data model allow you to perform distinct counts without the old-school “count the first row in which this value appears” method.

In addition to letting you use DAX to create your own calculated fields and KPIs, enabling the Power Pivot add-in lets you build a data model behind your workbook. In this data model, you can define relationships between all your source datasets – in other words, you can model your joins. When you do this, any PivotTables you create using your data model will invoke those joins as needed.

Having your data model live in a separate space from your main Excel workbook prevents you from polluting your data (e.g. by mixing your PivotTables with your source tables), and baking in your logic into the data model will save you the effort of performing tedious, manual tasks such as writing multiple VLOOKUPs to combine data from different tables; cutting and pasting rows or columns; or repeatedly adding columns for calculated fields. And if you discover an error in your results, it'll be easier to make fixes and updates – just make changes in the data model and refresh everything in the front-end workbook.

4. Try Power BI

This will mostly take away your ability to modify individual cells, which is a good thing if you want to avoid creating a messy, manual process. You can still load Excel files or flat files, as well as manually enter values into a separate table in your Power BI data model, but you won't be able to edit individual values in tables that you import.

Tying your own hands may seem counter-intuitive, but it's really about making sure you don't get in your own way. It's like placing your alarm clock on the other side of the room when you need to get up early, or putting the cookie jar on a high shelf when you're on a diet.

In addition, Power BI brings more to the table. It includes Power Query and the same data modeling tools as Power Pivot, but also comes with a suite of interactive visualizations that you can use to create dashboards. Furthermore, it's clear that Power BI is the future for Microsoft. With monthly updates, they're constantly improving the product.

Summary

When there's a gap in what your BI system can do for you, you'll probably turn to Excel. You might think it's all you need because it's so flexible – but this flexibility also leads to the proliferation of more and more manual work, which becomes a potential point of failure. You shouldn’t stop using Excel, but when you need to use it, try using the powerful add-ins that Microsoft has developed:

  • Power Query, for basic enhancement and cleansing of incoming datasets
  • Power Pivot, for complex calculations and the ability to use DAX
  • Power BI (if you don’t absolutely need the Excel grid), for both of the above, plus improved visualizations

Using these features will save you headaches down the road.