I recently came across an interesting use case for the GROUPBY function in DAX, and in the process thought it would also make a great example for explaining evaluation context in DAX.
Consider the following table of data showing purchase requirements for two products, from multiple suppliers.
Each product/SKU combination can only be bought in a certain minimum order quantity, so even if you need to purchase only 3 units of product A in an XS stock keeping unit (SKU), you need to buy at least 100 to comply with the rule.
| Date | Supplier | Product | SKU | Min unit qty | Qty Required |
| --- | --- | --- | --- | --- | --- |
| 31/05/2018 | Supplier2 | A | XXS | 100 | 14 |
| 31/05/2018 | Supplier2 | A | XXS | 100 | 20 |
| 31/05/2018 | Supplier1 | A | XS | 100 | 14 |
| 30/06/2018 | Supplier1 | A | XS | 100 | 3 |
| 30/06/2018 | Supplier2 | A | S | 100 | 64 |
| 30/06/2018 | Supplier2 | A | M | 80 | 4 |
| 30/06/2018 | Supplier1 | B | XXS | 100 | 91 |
| 30/06/2018 | Supplier1 | B | M | 80 | 9 |
| 30/06/2018 | Supplier1 | B | L | 80 | 23 |
| 30/06/2018 | Supplier2 | B | L | 80 | 18 |
| 30/06/2018 | Supplier1 | B | XL | 60 | 18 |
| 30/06/2018 | Supplier1 | B | XL | 60 | 18 |
| 30/06/2018 | Supplier1 | B | XXL | 60 | 7 |
With a bit of manipulation, you could present the data in the following pivot table, using a measure (Base qty) to aggregate the quantity required with a simple sum:
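For reference, that Base qty measure is just a simple sum; a minimal sketch, with the table name Table1 taken from the formulas later in this post:

Base qty = SUM ( Table1[Qty Required] )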
Next, you might want to work out an order plan that complies with your minimum order requirements. In its basic form (I’ll use the DAX variable syntax), the measure would look something like this, where [Minimum Unit Qty] = MIN ( Table1[Min unit qty] ) ♦:
Qty in unit packs 1 =
VAR frac = DIVIDE ( [Base qty], [Minimum Unit Qty] )
VAR Unitqty = ROUNDUP ( frac, 0 )
RETURN
    Unitqty * [Minimum Unit Qty]
♦ You could have used the MAX or the SUM function here too.
Adding this measure to the pivot table and removing date information so we just look at the total amount for all time, yields the following result:
If it isn’t immediately obvious, the totals of our new measure don’t add up to what we might have hoped for. This is because our new measure is evaluated in FILTER context.
What does that mean?
There is no single minimum unit quantity for Product A across all SKUs (there are several), yet our aggregation of the minimum order quantity tells the measure to look for the MINIMUM value across all Product A rows in the current filter context. In the example above, at the total level for Product A, that is 80 units.
While we know, of course, that 80 is not at all relevant to the total quantity across all Product A rows, it does explain the result the formula is giving us: 119 units divided by 80, rounded up to zero decimal places (i.e. to the next whole number) and multiplied by 80, gives 160, so that is how the total is calculated. In the same way, you can confirm how the 240 and even the 360 were arrived at. That is filter context in action.
Coming back to my previous comment where I said you could have also used the MAX function to aggregate the minimum order quantity, the result would just have used the value 100 instead of 80 or 60, in the cases of Product A or B, but your totals would have been calculated in a similar manner.
The SUM function would operate in the same way, except it would use the sum of the minimum order quantities as the denominator, giving us a total of 360, which is what we want. But this is purely a coincidence: the result would still have been calculated using filter context, and if the data were a little different, or even if we just had more rows, you would quickly find that the totals don’t “add up”.
Okay, so we know we don’t want filter context when looking at totals, but how can we change FILTER context into ROW context?
The first thing that popped into my head was SUMX, because I know that this function forces filter context into row context, and I often use it to make the totals in pivot tables “add up” to what I want to see, and to what makes sense to me.
Let’s modify our equation to use SUMX then (we’ll ditch the Variable syntax for now because we have to refer to “naked columns” for this SUMX to work):
Qty in unit packs 2 =
SUMX (
    Table1,
    ROUNDUP ( Table1[Qty Required] / Table1[Min unit qty], 0 )
        * Table1[Min unit qty]
)
Adding this measure to our pivot table yields some more confusing results:
The results in the first couple of lines look fine; however, the third and fourth lines might give you reason to frown. We only need 17 units (line 3), and the minimum order quantity is 100, yet the formula returns a value of 200. How can that be? All we did was apply a SUMX to the value our measure returns. At least the totals now appear to add up as we expected.
The reason for the measure returning 200 on the third line instead of 100 is that the 17 units in our underlying data come from two different dates: 14 units in May and 3 units in June (see the very first pivot table). These dates are not shown in our current pivot table, but they still exist in the underlying data source. Our new SUMX formula honours this “hidden” row context, as can be revealed by adding the Date field to the pivot table, filtered to show Product A and the XS SKU:
The formula therefore does the calculation on both instances, or both ROWS in our underlying data.
What about the fourth line then? The quantity is only valid for one date, and comes from the same supplier too, so why does this result give 200?
To get the answer in THIS case, we have to dig a little deeper. Looking at the source data table, we can see that the 34 units in May comprise two different lines, or rows, of 14 and 20 units respectively. Once again, SUMX has performed our calculation on each row of this data individually before adding the results together, because we have forced the evaluation down to ROW context. The results for Product B can be explained in a similar way, noting that for the SKU “L”, the product is sourced from two different suppliers, so technically 160 is correct and 80 is wrong! Product B SKU “XL” is made up of two duplicate rows again, like Product A SKU “XXS”.
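Spelling that fourth line out as a quick worked check, row by row versus grouped:

Row by row: ROUNDUP ( 14 / 100, 0 ) * 100 + ROUNDUP ( 20 / 100, 0 ) * 100 = 100 + 100 = 200
Grouped: ROUNDUP ( 34 / 100, 0 ) * 100 = 100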
The GOOD news is that we have confirmed that yes, SUMX changes FILTER context into ROW context, because it now evaluates the formula for each and every row in our source table, and the totals at least “add up” as we would expect. The BAD news is that SUMX is now giving us “wrong” values for some lines.
Okay, you might say, well why don’t you just go back to Power Query or SQL and group the data to get rid of duplicate rows and the date information, and use this as your new source data?
This would work, but we would then lose the ability to apply a date filter to our data after we load it to the data model.
What if we could group the data “on the fly”, doing the grouping only on the selected subset of data as filtered by slicers and other filters?
Enter GROUPBY
The GROUPBY function in DAX can be used to calculate a new table “on the fly”, where we group our underlying data source to only include the columns we specify, while honouring existing external filters.
We can then use SUMX, AVERAGEX, MAXX or any other such iterator to aggregate the numbers in the current group (table) that we are calculating on the fly. To refer to this virtual table, the syntax CURRENTGROUP() is used.
So what does that look like in our example? I want to group this data to exclude date information. We’ll use the DAX Variable syntax again:
Qty in unit packs 3 =
VAR Groupedtable =
    GROUPBY (
        Table1,
        Table1[Supplier],
        Table1[Product],
        Table1[SKU],
        Table1[Min unit qty],
        "Groupedvalue", SUMX ( CURRENTGROUP (), Table1[Qty Required] )
    )
VAR something =
    SUMX (
        Groupedtable,
        ROUNDUP ( DIVIDE ( [Groupedvalue], Table1[Min unit qty] ), 0 )
            * Table1[Min unit qty]
    )
RETURN
    something
♣ You don’t have to group data using columns from the same table; you can also use columns from lookup tables, but the first table argument of GROUPBY should be the table containing the numeric data you want to group.
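As a sketch of that idea, suppose we had a related lookup table of products; the Products table and its Category column are hypothetical here, purely for illustration:

Qty by category =
VAR Groupedtable =
    GROUPBY (
        Table1,               // first table: the one holding the numeric data
        Products[Category],   // hypothetical column from a related lookup table
        "Groupedvalue", SUMX ( CURRENTGROUP (), Table1[Qty Required] )
    )
RETURN
    SUMX ( Groupedtable, [Groupedvalue] )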
While verbose, the result gives us exactly what we need:
Could I have used the SUMMARIZE function?
Absolutely, but this is not a post about SUMMARIZE. We can use a very similar syntax, with the exception of the CURRENTGROUP() reference:
Qty in unit packs 4 =
VAR Summarizedtable =
    SUMMARIZE (
        Table1,
        Table1[Supplier],
        Table1[Product],
        Table1[SKU],
        Table1[Min unit qty],
        "Summarizedvalue", SUMX ( Table1, Table1[Qty Required] )
    )
VAR something =
    SUMX (
        Summarizedtable,
        ROUNDUP ( DIVIDE ( [Summarizedvalue], Table1[Min unit qty] ), 0 )
            * Table1[Min unit qty]
    )
RETURN
    something
So how are they different?
SUMMARIZE does an implicit CALCULATE for each extension column it adds, whereas GROUPBY does not. GROUPBY is also tipped to be very performant, and personally I found the syntax more palatable than traditional explanations of SUMMARIZE and its variations. But neither of those considerations is important for our current dataset.
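To see the implicit-CALCULATE difference in practice, here is a small sketch (not one of our order-plan measures): in SUMMARIZE a plain aggregation works as an extension column, whereas GROUPBY requires the extension expression to iterate over CURRENTGROUP().

// SUMMARIZE: the implicit CALCULATE turns SUM into a per-group aggregation
SUMMARIZE ( Table1, Table1[SKU], "Val", SUM ( Table1[Qty Required] ) )

// GROUPBY: no implicit CALCULATE; the aggregation must go via CURRENTGROUP()
GROUPBY ( Table1, Table1[SKU], "Val", SUMX ( CURRENTGROUP (), Table1[Qty Required] ) )

Both expressions return the same grouped table here; the difference is purely in how the extension column must be written.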
The really great thing about using GROUPBY or SUMMARIZE to recalculate the grouped table “on the fly” is that you can still apply an external filter to the data (such as a date filter, or a transaction ID filter if we had that detail in our data set), even if that field isn’t included in the calculated group, and the result will respond appropriately. That is really what enthused me to write this post.
I’m sure there are other implications to consider for choosing between GROUPBY and SUMMARIZE – let me know in the comments – but hopefully you’ve learned something.
Seasonality is an important phenomenon for many businesses to consider. In the context of this post, it refers not only to seasons in the sense of Winter, Summer and so on; I’ll also look at how you can report on business activity in terms of custom-defined seasons.
To state the obvious, an ice cream shop might sell less ice cream during winter than during summer. Clothing, cycling gear and even chocolate are also products with seasonal elements in their trade. There are countless more examples, but the point is that you can plan your sourcing and/or manufacturing activity in accordance with seasonal demand. Doing so might have significant financial and logistical implications too: why would you want the previous season’s stock taking up room (and tying up money) in a shop or warehouse where you need space for the current season’s product? Does the product have a short shelf life, or is it sensitive to temperature fluctuations from one season to the next? In that case you really only want to buy enough stock or raw materials for the relevant season.
You may even decide that you want the amount of safety stock you hold to depend on the season, or schedule promotions by season.
Central to your ability to analyse your data on a seasonal basis is the definition of the seasons you choose to adopt. For instance, do you have only two seasons (busy and not busy), three, four or sixteen? This step is the most important to nut out before continuing.
Let’s use an example where we have four seasons, as defined for Australia in general:
Summer: December to February
Autumn: March to May
Winter: June to August
Spring: September to November
Next is to bring those seasons into your calendar table. Depending on whether you used Get & Transform (Power Query) or Power Pivot to construct your Calendar table, the syntax used for doing this would vary accordingly.
Let’s assume you constructed one using the Calendar construction method described here, starting at 2015.
A Switch(True(),…) calculated column can be used to identify the season of interest:
Season =
SWITCH (
    TRUE (),
    'Calendar'[Month Number] = 12
        || 'Calendar'[Month Number] <= 2, "Summer",
    'Calendar'[Month Number] <= 5, "Autumn",
    'Calendar'[Month Number] <= 8, "Winter",
    "Spring"
)
Next we want to add a counter, or index to the Season, but if we just count the seasons as defined above, we’ll only get a maximum of 4. Perhaps we can combine the Season with the Year value, and therefore get a unique season for each year? Not a bad guess, but there’s a complicating factor here.
The problem with how our seasons are defined, is that the same Summer always spans across two different Year values:
If we simply combined Season with Year to obtain our unique index for each season as follows:
SeasonYear = 'Calendar'[Season] & " " & 'Calendar'[Year]
we would get the same index value for Summer 1 (Jan & Feb) and Summer 2 the following December, which falls in the same year. Our index values would effectively mix Summer 1 with Summer 2, Summer 2 with Summer 3, and so on. Not good.
How then, do we get the appropriate index number?
My preferred way is to consider again the definition of the seasons, as well as the starting year of the calendar. The first year in my calendar is 2015, so by varying the number I subtract from the year value depending on the month number, I can ensure that Summer 2 in Dec 2015 yields the value “Summer 2” and Summer 1 in 2015 yields the value “Summer 1” with the following formula:
SeasonNumber =
'Calendar'[Season] & " "
    & SWITCH (
        TRUE (),
        'Calendar'[Month Number] >= 12, 'Calendar'[Year] - 2013,
        'Calendar'[Year] - 2014
    )
So for the year 2015, I subtract one less from the year value in December than in the other months, making December’s index value one larger than that of any other month in 2015.
The numeric counter, or season index is then calculated, referring to the SeasonNumber column as the value we wish to count:
SeasonIndex =
CALCULATE (
    DISTINCTCOUNT ( 'Calendar'[SeasonNumber] ),
    FILTER ( 'Calendar', 'Calendar'[Date] <= EARLIER ( 'Calendar'[Date] ) )
)
The SeasonIndex value provides you with a unique numeric value for each chronological Season in your calendar.
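To make that concrete, here is what the helper columns work out to for a few sample dates (values derived from the formulas above):

| Date | Season | SeasonNumber | SeasonIndex |
| --- | --- | --- | --- |
| 15/01/2015 | Summer | Summer 1 | 1 |
| 15/04/2015 | Autumn | Autumn 1 | 2 |
| 15/07/2015 | Winter | Winter 1 | 3 |
| 15/10/2015 | Spring | Spring 1 | 4 |
| 15/12/2015 | Summer | Summer 2 | 5 |
| 15/01/2016 | Summer | Summer 2 | 5 |

Note how December 2015 and January 2016 share the same SeasonNumber and SeasonIndex: they belong to the same summer.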
“Why do I need a Season index?”, you may ask. Well, having a numeric index for the season allows us to do some clever stuff in analysing sales (or budgets), such as calculating the average product sales for a particular season in a particular year. Calculating just the average for a particular year, or the average for one of our four seasons (by implication all years), would give us quite a different result.
Your calendar should look something like this now (I deleted day of week detail and hid the columns I don’t want to be available in my Pivot table fields):
Remember to sort the Month column by the Month Number and the FinMonth column by the FinMthNum, otherwise your graphs and pivot tables will sort months alphabetically.
Let’s imagine you have a table of product quantity per month, for a period spanning a few years. The quantity can be a mixture of sales and budget figures, depending on timing:
…
I imported the numbers in the table to Power Pivot using Get&Transform (Power Query), where I converted the Yearmonth values to a date field. This allows me to create a relationship to my Calendar using the Date column. After that I can report on the sales by season, as it is already in my Calendar:
Say, for instance, I wanted to know the average sales per season, superimposed on my existing pivot chart. Hint: adding a measure that is just the AVERAGE ( ) of the sales qty to the chart won’t look any different from our existing ProductQty measure that uses SUM ( ) as aggregator, because the evaluation context the pivot table provides is too granular for the averaging to show any effect. The desired measure needs to delve a little deeper. Enter AVERAGEX:
SeasonAvgProdQty =
AVERAGEX (
    FILTER (
        ALL ( 'Calendar' ),
        'Calendar'[SeasonIndex] = MAX ( 'Calendar'[SeasonIndex] )
    ),
    [ProductQty]
)
Adding this to the pivot chart (with some formatting to highlight the average values and diminish the monthly values) results in the following:
See how useful that Season index has become? The same measure can be added to a pivot table, of course, if you wanted to see the actual values.
So what else could you use the seasonal average value for? Well, how about that safety stock buffer we talked about earlier? You might have a view that you always want to maintain a safety stock buffer of ten percent above the average seasonal budget.
You can probably already guess how this measure will look:
SafetyStock = 1.1 * [SeasonAvgProdQty]
Predictably, it will just always be 10% higher than the seasonal average value.
Depending on your data, you might decide that working with four seasons provides too much granularity. In the example above, I would be tempted to combine autumn with winter, and spring with summer, resulting in a Warm and a Cold season, or perhaps a Dry and a Wet season. This would require a redefinition of the previous seasonal split:
Season =
SWITCH (
    TRUE (),
    'Calendar'[Month Number] >= 9
        || 'Calendar'[Month Number] <= 2, "Warm",
    "Cold"
)
We would also need to amend the SeasonNumber definition, as the season change where we span across two distinct Year values now starts in month 9 instead of month 12:
SeasonNumber =
'Calendar'[Season] & " "
    & SWITCH (
        TRUE (),
        'Calendar'[Month Number] >= 9, 'Calendar'[Year] - 2013,
        'Calendar'[Year] - 2014
    )
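As a quick check of this amended definition: January 2015 yields “Warm 1” (2015 - 2014), while September 2015, December 2015 and January 2016 all yield “Warm 2” (via 2015 - 2013 for the first two, and 2016 - 2014 for the last), keeping the whole warm season on a single index.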
This results in the following average spread:
Hopefully you’ve learned something, let me know in the comments if you found it useful, and feel free to share with your fellow Power Pivot fans.
Many people who do not regard themselves as Excel gurus find themselves in situations where they would like to represent tabular data in a way that gives them the filtering, cross-tabulation (pivoting) and hierarchical sorting functionality that pivot tables offer.
The Excel wizard for creating pivot tables could certainly have been more instructive, but that doesn’t mean creating a pivot table has to be hard.
An important concept to understand when creating pivot tables is the difference between data in a pivoted layout and data in an unpivoted layout. Data in a pivoted layout is what you would commonly encounter in tabulated form in books, web sites, scientific journals, or even the calendar displayed on your computer or smartphone. It is optimised for human consumption and understanding. A typical feature is that one or more categories are represented as rows, while another category (very often some reference to time) appears in the column headings, with the numeric or categorical data presented in the body of the table. The following table is an example of fictitious data in a pivoted layout (not necessarily an actual pivot table):
Without knowing about pivoted vs. unpivoted layouts, people tend to use data in a pivoted layout as the source for their pivot tables:
The problem with this approach is that it is effectively impossible to swap rows with columns, and if you had hundreds of columns in the source data, it would take you nearly forever to include all of them in the pivot table. Not good. By the way, if the default aggregation the pivot table performs is Count instead of Sum, it is an indication that you have text or error values somewhere in your source data.
There is some good news though. The same data can be presented in unpivoted form by condensing the table to as few columns as absolutely necessary, without being concerned about the number of rows. Let’s look at the first table. Unpivoting starts with recognising that several columns all represent one category of data, in this case time. If we instead wish to represent that category as a single column (MthYear), we can represent the steps conceptually as follows:
Step 1: The columns that represent the category to be transformed into one column (unpivoted) are identified in blue:
Step 2: A new column is created that will contain the old column headings as values:
Step 3: Another new column (Qty) is created that contains all the numeric data of the original table:
Step 4: The old, multiple columns disappear and are replaced by the two new columns, while the row category values are repeated for each new row created, to preserve the context of each numeric value:
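To make the steps concrete, here is a minimal sketch with made-up values (simplified to one category column and two month columns). This pivoted layout:

| Product | Jan-18 | Feb-18 |
| --- | --- | --- |
| A | 10 | 12 |
| B | 7 | 9 |

becomes this unpivoted layout:

| Product | MthYear | Qty |
| --- | --- | --- |
| A | Jan-18 | 10 |
| A | Feb-18 | 12 |
| B | Jan-18 | 7 |
| B | Feb-18 | 9 |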
All the original information is still there; the difference is just that in the unpivoted state the table has only three columns, whereas before it had seven. Data in this state is usually harder for our brains to digest than the same data in a pivoted state, but the reason for wanting it is that Excel actually prefers the unpivoted version as a pivot table data source. Check out the video below for a demonstration of this:
With this source layout, it is very easy to swap rows with columns, and you only have three fields to include, irrespective of the time horizon your source data considers. Filtering on either rows or columns is also much easier.
There are some great videos online explaining how to unpivot data using Get&Transform, but I’ll add a video here showing how to do it for this particular case. For good measure I’ll also show a few useful data transformations and how to change the pivot table layout itself:
I hope you found this explanation useful. I certainly wish I had known about unpivoting and how to optimise data sources for pivot tables much earlier than I did, but it’s never too late to learn!