When AI meets BI: How artificial intelligence may change Power BI development.

AI is driving change – this blog looks at how AI generated Vega/Vega-lite may impact Power BI development.

Introducing Vega/Vega-lite

Vega and Vega-lite (a simplified version of Vega) are visualization grammars – JSON schemas for specifying how data should be presented visually. Vega/Vega-lite allows almost any visualisation to be specified. Power BI currently supports a large and growing number of visualizations – but is it really necessary to have some many? Is it possible to have one visuallization that can be customised to do whatever is needed?

Continue reading

The DAX Summary Pattern

One of the principles of Python is “There should be one, and preferably only one, obvious way to do it”. DAX isn’t like this. DAX has many functions that do similar things.  For example, joining two strings with a space between them can be done in three different ways:

  1. "String1" & " " & "String2"   
  2.  COMBINEVALUES ( " ", "String1", "String2" )
  3.  CONCATENATE ( CONCATENATE ( "String1", " " ), "String2" )

Since there is no obvious way to write DAX, programmers develop a repertoire of techniques that can then be applied in similar situations. This blog discusses functionX functions and the Summary Pattern. This useful pattern should be in every DAX programmer’s repertoire.

FunctionX functions

DAX has several functions that come in function and functionX flavours:

SUMSUMX
MINMAXX
MAXMINX
COUNTCOUNTX
COUNTACOUNTAX
 CONCATENATEX
AVERAGEAVERAGEX

CONCATENATE has been left off the list above because it is an exception. All the others have the following syntax:

SUM(<'Table Name'><[Column Name]>)
SUMX(<'Table Name'>,<Column Expression>) 

Where <Column Expression> is either an arithmetic expression or a string-handling expression.The following two expressions will produce the same result

SUM(‘Sales’[Amount])
SUMX(‘Sales’,[Amount])

The first example can only be used with a single column parameter. The second example is more flexible, and can take an expression. For example:

SUMX(‘Sales’,1-[DiscountPercent]*[Amount])

Why have a SUM function that has such a narrow limitation, and then introduce a second function to get round the limitation?

DAX was designed for a columnar database. In columnar databases, the individual columns of a table are compressed and stored as separate data structures. If more than one column is accessed the table needs to be assembled from these separate data structures, and this is resource intensive. If only one column is accessed this isn’t necessary, and the query is much more efficient.  While there appears to be no performance difference between SUM('Sales'[Amount]) and SUMX('Sales',[Amount]), when more than one column is involved it is better DAX to write SUM(‘Sales’[Amount]) + SUM(‘Sales’[Postage]) than SUMX(‘Sales,[Amount]+[Postage]), even if, in practice, there is no difference in performance.

In many cases the logic requires the use of a SUMX function, and this won’t necessarily cause a performace issue. The measure SUMX('Sales',1-[DiscountPercent]*[Amount]) will generally perform well if [DiscountPercent] is a column or quick measure. If [DiscountPercent] is a computationally-intensive measure, performance may not be acceptable.

The Summary Pattern discussed below offers a solution to this. It combines the multi-column functionality of functionX functions with the speed of single-column functions. However, before looking at the Summary Pattern, it’s useful to have a good understanding of how to summarise data in DAX.

The above approach is recommended by Marco Russo and Alberto Ferrari, and others.

Summarising data in DAX

Function and FunctionX functions may initially appear confusing, but at least there is a logic behind it. Not only does DAX have no single equivalent to a SQL SELECT statement, there is also no obvious reason for the various alternative commands.

This blog uses the SUMMARIZECOLUMNS command because it is well-optimised by the DAX engine, it is the most flexible of the options listed above, is in some ways like the SQL SELECT statement, and it is used internally by PowerBI visuals. An example of this command is:

EVALUATE SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code]
)

The statements below all return the same result set as the SUMMARIZECOLUMNS command above (Examples are based on the AdventureWorks pbix).

EVALUATE Customer

EVALUATE CALCULATETABLE (Customer)

EVALUATE DISTINCT ( Customer )

EVALUATE VALUES ( Customer )

EVALUATE SELECTCOLUMNS (
    Customer,
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code]
)

EVALUATE SUMMARIZE (
    Customer,
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code]
)

EVALUATE FILTER ( Customer, Customer[CustomerKey] )

The DAX

EVALUATE SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code]
)


is equivalent to the SQL statement:

SELECT
 Customer[CustomerKey],
 Customer[Customer ID],
 Customer[Customer],
 Customer[City],
 Customer[State-Province],
 Customer[Country-Region],
 Customer[Postal Code]
FROM Customer
GROUP BY
 Customer[CustomerKey],
 Customer[Customer ID],
 Customer[Customer],
 Customer[City],
 Customer[State-Province],
 Customer[Country-Region],
 Customer[Postal Code]

Or in general terms,

EVALUATE SUMMARIZECOLUMNS (
<Column
 Block>
)

Is the equivalent of

SELECT
<Column
 Block>
FROM <all tables mentioned in column block, cross joined>
GROUP BY
<Column
 Block>

This is true even if there is a relationship between the tables mentioned in the column block. The AdventureWorks database has a one-to-many relationship between Customer and Sales, but

EVALUATE SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Sales[CustomerKey]  
)

will result in a cross product between the Customer table and the Sales table.

It is reasonable to assume that the SUMMARIZECOLUMNS command allows the results to be filtered, and that something like the code below would work:

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code],
    Customer[CustomerKey] = 11000)

Unfortunately, not only does this throw and error, but the error message is obscure:

A single value for column ‘CustomerKey’ in table ‘Customer’ cannot be determined. This can happen when a measure formula refers to a column that contains many values without specifying an aggregation such as min, max, count, or sum to get a single result.

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code],
    {11000})

The above code does not throw an error, but the hard-coded 11000 does not filter the result either. It is just ignored because DAX values have a column-type. In this example, {1000} is interpreted as a new table and does  not filter any existing table. This is changed in example below

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code],
    TREATAS ( { 11000 }, Customer[CustomerKey] )
)

This works and returns the single row expected. The filter values are often stored in a variable:

DEFINE VAR _Filter = TREATAS ( { 11000 }, Customer[CustomerKey] )
EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code],
    _Filter
)

While TREATAS does not support logical comparisons, these can be done in the FILTER and CALCULATETABLE commands.

DEFINE
 VAR _Filter = FILTER(VALUES(Customer[CustomerKey]), Customer[CustomerKey] < 11010)

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code],
    _Filter
)

For a more in-depth discussion, see Filtering Data .

Interestingly, while

DEFINE
 VAR _Filter = FILTER(VALUES(Sales[CustomerKey]), Sales[CustomerKey] = 11010)

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Sales[CustomerKey],
    _Filter
)

Results in the expected cross-product, with Sales[Customerkey) filtered by the _Filter. However, the DAX:

DEFINE
 VAR _Filter = FILTER(VALUES(Sales), Sales[CustomerKey] = 11010)

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Sales[CustomerKey],
    _Filter
)

Converts the cross join into an inner join.

The aggregated values can now be added.  The syntax is:

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Customer[Customer ID],
    Customer[Customer],
    Customer[City],
    Customer[State-Province],
    Customer[Country-Region],
    Customer[Postal Code],
    "Aggregate Name",<measure name or expression>
)

There is another useful side-effect here – rows are filtered if all the measures are blank. So

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Sales[CustomerKey],
    "Aggregate Name",Blank()
)

Will never return any data.  This is often used convert the cross-join into an inner join, because the measure calculation will take the relationship between Customer and Sales into account.

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Sales[CustomerKey],
    "#Sales",COUNT(Sales[CustomerKey])
)

This returns one row per CustomerKey, with total number of sales as [#Sales].  Using a blank measure to convert the cross join to an inner join will only work if the measure calculations for rows other than inner join rows all return blank values. In the example below one measure is hard-coded and will always return a value, so this DAX will still return the cross-join.

EVALUATE
SUMMARIZECOLUMNS (
    Customer[CustomerKey],
    Sales[CustomerKey],
    "#Sales",COUNT(Sales[CustomerKey]),
    "NonBlank",1
)

The Summary Pattern

Single-column functions like SUM or MAX are resource-efficient, but are limited to accessing only one column. The corresponding iterators, SUMX or MAXX, can perform operations between columns but consume more resources and can be very slow.

The summary pattern gets the best of both worlds by using single-column functions to do the heavy lifting, and iterators to then perform the operations between columns.

Foreign exchange calculations are a common use-case.

For example, if the database had following Sales table:

ProductKeyCurrencyCodeSalesAmount
101100USD1.234
101101GBP0.99

The measure:

[TotalSales] = SUM(‘Sales’[SalesAmount])

Would give misleading information because it sums value in differnt currencies.  If there was a measure [Exchange Rate], the following would do the exchange conversion:

[TotalSales]=SUMX('Sales','Sales'[SalesAmount]*[Exchange Rate])

This code may be unacceptably slow if the ‘Sales’ table is large. While there may be millions of rows in the Sales table, there are fewer than 200 different currencies in the world. So summarising the data per currency will return fewer than 200 rows:

EVALUATE
SUMMARIZECOLUMNS (
    Sales[CurencyCode],
    "@TotalSales",CALCULATE(SUM(Sales[SalesAmount])),
    "@ExchangeRate",[Exchange Rate]
)

SUMX over this small table will be fast. The measure can now be written as:    

MEASURE 'Sales'[TotalSales] =
    SUMX (
        SUMMARIZECOLUMNS (
            Sales[CurencyCode],
            "@TotalSales", CALCULATE ( SUM ( Sales[SalesAmount] ) ),
            "@ExchangeRate", [Exchange Rate]
        ),
        [@TotalSales] * [@ExchangeRate]
    )

For clarity, this is often written as:

'Sales'[MeasureName] =
    VAR _Summary =
        SUMMARIZECOLUMNS (
          Sales[CurencyCode],
          "@TotalSales", CALCULATE ( SUM ( Sales[SalesAmount] ) ),
          "@ExchangeRate", [Exchange Rate]
        )
    RETURN
        SUMX ( _Summary, [@TotalSales] * [@ExchangeRate] )

This is the Summary Pattern. First summarize the data into as small a table as possible using SUM, and then use SUMX on this small table to do the column manipulation. This offers the speed of SUM and the flexibility of SUMX. This pattern is also useful for nesting aggrate functions – a SUM of a MAX, for example.

Five ways of combining several fact tables in a PowerBI report

Power BI reports often combine several fact tables, typically from several sources. Videos on multiple fact tables often claim one method as ‘best practice’ and deprecate alternative approaches. Different approaches have different strengths and weaknesses, and in this blog I discuss five ways separate approaches to combining several fact tables in a single report. The advantages and limitations of each are highlighted.

Continue reading