AI is driving change – this blog looks at how AI generated Vega/Vega-lite may impact Power BI development.
Introducing Vega/Vega-lite
Vega and Vega-lite (a simplified version of Vega) are visualization grammars – JSON schemas for specifying how data should be presented visually. Vega/Vega-lite allows almost any visualisation to be specified. Power BI currently supports a large and growing number of visualizations – but is it really necessary to have some many? Is it possible to have one visuallization that can be customised to do whatever is needed?
One of the principles of Python is “There should be one, and preferably only one, obvious way to do it”. DAX isn’t like this. DAX has many functions that do similar things. For example, joining two strings with a space between them can be done in three different ways:
Since there is no obvious way to write DAX, programmers develop a repertoire of techniques that can then be applied in similar situations. This blog discusses functionX functions and the Summary Pattern. This useful pattern should be in every DAX programmer’s repertoire.
FunctionX functions
DAX has several functions that come in function and functionX flavours:
SUM
SUMX
MIN
MAXX
MAX
MINX
COUNT
COUNTX
COUNTA
COUNTAX
CONCATENATEX
AVERAGE
AVERAGEX
CONCATENATE has been left off the list above because it is an exception. All the others have the following syntax:
Where <Column Expression> is either an arithmetic expression or a string-handling expression.The following two expressions will produce the same result
SUM(‘Sales’[Amount])
SUMX(‘Sales’,[Amount])
The first example can only be used with a single column parameter. The second example is more flexible, and can take an expression. For example:
SUMX(‘Sales’,1-[DiscountPercent]*[Amount])
Why have a SUM function that has such a narrow limitation, and then introduce a second function to get round the limitation?
DAX was designed for a columnar database. In columnar databases, the individual columns of a table are compressed and stored as separate data structures. If more than one column is accessed the table needs to be assembled from these separate data structures, and this is resource intensive. If only one column is accessed this isn’t necessary, and the query is much more efficient. While there appears to be no performance difference between SUM('Sales'[Amount]) and SUMX('Sales',[Amount]), when more than one column is involved it is better DAX to write SUM(‘Sales’[Amount]) + SUM(‘Sales’[Postage]) than SUMX(‘Sales,[Amount]+[Postage]), even if, in practice, there is no difference in performance.
In many cases the logic requires the use of a SUMX function, and this won’t necessarily cause a performace issue. The measure SUMX('Sales',1-[DiscountPercent]*[Amount]) will generally perform well if [DiscountPercent] is a column or quick measure. If [DiscountPercent] is a computationally-intensive measure, performance may not be acceptable.
The Summary Pattern discussed below offers a solution to this. It combines the multi-column functionality of functionX functions with the speed of single-column functions. However, before looking at the Summary Pattern, it’s useful to have a good understanding of how to summarise data in DAX.
Function and FunctionX functions may initially appear confusing, but at least there is a logic behind it. Not only does DAX have no single equivalent to a SQL SELECT statement, there is also no obvious reason for the various alternative commands.
This blog uses the SUMMARIZECOLUMNS command because it is well-optimised by the DAX engine, it is the most flexible of the options listed above, is in some ways like the SQL SELECT statement, and it is used internally by PowerBI visuals. An example of this command is:
SELECT
Customer[CustomerKey],
Customer[Customer ID],
Customer[Customer],
Customer[City],
Customer[State-Province],
Customer[Country-Region],
Customer[Postal Code]
FROM Customer
GROUP BY
Customer[CustomerKey],
Customer[Customer ID],
Customer[Customer],
Customer[City],
Customer[State-Province],
Customer[Country-Region],
Customer[Postal Code]
Or in general terms,
EVALUATE SUMMARIZECOLUMNS (
<Column
Block>
)
Is the equivalent of
SELECT
<Column
Block>
FROM <all tables mentioned in column block, cross joined>
GROUP BY
<Column
Block>
This is true even if there is a relationship between the tables mentioned in the column block. The AdventureWorks database has a one-to-many relationship between Customer and Sales, but
Unfortunately, not only does this throw and error, but the error message is obscure:
A single value for column ‘CustomerKey’ in table ‘Customer’ cannot be determined. This can happen when a measure formula refers to a column that contains many values without specifying an aggregation such as min, max, count, or sum to get a single result.
The above code does not throw an error, but the hard-coded 11000 does not filter the result either. It is just ignored because DAX values have a column-type. In this example, {1000} is interpreted as a new table and does not filter any existing table. This is changed in example below
Will never return any data. This is often used convert the cross-join into an inner join, because the measure calculation will take the relationship between Customer and Sales into account.
This returns one row per CustomerKey, with total number of sales as [#Sales]. Using a blank measure to convert the cross join to an inner join will only work if the measure calculations for rows other than inner join rows all return blank values. In the example below one measure is hard-coded and will always return a value, so this DAX will still return the cross-join.
Single-column functions like SUM or MAX are resource-efficient, but are limited to accessing only one column. The corresponding iterators, SUMX or MAXX, can perform operations between columns but consume more resources and can be very slow.
The summary pattern gets the best of both worlds by using single-column functions to do the heavy lifting, and iterators to then perform the operations between columns.
Foreign exchange calculations are a common use-case.
For example, if the database had following Sales table:
ProductKey
CurrencyCode
SalesAmount
101100
USD
1.234
101101
GBP
0.99
The measure:
[TotalSales] = SUM(‘Sales’[SalesAmount])
Would give misleading information because it sums value in differnt currencies. If there was a measure [Exchange Rate], the following would do the exchange conversion:
This code may be unacceptably slow if the ‘Sales’ table is large. While there may be millions of rows in the Sales table, there are fewer than 200 different currencies in the world. So summarising the data per currency will return fewer than 200 rows:
This is the Summary Pattern. First summarize the data into as small a table as possible using SUM, and then use SUMX on this small table to do the column manipulation. This offers the speed of SUM and the flexibility of SUMX. This pattern is also useful for nesting aggrate functions – a SUM of a MAX, for example.
Power BI reports often combine several fact tables, typically from several sources. Videos on multiple fact tables often claim one method as ‘best practice’ and deprecate alternative approaches. Different approaches have different strengths and weaknesses, and in this blog I discuss five ways separate approaches to combining several fact tables in a single report. The advantages and limitations of each are highlighted.
Power BI has several visuals that are often used to show size comparisons. The visuals below show the relative sizes of Alaska, California and Texas using a pie chart, donut chart, tree view and a stacked bar chart. Graphic designers may argue about which is best, but they all do the job.
Many Corporations have guidelines on how to present a consistent brand identity in visual communication. Some Microsoft products, such as Word and PowerPoint, support the use of templates that make corporate branding easy to achieve. Is there something similar for Power BI?