Inside FetchXML pt 5 - aggregate

One feature of FetchXML that sets it apart from other CDS query options is aggregates. This is the option to get summary data (number of records, sum of opportunity value etc.) rather than the individual values from each record. If your only concept of how many records you have is “5,000+” then this is for you.

Let’s take a look at an example to get the most common first names of your active contacts:

<fetch aggregate="true">
  <entity name="contact">
    <attribute name="firstname" alias="firstname" groupby="true" />
    <attribute name="contactid" alias="occurrences" aggregate="count" />
    <filter>
      <condition attribute="statecode" operator="eq" value="0" />
    </filter>
    <order alias="occurrences" descending="true" />
  </entity>
</fetch>

There are 4 main points to note that are different from a regular query:

The <fetch> element has the aggregate="true" attribute
All the <attribute> elements have an alias. This is required for all aggregate queries, and is the one exception to my “don’t use attribute aliases” rule.
All the <attribute> elements also have EITHER a groupby="true" or aggregate="xxx" attribute
The <order> element uses an alias="xxx" attribute to refer to the previously aliased <attribute> instead of the normal attribute="xxx"

Run this query and you’ll get a list of different firstnames along with the number of active contacts that have that name. The list will be sorted with the most popular name at the top.

Available Aggregates

There are only a few aggregate functions you can apply:

count
countcolumn
min
max
sum
avg

If you’re familiar with SQL these will mostly be familiar as the ways of retrieving the number of records, and the lowest, highest, total and average values in a column.

The one that’s not so obvious is the distinction between count and countcolumn. The only reference I could find for this was a blog dating back to CRM 4.0. In brief, count will give the total number of records, while countcolumn gives the number of records that have a value in the attribute you apply it to. Note that it doesn’t matter which column you apply count to, you’ll get the same total number of records either way.

Distinct Counts

The countcolumn aggregate also supports a distinct attribute. This lets you get the number of different values in that column, e.g.

<fetch aggregate="true">
  <entity name="contact">
    <attribute name="address1_country" alias="country" groupby="true" />
    <attribute name="address1_city" alias="cities" aggregate="countcolumn" distinct="true" />
    <filter>
      <condition attribute="statecode" operator="eq" value="0" />
    </filter>
    <order alias="cities" descending="true" />
  </entity>
</fetch>

This will get a list of countries and the number of different cities in each country.

What “different” means in this context depends on the collation settings for your organisation. For the English installations that I have access to this defaults to ignoring case and accents, so “Münich” and “MUNICH” would be counted as being the same.

Filtering

You can use <filter> specifications in aggregate queries in the same way as normal. See part 3 of this series for more details about filters. The filter will be applied before the aggregate. There is no way to filter the results of the aggregation, e.g. show results where the count is greater than 1, as helpful as that would be.

Grouping

You can run an aggregate query without including any groupby="true" attributes. In this case, all the records that match your filter will be aggregated into a single result record. This is useful for finding the total number of records or highest/lowest values across all your records, e.g.

<fetch aggregate="true">
  <entity name="opportunity">
    <attribute name="opportunityid" alias="count" aggregate="count" />
    <attribute name="estimatedvalue" alias="min" aggregate="min" />
    <attribute name="estimatedvalue" alias="max" aggregate="max" />
    <filter>
      <condition attribute="statecode" operator="eq" value="0" />
    </filter>
  </entity>
</fetch>

This query gives the number of open opportunities, along with the minimum and maximum values of those opportunities.

However, it is often more useful to group the records to see the differences in the summary values in different groups. For example, you can see the pipelines of different sales people by grouping opportunities by their owners and getting the total value of the opportunities in each group:

<fetch aggregate="true">
  <entity name="opportunity">
    <link-entity name="systemuser" to="ownerid" from="systemuserid" link-type="inner">
      <attribute name="fullname" alias="fullname" groupby="true" />
      <order alias="fullname" />
    </link-entity>
    <attribute name="estimatedvalue" alias="value" aggregate="sum" />
  </entity>
</fetch>

This is the first example we’ve looked at that uses a <link-entity>, but they work in aggregate queries in exactly the same way as in regular queries.

Ordering

It’s very useful to sort the results of aggregate queries. For producing reports such as the opportunity pipeline query earlier, the results could be sorted by user to make it easy to find a particular individual, or by the pipeline value to highlight the best or worst achievers.

With the first example to find the number of contacts with each first name, sorting this to find the most and least popular names can often highlight data quality issues. Try it on your data and see what you find!

Limitations

There are a few limitations to aggregate queries you should be aware of

Maximum of 50,000 records. If you try to run an aggregate query that uses more than 50,000 records you’ll get the error AggregateQueryRecordLimit exceeded
No paging support. As for normal queries you can get back up to 5,000 records from your query, but if you have more groups than that there is no way to move on to the next page.

Alternatives

To get around these limitations you’ll need to break up your query with more restrictive filters and combine the results. In SQL 4 CDS I worked around these limits by retrieving all the individual records and applying the aggregation in code when necessary.

For one special case of getting the total number of records in an entity without any filters, i.e.

<fetch aggregate="true">
  <entity name="contact">
    <attribute name="contactid" alias="count" aggregate="count" />
  </entity>
</fetch>

there is a new RetrieveTotalRecordCountRequest message available since version 9 to get this more efficiently and bypassing the 50,000 record limit. SQL 4 CDS will use this optimisation where possible.

SQL Equivalents

These options are equivalent to the GROUP BY clause and aggregate functions in SQL.

The count aggregate is equivalent to count(*) in SQL, and countcolumn is equivalent to count(attributename).

The SQL HAVING clause doesn’t have a FetchXML equivalent – this is one improvement to FetchXML I’d love to see.

RetrieveTotalRecordCountRequest is equivalent to SELECT count(*) FROM entity, with no option to apply a filter. I have a feeling it may actually map to the sp_spaceused SQL Server stored procedure under the hood.

16 thoughts on “Inside FetchXML pt 5 – aggregate”

Pingback: Inside FetchXML pt 6 - datetime grouping - Mark Carrington
Pingback: SQL 4 CDS 1.0.6 Released! - Mark Carrington
John Ledger says:

10th March 2021 at 12:55 pm

Currently it seems the aggregate works on the top level entity, and not a link-entity several layers deep, that would be a great feature.

Reply
1. Mark Carrington says:
  
  10th March 2021 at 2:01 pm
  It certainly should work on a link-entity, for example to get the number of contacts in each account:
```
<fetch aggregate="true">
  <entity name="account">
    <attribute name="name" alias="name" groupby="true" />
    <link-entity name="contact" from="parentcustomerid" to="accountid">
      <attribute name="contactid" aggregate="count" alias="contacts" />
    </link-entity>
    <order alias="name" />
  </entity>
</fetch>
```
  It's quite easy to hit the 50,000 record limit for aggregate queries when you do this though, so you might need to add filters to reduce your results.
  Reply
Bill Gates says:

28th April 2021 at 5:13 pm

I’m using distinct to try and eliminate duplicate rows from my join, but I’m getting a count of the linked entity instead of the parent, aghaa:

Reply
1. Mark Carrington says:
  
  28th April 2021 at 5:27 pm
  
  I can’t see the query you’re using, but using countcolumn and distinct should get you the result you need I think
  
  Reply
  1. Bill Gates says:
    
    28th April 2021 at 5:38 pm
    
    Hi,
    
    I found the problem, I was using the same value for alias and aggregate attributes, making these unique resolved my issue.
    
    Great blog btw. I have found many of the fetch XML articles really useful, bookmarked, thank you!
    
    Reply
OkayGoogle says:

13th May 2021 at 11:26 pm

There are two more requirements I have found for aggregates:
1. Cannot group by datetime. In other words, if you have an attribute that happens to have datetime data and you set “groupby=’true'”, it will error.
2. Must include entityname inside top level condition if trying to filter on a linked entity attribute. For example:

Reply
1. Mark Carrington says:
  
  14th May 2021 at 4:22 pm
  Yes, you can’t group by a datetime column without using a datepart option. If you want to group by date you can group by the same column three times, grouping by year, month and date:
```
<fetch xmlns:generator="MarkMpn.SQL4CDS" aggregate="true">
  <entity name="account">
    <attribute name="createdon" alias="createdon_year" groupby="true" dategrouping="year" />
    <attribute name="createdon" alias="createdon_month" groupby="true" dategrouping="month" />
    <attribute name="createdon" alias="createdon_day" groupby="true" dategrouping="day" />
    <attribute name="accountid" alias="count" aggregate="count" />
  </entity>
</fetch>
```
  Reply
MaJellin says:

16th November 2021 at 10:51 pm

I’m running into the AggregateQueryRecordLimit exception when I try to do a count distinct similar to “” because the query involves multiple joins and each match is adding to the initial results set which is not distinct. In the sql profiler it appears to be doing a SELECT COUNT(DISTINCT _idname_) as “alias” from (select top 50001 _idname_ …joins…) and then throwing the exception if the “select top 50001” row count is == 50001. If I run the sql directly, I see that the actual count result is considerably less than 50001, its just the intermediate, non-distinct query which has over 50001 rows. Is there any way to prevent this behavior? Perhaps a way to instruct the query to use “DISTINCT” on the intermediate query? Thanks 🙂

Reply
1. Mark Carrington says:
  
  17th November 2021 at 7:50 am
  
  Can you try posting your original query again please, it didn’t seem to have come through in your comment correctly. If it still doesn’t appear, please add it as an issue at https://github.com/MarkMpn/Sql4Cds/issues/new
  
  Reply
  1. MaJellin says:
    
    17th November 2021 at 1:51 pm
    Sorry about that, here it is.
    
    <fetch version="1.0" output-format="xml-platform" mapping="logical" distinct="true" aggregate="true" page="1" count="50001" > <entity name="entityA"> <attribute name="entityAid" aggregate="countcolumn" alias="dcount" distinct="true" /> <filter type="and"> <condition attribute="entityB_new_course_joinid" operator="not-null" value="0" entityname="genAlias_0" /> <condition attribute="new_contact_accountid" operator="not-null" value="0" entityname="genAlias_2" /> </filter> <link-entity name="entityD" from="entityDid" to="new_entityDid_entityA" link-type="inner" alias="entityD_0" > <link-entity name="entityC" from="entityCid" to="entityC_courseinstance" link-type="inner" alias="entityC_0" > <link-entity name="entityB_entityC_join" from="new_entityCid_entityB_entityC_join" to="entityCid" link-type="outer" alias="genAlias_0" > <link-entity name="entityB" from="entityBid" to="new_entityBid_entityB_entityC_join" link-type="inner" alias="entityB_0" > <filter type="or"> <condition attribute="entityBid" operator="eq-or-under" value="e8d05f28-59cd-eb11-a2df-0050568beee8" /> <condition attribute="entityBid" operator="eq-or-under" value="11fce101-59cd-eb11-a2df-0050568beee8" /> </filter> </link-entity> </link-entity> </link-entity> </link-entity> <link-entity name="contact" from="contactid" to="new_contactid_entityA" link-type="inner" alias="contact_1" > <link-entity name="new_contact_account" from="new_contactid_new_contact_account" to="contactid" link-type="outer" alias="genAlias_2" > <link-entity name="account" from="accountid" to="new_accountid_new_contact_account" link-type="inner" alias="account_0" > <filter type="or"> <condition attribute="accountid" operator="under" value="394d3236-23ea-e511-940a-005056834654" /> <condition attribute="accountid" operator="eq" value="394d3236-23ea-e511-940a-005056834654" /> </filter> </link-entity> </link-entity> </link-entity> </entity> </fetch>
    Reply
2. Mark Carrington says:
  
  17th November 2021 at 3:52 pm
  
  Right, this is the expected behaviour and as far as I know there’s no way to change it. I assume it’s to avoid queries taking too long/too many resources if they need to use millions of records to produce the aggregate. Because of this you can’t get a total count of records using an aggregate query if you’ve got more than 50k records, which is why the RetrieveTotalRecordCount request exists.
  
  For more complex aggregates that go over this limit though you need to retrieve the individual records in pages and calculate the aggregates yourself. This is the approach that SQL 4 CDS uses – it will try the straightforward aggregate query first, but if that hits the AggregateQueryRecordLimit exception it tries again reading the records and calculating its own aggregates.
  
  Reply
  1. MaJellin says:
    
    17th November 2021 at 4:17 pm
    
    I see. What I had been doing was catching the limit exception, adding 50000 to the total, skipping 50000 and querying the count again and again until I no longer received the exception. This works properly for queries which dont involve as many joins. However, this no longer works since, in this case for example, the actual total count is only 24,988 and skipping 50000 then brings me past the end of the results set. It’s unfortunate, but I suppose I understand. So it seems the only way to do this is to query for record id in batches of 5000 and add that up which takes significantly longer and used more memory/cpu. Thank you for your help.
    
    Reply
    1. Mark Carrington says:
      
      17th November 2021 at 9:21 pm
      
      You may also have a couple of other options depending on the details. Rather than repeatedly skipping records and re-counting, you could also repeatedly change the filter criteria. For example, if your main entity had a “name” column that was reasonably evenly distributed you could get a count of all the records where it started with “A”, then again with “B” etc., then add the counts together. This is only really feasible if you’ve got a good understanding of your data and how you can break it down with different filters, and you’d need to make it robust enough to cope with the volume of data in each bucket changing over time.
      
      A more straightforward approach though might be to remove the use of FetchXML entirely and use the TDS endpoint instead. This would let you write the “SELECT COUNT(*) FROM joins” query that you want without having to worry about record counts at all, although the query would still be subject to the 2 minute timeout limit. This also isn’t an option if you are trying to run the query from within a plugin.
      
      Reply
      1. MaJellin says:
        
        18th November 2021 at 1:01 am
        
        I’m not sure if we have access to the TDS endpoint as we are on an older version (9.0) of On Premise, but I will definitely check that out. I think I have figured out a kind of middle ground for now to get around this limitation. If I get the limit exception and skip 50000 and then get a count of zero after that, I can infer that the actual total is less than 50000, so I then (somewhat arbitrarily) subtract 5000 from my total and my skip and query for count again. If the result is again zero, I repeat until it is non-zero and then add that non-zero result to my running (in reverse) total. I compared the runtimes on a handful of queries versus the runtime to query for the full set of Ids page by page and in all cases it was significantly faster to do the “reverse skip” method.
        
        Thanks again for the reply, your posts are very helpful/knowledgeable regarding fetchXML (especially since the official docs are not)