I hit a strange situation yesterday. I was retrieving data from Dataverse with a FetchXML query but some values were missing from my results, while others were duplicated. Turns out it’s an easy situation to get into without noticing.

Paging Recap

I wrote a post last year on using paging to retrieve large amounts of data from Dataverse. In essence, you can’t retrieve more than 5,000 records at a time. Instead you have to make subsequent requests to retrieve the next 5,000, then the next, …

As well as the number of the page of results you want to retrieve, if you’re loading the pages of results sequentially, you should also include the paging cookie returned with the previous page to improve performance.

How can it go wrong?

Sounds simple so far – load the first 5,000 records, then the next 5,000. But how does the system know what the “next” 5,000 are?

Ideally you have a sort order applied to your query, so the records are in a consistent order from one request to the next. This way Dataverse can keep a note of the last record you saw (encoded in the paging cookie) and use this on your next request to jump to exactly the right position in the sorted list.

If you don’t apply a sort order yourself, Dataverse will add one silently for you behind the scenes, so the results will be sorted by the unique ID column.

However, if you include the distinct option, a sort order will not be added automatically. This leaves you open to getting missing or duplicated records after you’ve retrieved all the pages.

Behind the scenes at this point Dataverse is running a SQL command like:

SELECT DISTINCT TOP 5001 c.fullname
FROM   account
       INNER JOIN
       contact AS c
       ON account.accountid = c.parentcustomerid

For each page of data you ask for, Dataverse simply increases the TOP count in this query, throws away the records for the previous pages and then returns the last page. The problem is, SQL doesn’t enforce any particular ordering of the results of this query, so the results could look like:


In this example the calling application will see records B and C duplicated, and will never see records D or F.

How much of a problem is it really?

You might think it sounds unlikely enough that you don’t need to worry about it, but I can replicate this problem quite easily with just a few thousand accounts and contacts. If you want to get all the different names of contacts in accounts you might use:

<fetch xmlns:generator="MarkMpn.SQL4CDS" distinct="true">
  <entity name="account">
    <link-entity name="contact" to="accountid" from="parentcustomerid" alias="c" link-type="inner">
      <attribute name="fullname" />
    </link-entity>
  </entity>
</fetch>

If you check the total number of results you’ll see it’s correct, but if you do your own simple de-duplication of the names you’ll find some names are repeated. As the total number is right and some are duplicated, then the same number must also be missing.

How to fix it?

Luckily it’s quite simple to fix, you just need to add a sort order to your query. Ideally you want to sort on something that will be unique between each record in the results, so I’d change the query above to become:

<fetch xmlns:generator="MarkMpn.SQL4CDS" distinct="true">
  <entity name="account">
    <link-entity name="contact" to="accountid" from="parentcustomerid" alias="c" link-type="inner">
      <attribute name="fullname" />
      <order attribute="fullname" />
    </link-entity>
  </entity>
</fetch>

Chances are if you’re doing this sort of query in your application then you’ll have a helper function to manage the paging, so I’d also recommend adding a check into that function to ensure a sort order is applied to the query before it runs so nothing accidentally slips though the cracks:

if (MissingRequiredSort(queryExpression))
  throw new NotSupportedException("QueryExpression missing sort on DISTINCT query");

private bool MissingRequiredSort(QueryExpression qe)
{
  if (!qe.Distinct)
    return false;

  if (qe.Orders.Count > 0)
    return false;

  return qe.LinkEntities.All(le => MissingRequiredSort(le));
}

private bool MissingRequiredSort(LinkEntity le)
{
  if (le.Orders.Count > 0)
    return false;

  return le.LinkEntities.All(childLink => MissingRequiredSort(childLink));
}

There will be an update coming to SQL 4 CDS shortly that will do this check to ensure you don’t get unexpected results.

One thought on “Dataverse Paging with Distinct”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.