CDS T-SQL Endpoint pt 6 – Aggregates

One of the most frustrating things when trying to do any sort of analysis of data in CDS is the fetch aggregate limit. If you haven’t come across this before, it is a hard limit of 50,000 records that the platform will process when calculating aggregate values such as SUM, COUNT etc. If you try to run a query in FetchXML Builder, for example, that needs to process more than 50k records you’ll get an “AggregateQueryRecordLimit exceeded” error.

The documentation advises working around this error by breaking up your query into smaller chunks and then aggregating the results. This will work, but involves a lot more thought, time and scope for error. Improving this experience was one of my main objectives for SQL 4 CDS.
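As an illustration of that workaround, the idea is to aggregate each chunk separately and then combine the partial results yourself (hypothetical date ranges; in FetchXML you’d express the same filters with condition elements):

```sql
-- Each query processes fewer than 50,000 records, so stays under the limit.
-- The partial counts then need adding together client-side.
SELECT COUNT(*) FROM lead WHERE createdon >= '2020-01-01' AND createdon < '2020-02-01'
SELECT COUNT(*) FROM lead WHERE createdon >= '2020-02-01' AND createdon < '2020-03-01'
-- ...and so on, one chunk per month
```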

T-SQL Results

My first simple test was to do a bulk import into the Lead entity and then run a simple query:

SELECT count(*) FROM lead

The result was simple but very pleasing:

Woohoo! It looks like the aggregate limit has gone for T-SQL!

Just to check this isn’t a special case for this one query (getting the total record count is also possible via the SDK with the RetrieveTotalRecordCountRequest), I also tried:

SELECT firstname, COUNT(*) FROM lead GROUP BY firstname ORDER BY 2 DESC

and a number of my other common data quality checks that use aggregation and they all worked correctly. 👍
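For comparison, the SDK special case I mentioned looks something like this (a sketch — as I understand it the count it returns is a periodically-updated snapshot rather than a live aggregate, which is how it sidesteps the limit):

```csharp
// Sketch using the SDK (Microsoft.Crm.Sdk.Messages) to get a record
// count without hitting the aggregate limit. "org" is an IOrganizationService.
var request = new RetrieveTotalRecordCountRequest
{
    EntityNames = new[] { "lead" }
};

var response = (RetrieveTotalRecordCountResponse)org.Execute(request);
Console.WriteLine(response.EntityRecordCountCollection["lead"]);
```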

New Options

FetchXML provides the most common aggregate options:

  • COUNT
  • MIN
  • MAX
  • SUM
  • AVG

These can be used across the entire result set, or you can use GROUP BY to get the results for each group. Another bugbear of mine has been the lack of an equivalent to the HAVING clause.

So I was pleased to find we now also have support for:

  • all the rest of the aggregate functions supported by T-SQL to do more advanced analysis such as standard deviation and variance
  • a HAVING clause to filter the aggregated results
  • GROUP BY xxx WITH ROLLUP to generate sub-totals and grand totals
  • window functions to generate running totals and row numbers in non-aggregated results

Thoughts

So this gets me to thinking, why is it different in T-SQL than FetchXML? Both are, after all, querying the same underlying database.

The aggregate query limit has been around for a long time – is this just a hold-over from when the hardware couldn’t cope with this as well as it can today, and could be removed entirely now?

Or is it that the T-SQL endpoint is running in such a fundamentally different way than the rest of the platform that the limit is either not necessary or too difficult to implement?

My gut feeling is that it’s more towards the latter. After seeing some of the other queries that are now possible that aren’t in FetchXML, and some of the ways that it’s not returning the results I’d expect just yet, my current thought is that the T-SQL endpoint does some re-writing of the incoming SQL query to make it safe, then passes it on to the underlying Azure SQL Database. With the variety of ways you can write a SQL query it must be very hard to ensure that the rewritten query returns the same results as the original, so this might be a case of caution on the part of the product team to avoid breaking too many queries.

But at the same time I have to ask, if we can get this now via T-SQL, can we have the limit removed for FetchXML too please? 🙏

CDS T-SQL endpoint pt 5 – EntityFramework / ORM

Now I can reliably connect to the CDS T-SQL endpoint and execute queries against it, I want to get the data into some objects for me to use. I currently use EntityFramework and Dapper for this in other projects, so I’ll take a look at those two.

Entity Framework

Update 2020-05-24: Microsoft have confirmed that Entity Framework is not supported with CDS, which is disappointing but understandable considering it’s not a direct real SQL connection. I’ve left the details of my investigations below for interest, but if you’re looking for an alternative try skipping on down to the section on Dapper.

To get started, in Visual Studio I selected Add New Item and selected “ADO.NET Entity Data Model” and used the “EF Designer from database” option. The first thing you’ve got to do here is set up the connection. With my connection details in hand I split them up to fit the fields on the connection dialog – server name is contoso.crm.dynamics.com,5558, authentication method is Active Directory Password Authentication and entered my username and password. Click Test Connection and it says it’s succeeded, but I can’t click OK just yet:

So far I haven’t selected a database name. No matter, I’ll just click this drop down and… oh, there’s nothing there. OK, I’ve already seen in SSMS that the database name is the same as the organization name (e.g. contoso, or more likely org12345678), so I’ll type this in instead. Great, the OK button is enabled. Click that and…

It’s obviously connected, but it looks like the EF tooling is now sending a SQL command to CDS that CDS doesn’t support.

Next I tried connecting to a regular SQL Server database to create a model, then switched the connection string over to point to CDS and added the tables I want that way. This also failed, initially giving me the same error as before, then:

So, I don’t think this is ready to go with Entity Framework just yet. But I’m still curious what it’s failing on. From the last error message the INFORMATION_SCHEMA.VIEWS view is not available, but what other query didn’t it like before when I got the “Unsupported Sql query” error? Is there something in there that I need to be aware of for my own queries?

I tried setting up a new model against an actual SQL Server database again, but this time with the profiler open so I could see what queries the Entity Framework designer was running, then ran those same queries myself one at a time against CDS in SSMS and found a few bugs in how it handles them.

CASE expressions

The first query it runs is:

SELECT CASE WHEN Convert(varchar, SERVERPROPERTY(N'EDITION')) = 'SQL Data Services' OR SERVERPROPERTY(N'EngineEdition') = 5 THEN 1 ELSE 0 END

presumably to check if the SQL Server is a supported version. If I run this against SQL Server I get a result of 0. CDS however gives the individual server properties as separate columns and not the calculated value from the CASE expression.

I’ve checked a similar query against regular CDS data and it did work as expected, so I’m not sure why the first one here is failing:

Listing Databases

Next up it runs one query to get the list of available databases, and another to get the current database name:

SELECT name FROM master.dbo.sysdatabases WHERE HAS_DBACCESS(name) = 1 ORDER BY name

SELECT DB_NAME()

The first of these gives the error:

Msg 40000, Level 16, State 1, Line 6
Invalid table name sysdatabases. Line:2, Position:18

and the second one gives:

Msg 40000, Level 16, State 1, Line 9
Unsupported Sql query.

OK, we’ve found the source of our error message! Hopefully it should be possible to get this one fixed before this feature is out of preview.

Version Information

Carrying on though to see if there are any more queries that might cause it a problem, it next runs SET LOCK_TIMEOUT 30000 which it handles fine, then another version check query that has the same bug as before.

Next it runs this query to get a whole bunch of information about the server and connection environment:

DECLARE @edition sysname;
SET @edition = cast(SERVERPROPERTY(N'EDITION') as sysname);
SELECT case when @edition = N'SQL Azure' then 2 else 1 end as 'DatabaseEngineType',
SERVERPROPERTY('EngineEdition') AS DatabaseEngineEdition,
SERVERPROPERTY('ProductVersion') AS ProductVersion,
@@MICROSOFTVERSION AS MicrosoftVersion;
select host_platform from sys.dm_os_host_info
if @edition = N'SQL Azure' 
  select 'TCP' as ConnectionProtocol
else
  exec ('select CONVERT(nvarchar(40),CONNECTIONPROPERTY(''net_transport'')) as ConnectionProtocol')

This fails because it doesn’t understand the @@MICROSOFTVERSION variable or the sys.dm_os_host_info DMV.

Switching Databases

Next up it tries to switch to the requested database using

use [contoso];

Surprisingly this also fails with:

Msg 40000, Level 16, State 1, Line 28
Unsupported Sql query.

Next it tries to determine if this is a contained database using:

if (db_id() = 1)
begin
-- contained auth is 0 when connected to master
select 0
end
else
begin
-- need dynamic sql so that we compile this query only when we know resource db is available
exec('select case when authenticating_database_id = 1 then 0 else 1 end from
sys.dm_exec_sessions where session_id = @@SPID')
end

Just about everything about this query fails unfortunately. db_id(), SELECT 0 and sys.dm_exec_sessions all produce errors.

Next it tries to check if we have access to the requested database using:

SELECT CASE WHEN has_dbaccess(N'contoso') = 1 THEN 'true' ELSE 'false' END

This gives another Unsupported Sql query error. Just to check, I simplified this to:

SELECT CASE WHEN 1 = 1 THEN 'true' ELSE 'false' END

and this also fails. Giving it an alias does let it run, but always produces NULL, so there’s definitely something broken with how it handles CASE statements.
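For reference, the aliased version that runs (but wrongly returns NULL) is:

```sql
SELECT CASE WHEN 1 = 1 THEN 'true' ELSE 'false' END AS result
```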

Lastly there’s a couple more queries to get some more environmental information:

SELECT
user_name() AS [UserName],
(select schema_name()) AS [DefaultSchema]

SELECT
CAST(
        serverproperty(N'Servername')
       AS sysname) AS [Name],
CAST(
        ISNULL(serverproperty(N'instancename'),N'')
       AS sysname) AS [InstanceName]

Both of these queries run, but return NULL for everything except the server name. I feel it should be able to return useful values for the first query, and even if the instancename is NULL then it really should have been converted to an empty string by the ISNULL function.

Having looked through the queries that the tooling generates here, I’m not altogether surprised that something in there failed. It seems to me though that there are some key things that really need to be addressed before this is out of preview: the handling of CASE expressions, USE [dbname] statements that existing code may frequently generate, un-aliased expressions like SELECT 1 + 1, and common functions like ISNULL and COALESCE that run without error but don’t produce the expected result.

Dapper

Dapper is a much simpler system than Entity Framework. You’re still in control of the SQL that gets executed, but Dapper handles mapping the query results into your object model.

Taking my example from earlier I can easily execute:

var leads = con.Query<Lead>("SELECT leadid, firstname, lastname, companyname, emailaddress1 FROM lead");

and I get a collection of Lead objects. Of course, I first have to have defined the Lead class somewhere, and this is where Daryl LaBar’s Early Bound Generator XrmToolBox tool comes in to quickly generate all these classes.

Things get more complicated if I want to include a lookup value in my query. For example, if I add ownerid to my previous example:

var leads = con.Query<Lead>("SELECT leadid, firstname, lastname, companyname, emailaddress1, ownerid FROM lead");

This fails with the error Error parsing column 6 (ownerid=<guid>). This is because the ownerid property on the generated class is an EntityReference, which Dapper can’t create from the Guid that I’m getting back from SQL.
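Dapper does let you teach it about custom types via a type handler. A sketch (my own naming, untested against CDS) that at least gets the Guid into an EntityReference — though note it has no way to fill in the LogicalName from the Guid alone:

```csharp
// Maps the raw Guid coming back from SQL into the EntityReference
// expected by the generated early-bound class.
class EntityReferenceTypeHandler : SqlMapper.TypeHandler<EntityReference>
{
    public override EntityReference Parse(object value)
    {
        return value == null || value is DBNull
            ? null
            : new EntityReference { Id = (Guid)value };
    }

    public override void SetValue(IDbDataParameter parameter, EntityReference value)
    {
        parameter.Value = (object)value?.Id ?? DBNull.Value;
    }
}

// Register once at startup:
// SqlMapper.AddTypeHandler(new EntityReferenceTypeHandler());
```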

I can get around this using the mapping functionality of Dapper, but this will get tedious very quickly. However, it does bring me onto a minor gripe I have with the data available from the SQL endpoint:

Polymorphic fields

The ownerid field is a polymorphic lookup – that is, it can point to more than one other type of entity. In this case it can point to a user or a team. Other examples in CDS are the customer lookups that can be either an account or a contact, and the Regarding fields on activities, amongst others.

When you use the IOrganizationService to retrieve data, you get these lookup fields back as an EntityReference object that includes the ID, type and name of the related entity.

In SQL it gets a little more complicated. According to the metadata of the table, the only column available is ownerid. If you actually run SELECT * FROM lead and see what columns you get back, you also get owneriddsc, owneridname, owneridtype and owneridyominame.

What does it all mean?

I had to do some digging to work out what the dsc field was. It seems that this relates to a feature from CRM 4.0 which allowed you to “soft-delete” a record, and this would contain the “deletion state code” (hence DSC) of the related record. This feature was dropped in CRM 2011, so quite why it’s exposed in a brand new feature in 2020 I’m not sure.

The name and yominame fields give us the display name of the related record without having to join to the target table. This can be helpful, and matches with the EntityReference.Name property we would get from the IOrganizationService.

The one I have a problem with is owneridtype. This gives us not the entity type name as EntityReference.LogicalName would do, but the object type code. In this case I see 8, which maps to the User entity.

The object type code is not nice to use. The documentation explicitly says not to use it, and to use the schema name in preference. Unless it’s a number I’ve seen plenty of times before I’ve got to do some searching to find out what it means. Any references to it in an AppSource submission would be rejected.

So, pretty please, can this change to give the logical name instead before this is out of preview?

Can it be easier to use?

With the Dapper & early bound model, it’s painful to consume these lookup fields as they need manually mapping.

What would make life a lot easier would be if the early bound class had separate properties for lookup field’s ID and Name properties. Something like:

class LeadSql : Lead
{
  public new Guid? OwnerId
  {
    get => base.OwnerId?.Id;
    set
    {
      if (value == null)
      {
        base.OwnerId = null;
        return;
      }

      if (base.OwnerId == null)
        base.OwnerId = new EntityReference();

      base.OwnerId.Id = value.Value;
    }
  }

  public string OwnerIdType
  {
    get => base.OwnerId?.LogicalName;
    set
    {
      if (value == null)
      {
        base.OwnerId = null;
        return;
      }

      if (base.OwnerId == null)
        base.OwnerId = new EntityReference();

      base.OwnerId.LogicalName = value;
    }
  }
}

CDS T-SQL endpoint pt 4 – Performance

A common question I’ve seen so far about the new CDS T-SQL endpoint is how it compares performance-wise to using FetchXML. After all, FetchXML is “native” to CDS while T-SQL might be viewed as the johnny-come-lately compatibility layer, so it’s going to be slower, right?

To check it out I set up a simple test. I’ve loaded about 60,000 lead records and used SQL and FetchXML to retrieve them.

The (abbreviated) test code looks like this:

private void SQL()
{
  using (var con = new SqlConnection("<connectionstring>"))
  {
    con.Open();

    using (var cmd = con.CreateCommand())
    {
      cmd.CommandText = "SELECT leadid, firstname, lastname, companyname, emailaddress1 FROM lead";

      using (var reader = cmd.ExecuteReader())
      {
        while (reader.Read())
        {
          var leadId = reader.GetGuid(0);
          var firstName = reader.GetString(1);
          var lastName = reader.GetString(2);
          var companyName = reader.GetString(3);
          var emailAddress = reader.GetString(4);
        }
      }
    }
  }
}

private void FetchXML()
{
  using (var org = new OrganizationServiceProxy(...))
  {
    var fetch = @"
      <fetch page='1' paging-cookie=''>
        <entity name='lead'>
          <attribute name='leadid' />
          <attribute name='firstname' />
          <attribute name='lastname' />
          <attribute name='companyname' />
          <attribute name='emailaddress1' />
        </entity>
      </fetch>
    ";
    var xml = new XmlDocument();
    xml.LoadXml(fetch);
    var page = 1;

    while (true)
    {
      var result = org.RetrieveMultiple(new FetchExpression(xml.OuterXml));

      if (!result.MoreRecords)
        break;

      xml.SelectSingleNode("/fetch/@page").Value = (++page).ToString();
      xml.SelectSingleNode("/fetch/@paging-cookie").Value = result.PagingCookie;
    }
  }
}

I then ran each of these once, untimed, to allow the .NET runtime to do its initialisation thing so the results weren’t skewed by one-time startup costs.

Next I ran each one 10 times, timing each one, and took the average. The results are very encouraging:

Method     Avg. Time
SQL        5.758 sec
FetchXML   8.358 sec

So in this simple test, SQL comes out about 1.5x faster!

Now this is not quite fair, as the SqlConnection is automatically pooled so we don’t actually have any overhead in connecting to SQL on each attempt, but with the FetchXML test we had to open and close the connection each time. If we move the opening of each connection out of the tests so we do those once and only measure the actual query execution and result retrieval time, the results get a lot closer:

Method     Avg. Time
SQL        5.026 sec
FetchXML   8.129 sec

Still an impressive performance for the rookie!

Don’t put too much store in these exact numbers – I’m running these tests from the UK and my test instance is in Canada, which is going to add a lot of network overhead – but I’d expect the ratio between the two versions to be pretty consistent.

CDS T-SQL endpoint pt 3 – SqlClient

As a developer, the first thing I want to do with any new tool is try and integrate or extend it with my own code. This is always one of the real benefits of CDS and the Power Platform for me, as it’s so much more open to extensibility than others I could mention. Adding a T-SQL endpoint should just make this much wider.

The demos I’ve seen so far for this have all been around using SQL Server Management Studio. That’s great for ad-hoc queries, but what about if I want to do some regular data export/integration? If I was talking to any other SQL Server database I’d probably drop to C# code and connect to it through there, so I’ve given that a go here.

Connecting with SqlClient

You can use either System.Data.SqlClient or the newer Microsoft.Data.SqlClient to connect – either works just fine. The connection string you’ll need is in the format:

server=contoso.crm.dynamics.com,5558;user id=your-username;password=your-password;Authentication="Active Directory Password"

Note the Authentication parameter at the end to use the correct authentication method rather than the standard SQL Server authentication.

This is a quick code snippet to export the names of your account records:

using (var con = new SqlConnection("server=contoso.crm.dynamics.com,5558;user id=your-username;password=your-password;Authentication=\"Active Directory Password\""))
{
  con.Open();

  using (var cmd = con.CreateCommand())
  {
    cmd.CommandText = "SELECT name FROM account";

    using (var reader = cmd.ExecuteReader())
    {
      while (reader.Read())
      {
        Console.WriteLine(reader.GetString(0));
      }
    }
  }
}

Note that you don’t need to specify a database name in the connection string – you’ll only see one database when you connect and you’re automatically connected to that database so there’s no need to change. There’s no harm in including it though, so if you use

server=contoso.crm.dynamics.com,5558;user id=your-username;password=your-password;Authentication="Active Directory Password";database=contoso

it will still work just fine.

Broken Connection Strings

I did try a few other variants of connection strings but I only found one that didn’t seem to work so well:

server=tcp:contoso.crm.dynamics.com,5558;user id=your-username;password=your-password;Authentication="Active Directory Password"

Using the tcp: prefix on the server name is quite a common pattern in samples, but in this case you’ll get the error:

A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)

Application User

Having the username & password in the connection string is A Bad Thing. In most integration scenarios you should probably be using an application user instead. Azure SQL Database, and now CDS, supports this via the AccessToken property on the connection, using code like:

var clientId = "<guid>";
var clientSecret = "<secret string>";
var aadTenantId = "<guid>";

var aadInstance = "https://login.windows.net/{0}";
var resourceId = "https://contoso.crm.dynamics.com/";

var authenticationContext = new AuthenticationContext(string.Format(aadInstance, aadTenantId));
var clientCredential = new ClientCredential(clientId, clientSecret);

var authenticationResult = await authenticationContext.AcquireTokenAsync(resourceId, clientCredential);

using (var con = new SqlConnection("server=contoso.crm.dynamics.com,5558"))
{
  con.AccessToken = authenticationResult.AccessToken;
  con.Open();

  using (var cmd = con.CreateCommand())
  {
    cmd.CommandText = "SELECT name FROM account";

    using (var reader = cmd.ExecuteReader())
    {
      while (reader.Read())
      {
        Console.WriteLine(reader.GetString(0));
      }
    }
  }
}

A few things to note here, especially if you’re copying code from an Azure SQL Database sample.

First, the AuthenticationContext class gets the access token asynchronously. Please don’t just use var authenticationResult = authenticationContext.AcquireTokenAsync(ResourceId, clientCredential).Result; as you’ll almost certainly end up with deadlocks at some point.

Second, all the Azure SQL Database sample code will use https://database.windows.net/ as the resource ID that’s passed to AcquireTokenAsync. This gives you a token that’s valid for that resource. If you use that to try to access CDS you’ll get the error:

The HTTP request is unauthorized with client authentication scheme 'Anonymous'. The authentication header received from the server was 'Bearer authorization_uri=https://login.microsoftonline.com/<guid>/oauth2/authorize, resource_id=https://contoso.crm.dynamics.com/

This is because you’re not actually accessing Azure SQL Database, but CDS pretending to be a SQL database. You therefore need to change the resource ID to be the URL of your CDS instance (https://contoso.crm.dynamics.com/ in this sample) and it should all work smoothly.

It does take some time to get the access token, so if you’re calling this code regularly make sure you cache the tokens (taking care to watch out for token expiry – the tokens are only valid for 1 hour so you’ll need to periodically get a new one).
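A minimal sketch of that caching (the helper name is mine; AuthenticationResult.ExpiresOn tells you when the token stops being valid):

```csharp
// Re-use the cached token until shortly before it expires,
// then acquire a fresh one.
private static AuthenticationResult _token;

private static async Task<string> GetAccessTokenAsync(
    AuthenticationContext context, ClientCredential credential, string resourceId)
{
    if (_token == null || _token.ExpiresOn <= DateTimeOffset.UtcNow.AddMinutes(5))
        _token = await context.AcquireTokenAsync(resourceId, credential);

    return _token.AccessToken;
}
```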

Once you’ve got connected you can use the connection in the same way as any other SqlConnection – because that’s exactly what it is!

CDS T-SQL endpoint pt 2 – First Thoughts

Let me just say, this thing is going to be great. It’s the biggest advancement in the platform for some years, and I don’t think any competitor will come close.

However, it is a preview feature, and it does have some issues. Hopefully some of these at least will get looked at before it goes GA.

Supported Syntax

I expected simple SELECT col1, ..., coln FROM table1 INNER JOIN table2 ON ... to work, but I didn’t expect to be able to use PIVOT, APPLY, sub-queries etc which do all seem to work as expected 😀

In fact, the only query syntax I regularly use that I’ve spotted not being supported yet is CTEs. This means that querying hierarchical data is still easier in FetchXML (for now), but for other ad-hoc queries then SQL seems the obvious way to go for me.

You need aliases!

You can run this query quite happily:

SELECT *
FROM   account

This works perfectly too:

SELECT *
FROM   account AS a
       INNER JOIN
       contact AS c 
       ON a.primarycontactid = c.contactid

However, this perfectly valid SQL doesn’t:

SELECT *
FROM   account
       INNER JOIN
       contact
       ON account.primarycontactid = contact.contactid

In this case you get the error:

Msg 40000, Level 16, State 1, Line 2
The multi-part identifier "account.primarycontactid" could not be bound.
The multi-part identifier "contact.contactid" could not be bound.

It appears that, whenever you use a join, the tables MUST be aliased. Not the end of the world, but an easy gotcha that could confuse new users.

Missing tables & fields

Some data is not accessible through this endpoint. My best guess is that it’s only the data that’s stored in SQL that’s available, as the ones I’ve spotted that are missing are stored elsewhere now, e.g.

  • annotation.documentbody
  • activitymimeattachment.body
  • audit
  • plugintracelog

Even more confusingly, the documentbody and body fields appear in the results if you run SELECT * FROM annotation or SELECT * FROM activitymimeattachment respectively (although they are always null), but they do not appear in the metadata for the tables.

Ideally this data would be available too, but if not I’d like some consistency between the actual results and the metadata.

OAuth timeouts

I’m not sure if this is a general problem with Active Directory authentication for SQL Server or something specific to this implementation, but once your connection has been open for an hour any more queries will receive the error:

Msg 40000, Level 16, State 1, Line 7
The OAuth token has expired

At this point you need to force it to reconnect (in SSMS, clicking “New Query” appears to be sufficient) and carry on.

Given that this is (presumably) not running direct SQL commands, I’ve been pleasantly surprised by the range of SQL that it implements.

CDS T-SQL Endpoint pt 1 – Connecting

The biggest announcement of MBAS for me by a long way was the new T-SQL endpoint. For those who missed it, go and check out the recordings.

In short, Microsoft have added a (preview) way of querying your CDS data using T-SQL, which anyone familiar with SQL Server will already know. It was originally done to support live querying from Power BI, but because they used the same TDS protocol that SQL Server uses, immediately the entire ecosystem of SQL Server tools is opened up. So you can now use SQL Server Management Studio amongst others to query your CDS / Dynamics 365 data. Cool!

I can’t deny I felt a bit put-out by this announcement coming only a few days after I’d released the latest version of my own SQL 4 CDS tool, but I’m sure I’ll get over it eventually…

Connecting

Obviously the first thing I did was carefully read the documentation 😜 before moving on to try and connect to my own instances.

The key bit of information you need when connecting is the server name (the same as you see in the address bar of your browser when you’re in your app) and the port number 5558. Put these together with a comma, select the Active Directory - Password authentication option and you should be away.

The result was a bit of a let-down:

TDS protocol endpoint is disabled for this organization

I assumed at this point that this new preview feature just wasn’t available for me yet. A few people suggested creating new instances in the Canada (crm3) region as they seem to get new features earlier, but I got the same error again and eventually gave up for a few days.

Update 2020-05-24: This error message has been updated to give some more help on fixing it now:

Enabling the TDS protocol endpoint

2020-06-25: This process is now much easier, see below!

Andrew Bibby then helpfully pointed me to some more documentation on how to enable the new endpoint. Unfortunately, even when I got the correct command line command to run it still didn’t work. Instead it gave me the error:

Error occurred in OrgDBOrgSettings and the error details are GDS resource provider is unable to get instances for tenantId: '<guid>', response status:'NotFound' and reason 'Not Found'

Instead I ended up actually using SQL 4 CDS to update the orgdborgsettings attribute of my organization entity:

select orgdborgsettings, organizationid from organization

I then added <EnableTDSEndpoint>true</EnableTDSEndpoint> at the end of the list of settings and updated it using:

update organization set orgdborgsettings = '<OrgSettings><IsCommandingModifiedOnEnabled>true</IsCommandingModifiedOnEnabled><EnableActivitiesTimeLinePerfImprovement>1</EnableActivitiesTimeLinePerfImprovement><EnableActivitiesFeatures>1</EnableActivitiesFeatures><CanCreateApplicationStubUser>false</CanCreateApplicationStubUser><AllowRoleAssignmentOnDisabledUsers>false</AllowRoleAssignmentOnDisabledUsers><EnableTDSEndpoint>true</EnableTDSEndpoint></OrgSettings>' where organizationid = '<guid>'

I could then immediately connect using SQL Server Management Studio, so now I’m off to play with it! More updates to come…

Update 2020-05-24

The official instructions appear to work for other people, but not for me. Hopefully you’ll have more luck than I did, but I’ve also included an update in SQL 4 CDS 2.1.0 to simplify this process too.

Update 2020-06-25

You can now enable the new endpoint in the Power Platform Admin Center. Navigate to your environment and click Settings in the ribbon. Under the Product section, click Features to see the page below:

In the bottom right, change “Enable TDS endpoint” to “On” and click Save.

D365UG UK: Data Integrity & Quality – Tips & Tools

A bit of a mouthful of a title! I had a great time earlier today presenting at my first ever virtual event, D365UG UK. Many thanks to the organising team for a smoothly run event!

My topic today was using a variety of free tools to unearth some data quality issues in your CRM installation that you may be unaware of, and some suggestions on common causes and possible resolutions.

The tools I used were:

  • Advanced Find
  • FetchXML Builder
  • SQL 4 CDS
  • Data Export Validation Tool

Those last three are all free XrmToolBox tools, so head over to the tool store to install them today.

The slides are available on the D365UG UK forum, but I also promised to share the script that you can run in SQL 4 CDS to check many of the common scenarios we looked at. Many of these could also be run in either Advanced Find or FetchXML Builder if you prefer.

-- Accounts assigned to inactive users
select a.name
from
    account a
    inner join systemuser u on a.owninguser = u.systemuserid
where
    a.statecode = 0 and
    u.isdisabled = 1

-- Accounts with a disabled primary contact
select a.name
from
    account a
    inner join contact c on a.primarycontactid = c.contactid
where
    a.statecode = 0 and
    c.statecode = 1

-- Accounts with a primary contact that isn't part of the account
select a.name
from
    account a
    inner join contact c on a.primarycontactid = c.contactid
where
    a.statecode = 0 and
    (c.parentcustomerid is null or c.parentcustomerid <> a.accountid)

-- Contacts with the same firstname and lastname
select contactid, firstname, lastname
from
    contact
where
    statecode = 0 and
    firstname = lastname

-- Contacts with the same lastname and email address
select contactid, firstname, lastname
from
    contact
where
    statecode = 0 and
    lastname = emailaddress1
    
-- Common firstnames
select firstname, count(*)
from
    contact
where
    statecode = 0 and
    firstname is not null
group by
    firstname
order by
    count(*) desc
    
-- Common lastnames
select lastname, count(*)
from
    contact
where
    statecode = 0 and
    lastname is not null
group by
    lastname
order by
    count(*) desc
    
-- Contacts with a parent contact
select c1.contactid, c1.firstname, c1.lastname
from
    contact c1
    inner join contact parent on c1.parentcustomerid = parent.contactid
where
    c1.statecode = 0

-- Orphaned contacts
select contactid, firstname, lastname
from
    contact
where
    statecode = 0 and
    parentcustomerid is null

-- Active contacts in inactive accounts
select contactid, firstname, lastname, account.name
from
    contact
    inner join account on contact.parentcustomerid = account.accountid
where
    contact.statecode = 0 and
    account.statecode = 1

SQL 4 CDS 2.0.0 released!

I’m very pleased to release version 2.0.0 of my SQL 4 CDS tool today!

If you haven’t come across it before, SQL 4 CDS is an XrmToolBox tool to query and manipulate your CDS data using regular SQL queries, making it accessible to a much wider range of users. You can install it today from the XrmToolBox Tool Library.

It takes quite a lot to jump from a version 1.0.x to a version 2, so what’s new?

Beyond FetchXML

Version 1 translated your SQL query into FetchXML and executed it, which made it easier to run the queries but didn’t give you the full power of SQL. There were still plenty of queries you couldn’t run because there was no FetchXML equivalent.

While still not supporting every bit of T-SQL (that would be near-enough impossible!), this release brings it a lot closer. While it still uses FetchXML as much as possible, it can now also transform those results to implement more SQL queries than before.

For example, you can now run queries that compare two field values, execute common functions such as DATEDIFF and DATEADD, use a HAVING clause, sort data with linked tables without having to worry about the order of your joins and much more.

Some example queries that you can run now that you can’t do with FetchXML alone:

Find accounts with more than 10 contacts

This query would not normally be possible as FetchXML does not have an equivalent for the HAVING clause, but SQL 4 CDS 2 will handle it:

SELECT   a.name
FROM     account AS a
         INNER JOIN
         contact AS c
         ON a.accountid = c.parentcustomerid
GROUP BY a.name
HAVING   count(*) > 10

Find accounts that haven’t been modified since being created

FetchXML doesn’t have an option for comparing the values of two fields. SQL 4 CDS 2 handles this again, but watch out when running this on large data sets as it’s got to retrieve all the accounts in order to filter them itself.

SELECT name
FROM   account
WHERE  createdon = modifiedon

Find contacts created over a year after their account

A variation on the one above: you can now also use more complex expressions such as functions, mathematical operations etc. as part of your query too.

SELECT a.name,
       c.fullname
FROM   account AS a
       INNER JOIN
       contact AS c
       ON a.accountid = c.parentcustomerid
WHERE  c.createdon > DATEADD(year, 1, a.createdon)
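The same filter could also be expressed with DATEDIFF — a sketch assuming DATEDIFF behaves as in T-SQL; note that DATEDIFF(year, …) counts calendar year boundaries rather than full years, so comparing day counts is a closer match to "over a year after":

```sql
-- Contacts created over a year (365 days) after their account
SELECT a.name,
       c.fullname
FROM   account AS a
       INNER JOIN
       contact AS c
       ON a.accountid = c.parentcustomerid
WHERE  DATEDIFF(day, a.createdon, c.createdon) > 365
```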

Increment the score for all contacts called Mark

These same improvements can also be used as part of assignments in UPDATE statements, e.g.:

UPDATE contact
SET    cdi_score = cdi_score + 1
WHERE  firstname = 'Mark'

Sort by related tables

FetchXML imposes a restriction on sort orders. They have to be applied at the top-level entity first and then work down through joined tables. You can now apply sorts in whatever order you like:

SELECT a.name,
       c.fullname
FROM   account AS a
       INNER JOIN
       contact AS c
       ON a.accountid = c.parentcustomerid
ORDER BY
       c.fullname,
       a.name

Notifications

If you run a query that can’t be executed fully as FetchXML, you’ll see a warning like this:

“Extra processing required” notification

Be sure not to simply copy & paste this FetchXML into your own code. Running it directly will give you all the information you need to calculate the results, but you will need to do some post-processing on them yourself.

Intellisense

This release also includes Intellisense-style suggestions as you type. There are a few caveats to be aware of: as you type a SELECT statement, the tool won’t offer suggestions until you start entering the FROM clause. If you want suggestions in the SELECT clause itself, just enter SELECT * FROM, complete the FROM clause and then go back.

The metadata used to drive the suggestions is loaded in the background, so you may not get suggestions immediately. Give it a little time after entering the FROM clause and you should start seeing suggestions appear.

I’ve tried to get the suggestions to appear when they feel natural to me. I’m interested in any feedback on this feature and how you’d like to see it work.

Managing many-to-many relationships

You can already use INSERT, UPDATE and DELETE to edit your data, but this release improves on that to cover the intersect entities used in many-to-many relationships. For example, you can use this query to add all your contacts called Mark to a specific marketing list:

INSERT INTO listmember (listid, entityid)
SELECT '5610d244-aac2-4fa8-9b2f-56537c80bad0',
       contactid
FROM   contact
WHERE  firstname = 'Mark'
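Since this support covers the intersect entities generally, removing members should work the same way with DELETE — a sketch I haven’t verified against a real environment:

```sql
-- Remove all members from the same marketing list
DELETE FROM listmember
WHERE  listid = '5610d244-aac2-4fa8-9b2f-56537c80bad0'
```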

Managing polymorphic lookups

If you INSERT or UPDATE a lookup field you can use the guid of the related record, but this only works for lookups that can reference a single entity type. For example, this would work:

UPDATE account
SET    primarycontactid = 'e3dd3640-9890-4371-b1cb-fb1b26f20ff3'
WHERE  name = 'Data8 Ltd'

However, this would not work as the parentcustomerid field can reference both account and contact records, and there’s nothing to indicate which type is being referenced here:

UPDATE contact
SET    parentcustomerid = '65a7e7f7-0f09-4649-9182-56abc489a1e2'
WHERE  fullname = 'Mark Carrington'

This release introduces the CREATELOOKUP function which can be used to solve this:

UPDATE contact
SET    parentcustomerid = CREATELOOKUP('account', '65a7e7f7-0f09-4649-9182-56abc489a1e2')
WHERE  fullname = 'Mark Carrington'

Inside FetchXML pt 10 – link-entity (again)

A follow-up on my previous post on link-entity, prompted by this tweet from Daryl LaBar:

I’d previously stated (incorrectly):

FetchXML supports two link types – inner and outer.

Me, in more ignorant times

While these are the most common link types – and the only two that FetchXML Builder exposes – the platform has supported a third for some time, and now appears to have several more too!

TL;DR

There are some tantalising hints at new join types, but you probably shouldn’t be jumping into using them unless Microsoft support advises you to just yet!

Natural Join

The documentation for the natural join type is not exactly clear to me:

Only one value of the two joined attributes is returned if an equal-join operation is performed and the two values are identical

docs.microsoft.com

I really have no idea what this is trying to say – if you can enlighten me, please do! Given that all the joins it does are based on the attributes being equal, and therefore the two values must be identical, I can’t see what this is trying to describe.

As I can’t tell from the docs what it is trying to do, I’ve tried to work it out by running some queries and having a look at the generated SQL (thank you on-premise!) and the results.

My conclusion: natural is a synonym for inner.

If you substitute natural for inner, you will get exactly the same results, as it generates exactly the same SQL.
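For example, swapping the link type in a simple query like this left both the generated SQL and the results unchanged in my tests:

```xml
<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" from="parentcustomerid" to="accountid" link-type="natural">
      <attribute name="lastname" />
    </link-entity>
  </entity>
</fetch>
```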

Many to many relationships

One thing I did wonder based on the docs for the natural join type was whether it would try to imply the join attributes by name, i.e. you could miss out the from and to attributes on the link-entity if they were both the same and it would imply them from the metadata. This failed, but it did produce this intriguing error message:

No system many-to-many relationship exists between account and contact.  If attempting to link through a custom many-to-many relationship ensure that you provide the from and to attributes.

This seemed interesting, so I gave it a go with this simple query:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="list" link-type="inner">
      <attribute name="listname" />
    </link-entity>
  </entity>
</fetch>

And it worked! This is a simple shortcut for navigating built-in many-to-many relationships without having to use the more explicit version of joining via the intersect entity:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="listmember" to="accountid" from="entityid" alias="listmember" link-type="inner">
      <link-entity name="list" to="listid" from="listid" alias="list" link-type="inner">
        <attribute name="listname" />
      </link-entity>
    </link-entity>
  </entity>
</fetch>

New Join Types

Anyway, onto the original challenge. The docs now list a total of 10 different join types:

  • Inner
  • LeftOuter
  • Natural
  • MatchFirstRowUsingCrossApply
  • In
  • Exists
  • Any
  • NotAny
  • All
  • NotAll

As there’s no documentation on any of these I headed over to the simple but trusty FetchXml Tester XrmToolBox tool and tried one at random:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" to="accountid" from="parentcustomerid" link-type="notall">
    </link-entity>
  </entity>
</fetch>

This gave the remarkably helpful error message:

An error occured: Invalid link-type specified, valid values are: 'natural', 'inner', 'in', 'matchfirstrowusingcrossapply','exists' and 'outer'. link-type = notall

So maybe CDS doesn’t support all 10 join types yet, but it does appear to support some we didn’t have before:

  • In
  • MatchFirstRowUsingCrossApply
  • Exists

I also tried the same thing on an on-premise installation, and that only supports the original 3: inner, outer and natural, so this is online-only for now at least.

So, in the absence of any documentation on these, or the ability to see what SQL gets generated for them, what are we to do? Well, these all seem to have directly equivalent SQL keywords, so starting from a simple query:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" from="parentcustomerid" to="accountid" link-type="inner">
      <attribute name="lastname" />
      <filter>
        <condition attribute="firstname" operator="eq" value="Mark" />
      </filter>
    </link-entity>
  </entity>
</fetch>

My guess for the generated SQL when changing the link-type to different options would be:

Inner

SELECT name,
       lastname
FROM   account
       INNER JOIN
       contact
       ON
       account.accountid = contact.parentcustomerid
WHERE  contact.firstname = 'Mark'

In

SELECT name
FROM   account
WHERE  accountid IN (
  SELECT parentcustomerid
  FROM   contact
  WHERE  firstname = 'Mark'
)

MatchFirstRowUsingCrossApply

SELECT name,
       contact.lastname
FROM   account
       CROSS APPLY (
         SELECT TOP 1 lastname
         FROM         contact
         WHERE firstname = 'Mark' AND
               parentcustomerid = account.accountid
       ) AS contact

Exists

SELECT name
FROM   account
WHERE  EXISTS (
  SELECT *
  FROM   contact
  WHERE  firstname = 'Mark' AND
         parentcustomerid = account.accountid
)

This is just conjecture at this point, but if correct it would mean:

  1. Each of the new join types would only produce one row per account that has a contact called Mark, regardless of whether there was one such contact or 100. The standard inner join type however would produce one row for each contact
  2. The in and exists versions would not be able to include any attributes from the linked entity (lastname in this example), as it is only referenced from the WHERE clause, but the inner and matchfirstrowusingcrossapply would be able to

So, let’s give it a go!

I’ve created two accounts, one with no contacts called Mark and one with two:

And the results?

Inner Join

No surprise here, but good for a simple baseline:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" to="accountid" from="parentcustomerid"  link-type="inner">
      <attribute name="lastname" />
      <filter>
        <condition attribute="firstname" operator="eq" value="Mark" />
      </filter>
    </link-entity>
    <filter>
      <condition attribute="name" operator="like" value="Join Operator %" />
    </filter>
  </entity>
</fetch>
name             lastname
Join Operator 2  Carrington
Join Operator 2  Duplicate

In

Executing the exact same query but replacing inner with in gave the error:

An error occured: In doesn't support attribute inside linkentity expression.

Removing the <attribute> from inside the <link-entity> gave the expected result:

name
Join Operator 2
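For reference, the working version of the query — the inner join query from above with the link type changed to in and the <attribute> removed — looks like this:

```xml
<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" to="accountid" from="parentcustomerid" link-type="in">
      <filter>
        <condition attribute="firstname" operator="eq" value="Mark" />
      </filter>
    </link-entity>
    <filter>
      <condition attribute="name" operator="like" value="Join Operator %" />
    </filter>
  </entity>
</fetch>
```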

So far, so good!

Exists

Going through the same process with exists, again we get the same error about not being able to use an <attribute> within the <link-entity>. Removing that gives us the same results again!

MatchFirstRowUsingCrossApply

Trying again with matchfirstrowusingcrossapply, and it almost works as I expected. I can run the query with the <attribute name="lastname" /> element inside the <link-entity> without an error, but I don’t actually get that column back in the result set – I get the same results again as I did with in and exists!

Given how close the other predictions were, I’m going to chalk this up overall as a success, with the missing column in the results possibly being due to a CDS bug that will hopefully be fixed at some point. As always though, I’d be very happy to get any firm information on it one way or another!

Why?

So now we have a reasonable idea of what these new join operators are doing, why might we want to use them? We’ve managed quite happily with inner and outer until now, thank you very much.

Uniqueness

The main thing these operators give us is a way around the duplication problem. We saw earlier that we can use distinct to remove duplicate rows in query results that were introduced by joining to multiple child records, but that also removed the automatic primary key from the results. Because these operators do not produce duplicate rows regardless of the number of child records, there is no need to use distinct. This was never much of a problem though, as we could always include the primary key column explicitly anyway. So what else is there?

Single Child Entity

The matchfirstrowusingcrossapply option looks the most powerful, although as I haven’t managed to actually get any attributes out of the linked entity I can’t realise that potential yet. It would be most powerful if we could combine it with a sort order, so you could get a list of accounts along with the “best” contact for each account:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" to="accountid" from="parentcustomerid"  link-type="matchfirstrowusingcrossapply">
     <attribute name="contactid" />
     <attribute name="fullname" />
     <order attribute="cdi_score" descending="true" />
    </link-entity>
  </entity>
</fetch>

That looks like it could give us a unique list of accounts, along with the details of the top contact in the account according to the ClickDimensions score. Unfortunately, running that gives us an error:

MatchFirstRowUsingCrossApply doesn't support order clause inside linkentity expression.

So far, that looks like a missed opportunity, as the corresponding SQL would work perfectly fine:

SELECT name,
       contact.contactid,
       contact.fullname
FROM   account
       CROSS APPLY (
         SELECT TOP 1 contactid,
                      fullname
         FROM         contact
         WHERE        parentcustomerid = account.accountid
         ORDER BY     cdi_score DESC
       ) AS contact
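Another way to express the same “top contact per account” logic in plain T-SQL is with a window function — just a sketch of the standard pattern, not something the platform generates:

```sql
-- Best contact per account, ranked by ClickDimensions score
SELECT name,
       contactid,
       fullname
FROM   (
  SELECT a.name,
         c.contactid,
         c.fullname,
         ROW_NUMBER() OVER (PARTITION BY a.accountid
                            ORDER BY c.cdi_score DESC) AS rn
  FROM   account AS a
         INNER JOIN
         contact AS c
         ON a.accountid = c.parentcustomerid
) AS ranked
WHERE  rn = 1
```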

Performance

So what else? How about performance? If we compare the four simple examples we looked at before, we get three quite different query plans:

Comparative performance of different join operators

Even when adding in a DISTINCT option to the simple inner join type, this takes effectively no time at all to execute. All three of the other new options take significantly longer, with the matchfirstrowusingcrossapply option taking the vast majority of the time. So there’s no quick performance win here by just turning one of these options on blindly.

My best guess then is that these options have been added following some performance analysis of other, more complex queries that Microsoft have seen in their telemetry. This could possibly be for their 1st party Dynamics 365 apps, AI features or other applications that have a particular need for generating specific queries. This would probably explain the lack of documentation as much as anything else.

As there are still some other join types listed in the documentation which the server doesn’t appear to support yet, it could also be the case of the documentation getting ahead of the code and we’ve got some more exciting platform extensions coming in a future release!

Inside FetchXML pt 9 – distinct

You can ensure your query only produces each row once using the distinct attribute. For example, if you have multiple entries for the same company name you can get the list of unique names using:

<fetch distinct="true">
  <entity name="account">
    <attribute name="name" />
  </entity>
</fetch>

This option gets applied automatically whenever you create a query using Advanced Find. This is particularly useful when you add a join onto a related child entity. For example, this query:

<fetch>
  <entity name="account">
    <attribute name="name" />
    <link-entity name="contact" from="parentcustomerid" to="accountid" link-type="inner">
      <filter>
        <condition attribute="firstname" operator="eq" value="Mark" />
      </filter>
    </link-entity>
  </entity>
</fetch>

The intent here is to get a list of accounts that have a contact called Mark. However, if an account has two Marks, that account will be in the resulting list twice. Adding the distinct attribute makes sure the user gets what they were probably expecting, which is a list of unique accounts.

Unique Identifiers

One other interesting side effect of using distinct is that you no longer get the primary key of the entity back automatically. If you run the query:

<fetch>
  <entity name="account">
    <attribute name="name" />
  </entity>
</fetch>

As well as getting the account name back, you’ll also automatically get the accountid as well.

When you apply distinct, this goes away. This makes sense if you think about it. Imagine you have two accounts with the same name. If the accountid were automatically included, that would make each record unique, and you would still get two rows back from your query regardless of whether or not you used distinct.

If the purpose of using distinct is to eliminate duplicate rows introduced by a join onto a child entity as shown above though, you probably do want a row back for each record of your top-level entity type. In that case you will need to explicitly include the primary key attribute in your query, e.g.

<fetch distinct="true">
  <entity name="account">
    <attribute name="accountid" />
    <attribute name="name" />
    <link-entity name="contact" from="parentcustomerid" to="accountid" link-type="inner">
      <filter>
        <condition attribute="firstname" operator="eq" value="Mark" />
      </filter>
    </link-entity>
  </entity>
</fetch>

Truncation

One “gotcha” to be aware of: any text values will be truncated to 2,000 characters before the distinct operation is applied. If you have longer values, e.g. descriptions or the contents of attachments in the annotation entity, any extra data after the first 2,000 characters will be lost.

Quite why this would be is a bit of a mystery. My best guess is that these fields used to be stored in a SQL Server ntext field which couldn’t be used in a SELECT DISTINCT query, so to work around this problem Dynamics CRM would cast these values to an nvarchar(2000) value which could be used instead. You can see this in the generated SQL:

<fetch distinct="true" top="10">
  <entity name="annotation">
    <attribute name="documentbody" />
  </entity>
</fetch>
select 
 DISTINCT  top 10  cast( "annotation0".DocumentBody as varchar(2000) ) as "documentbody" 
from
 AnnotationBase as "annotation0"

As these columns are now stored using an nvarchar(max) field which can be used directly in a SELECT DISTINCT query I guess this behaviour is still in place for backwards compatibility reasons, although it does give about a 35% performance improvement:

SQL Equivalent

Apart from this one caveat of long text fields, this is directly equivalent to the SQL DISTINCT keyword.
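For example, the distinct query from the previous section corresponds directly to:

```sql
SELECT DISTINCT a.accountid,
                a.name
FROM   account AS a
       INNER JOIN
       contact AS c
       ON a.accountid = c.parentcustomerid
WHERE  c.firstname = 'Mark'
```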