DataOps Principle #8 - Reflect
As promised in Part 27, here are my tips on pulling SharePoint version history into Power BI. More specifically, this was the expectation:
Expectation #2 - SharePoint has version history so I want to be able to run analytics in Power BI leveraging when list items change. For example, I'd like to know how long it takes for a SharePoint task to change from new to assigned.
Background
For those who don’t know, a SharePoint list (or document library) can be configured to store prior versions of each list item or document. In fact, if you use SharePoint Online and create a list, this feature is enabled by default with a limit of 500 versions. At the time of this writing, the maximum number of versions allowed under this feature is 50,000.
From an analytics perspective this means you could evaluate the changes in a list item’s column values to do a time series analysis. This analysis could be looking for bottlenecks in processes or calculating metrics for how quickly a task is completed.
For example, if you have an Issue Tracker list, the version history interface shows the timestamps when the Status of an issue changed from New to In progress (see example in Figure 1). But that’s for one issue. What if you wanted to know how many hours it takes, on average, for issues to move from New to In progress? That’s where Power BI can help.
Figure 1: Example version history of a SharePoint list item
The Challenge
As of May 2023, this type of analytics in Power BI can be a challenge for two reasons:
Learning Curve - There is no native connector that retrieves the version history of a SharePoint list. Thankfully, there is an API endpoint that returns the version history for a SharePoint list item, but you’re still left learning and using Web.Contents. Web.Contents is a great function, but not one that is easily used in a low-code fashion.
Text Limit - A single version of a list item could easily exceed the text limit of 32,766 characters, and text-based data over this limit is truncated in a Power BI dataset. In addition, what if you wanted to keep versions of the data in their raw format, so you could answer questions or test hypotheses that haven’t been considered yet?
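To make the Learning Curve point concrete, here is a small Python sketch of the REST endpoint that Web.Contents ultimately wraps. This is illustration only, not part of the solution (which uses Power Query); the site URL, list title, and item ID are placeholders.

```python
# Sketch of the SharePoint REST call that Web.Contents wraps in the
# Power Query solution. Site URL and list title below are placeholders.
def versions_url(site_url: str, list_title: str, item_id: int) -> str:
    """Build the REST endpoint that returns all versions of one list item."""
    return (f"{site_url}/_api/web/lists/getbytitle('{list_title}')"
            f"/items({item_id})/versions")

# In Power Query this would be passed to Web.Contents with an
# Accept: application/json header; here we just show the URL shape.
print(versions_url("https://contoso.sharepoint.com/sites/Tracker",
                   "Issue Tracker", 42))
```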
Potential Solution
To overcome these challenges, I’ve put together a solution on GitHub at this link. It combines the lessons learned from power-query-sharepoint-faster-easier with the capabilities of Power BI dataflows (within a Premium or Premium-Per-User workspace) to store SharePoint list item versions for analytic needs.
Table 1 outlines how each challenge is addressed with this solution.
| Challenge | Solution |
|---|---|
| Learning Curve | Leverage the techniques from “A Faster (Easier?) way to import SharePoint list data into Power BI” to make generating the OData queries a low-code (easier) experience. |
| Text Limit | With Power BI dataflows, we can store data in text columns well over 32,766 characters. I could not find the theoretical limit documented, but I tested text values of close to 2 million characters with no truncation in a Power BI dataflow. |
Table 1 - Overcoming the challenges with ingesting SharePoint version history into Power BI.
The solution uses the Medallion architecture, which separates the responsibilities of each Power BI dataflow to support consistency, flexibility, and durability of the data. The Bronze layer ingests the raw data, which in this solution’s case is the raw JSON for each version of a SharePoint list item. The Silver layer links to the output of the Bronze layer and parses, cleans, and pivots the JSON. The Gold layer links to the Silver layer’s output and provides the data ready to be used in a Power BI dataset. Going back to the example earlier in this article, the Gold layer would provide a table with a row for each list item indicating how long its status took to change from New to In progress.
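To illustrate the kind of calculation the Gold layer performs, here is a minimal Python sketch (the actual solution uses Power Query; the item IDs, statuses, and timestamps below are made up for illustration):

```python
from datetime import datetime

# Hypothetical version histories after the Silver layer has parsed the JSON:
# for each list item, a list of (status, modified-timestamp) pairs.
versions = {
    101: [("New", "2023-05-01T09:00:00"), ("In progress", "2023-05-01T13:00:00")],
    102: [("New", "2023-05-02T08:00:00"), ("Assigned", "2023-05-02T10:00:00"),
          ("In progress", "2023-05-02T18:00:00")],
}

def hours_to_in_progress(history):
    """Hours from the New version to the In progress version, or None."""
    ts = {status: datetime.fromisoformat(t) for status, t in history}
    if "New" in ts and "In progress" in ts:
        return (ts["In progress"] - ts["New"]).total_seconds() / 3600
    return None

durations = {item: hours_to_in_progress(h) for item, h in versions.items()}
valid = [d for d in durations.values() if d is not None]
avg_hours = sum(valid) / len(valid)  # average across items with both statuses
print(durations, avg_hours)
```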
As described in Figure 2, the solution provides the code to build the Bronze layer, along with instructions to get started on your Silver layer. After that, it’s up to you to finish building the Silver layer and complete the Gold layer.
Figure 2 - Medallion approach to ingesting SharePoint version history into a Power BI dataflow.
429 – Too Many Requests (a.k.a., throttling)
For large lists (greater than 5,000 items), I’ve seen dataflow refreshes get throttled by the SharePoint API, as indicated by an HTTP status code of 429. When this occurs, the refresh fails. Unfortunately, this solution cannot natively avoid the issue because the Web.Contents capability known as ManualStatusHandling is only available for anonymous connections (and we need to use OAuth2 for SharePoint).
My hope is that upcoming releases of Power Query Online will alleviate this; in the meantime, here are two design-pattern options to work around 429 errors.
1) Option 1 - Parallelize the work – By default, the ID Filter in the dataflow only retrieves version history for the first 20 items (see screenshot below). You could clone the dataflow into increments of approximately 5,000 items each to collect version history. For example, one dataflow would have the ID Filter “ID le 5000” and another would have the ID Filter “ID gt 5000 and ID le 10000”. Just be sure to stagger the refresh times so that the dataflows run one after another.
Figure 3 - ID Filter limits the list items you retrieve.
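Generating those staggered ID Filter expressions can be sketched as follows. This is a hypothetical helper, not part of the solution; it uses inclusive upper bounds (`le`) so no ID falls between batches.

```python
def id_filters(max_id: int, batch: int = 5000) -> list:
    """Build one OData ID Filter string per cloned dataflow,
    covering IDs 1..max_id in batches of `batch` items."""
    filters = []
    lo = 0
    while lo < max_id:
        hi = lo + batch
        # First batch has no lower bound; later batches exclude prior IDs.
        filters.append(f"ID le {hi}" if lo == 0 else f"ID gt {lo} and ID le {hi}")
        lo = hi
    return filters

print(id_filters(12000))
```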
2) Option 2 - Incremental Refresh – Normally I would reach for this option first with SQL connections, but calling the SharePoint API is different. With incremental refresh turned on, leveraging the Modified column (like the screenshot below), Power Query may issue numerous API calls to SharePoint to create the storage partitions and thus get throttled. I find this approach works well when you set up the dataflow early in a SharePoint list’s existence (fewer than 1,000 entries). The partitions get established, and incremental refresh then keeps the number of API calls to SharePoint low, provided you apply an appropriate “Refresh rows from the past” setting (e.g., 2-3 days).
Figure 4 - Incremental Refresh window.
Feedback
Please give the solution a try, and I’d love to hear your thoughts on this subject.
Finally, if you’re going to be at the 365 EduCon in Washington D.C., June 12th-16th 2023, I’ll be presenting on Wednesday and Friday. I hope to see you there!