Power BI has an ever-expanding list of supported data connections, Databases, Online Services, Python Scripts, and more. At first glance, it appears that they have pretty much everything covered. There is one notable exception however, and it came to light in a recent engagement; RESTful API connectivity. ‘Not to worry though’ I hear you say, ‘PowerQuery M solves this problem for us!’. Well, yes, but it’s a bit more complicated than that. Luckily this post is here to provide spiritual and technical guidance through the various spiky pits that litter the pathway to successfully having a RESTful Data Source refreshing on a schedule in the Power BI Service. But before we get to the how, we must first understand the why.
To make a short story slightly longer (and provide context), exposé was recently engaged with a client who was looking to get more insight into the performance of their development team. As a primarily Microsoft-based organisation, the reporting platform that was most utilised (and best understood) was Power BI, and the flexibility in development that Power BI provides was something they wanted to maintain. However, as part of the movement towards more measured project management and bug tracking, the business had decided to migrate to a Jira platform, which does not natively connect to Power BI. Whilst Jira does have a (quite good in my opinion) native reporting platform, the business wished to keep all of their reporting in the Power BI Service, for both the reasons listed above and minimisation of technical debt (report sprawl through multiple platforms is a one-way ticket to an expensive amalgamation effort). So we ended up in somewhat of a square peg and a round hole situation.
The business benefits to solving a problem of this nature are plainly obvious. In any business there are going to be multiple data sources that may not necessarily play nicely with each other, and this can cause serious dilemmas when trying to gain holistic insights into how the business is running. For larger organisations needing high numbers of reports it may be prudent to engage a team of Data Engineers who can cultivate, format, and model the data to be easily imported and visualised by the Visualisation specialists (in fact, this is recommended). For smaller organisations or small numbers of reports however, this large-scale operation may not be financially viable. Luckily most web-based services have a common data output pipe that we can attach to in such circumstances, being a RESTful API. Connecting this pipe to a Power BI Service-hosted dataset, however, does come with a number of challenges.
Challenge #1: Power BI has no native RESTful connection
When we create a dataset in PowerBI (standalone datasets are almost always better, but that’s a topic for another post), we need to connect to data. But how do we do this when there’s no native connection? Searching for ‘API’, ‘REST’, or (in this instance) ‘Jira’ in the ‘Get Data’ connection source window gives us a frustrating 0 results. But RESTful APIs are served via Web request, so we can utilise the ‘Web’ connection and start on our journey.
When we open a Web data connection we’re presented with both a ‘Basic’ and ‘Advanced’ series of parameters. For the purpose of this we’re going to do most of our configuring under the hood, so we can start with a Basic connection. If you’re reading this I’m assuming you’re somewhat familiar with how APIs work, so you can simply plug the web URL of the service’s API into the ‘URL’ box. For Jira (to get a list of all issues), this takes the format of https://<site URL>/rest/api/2/search?jql=
Clicking ‘OK’ will then prompt us to choose our authentication method, and if we click ‘Organisational account’ we can enter our username and password and data will start to flow! Congratulations, you’ve connected to a RESTful API, simply publish and start building reports…
… not really. Unfortunately, this leads nicely into the next challenge.
Challenge #2: Avoiding Organisational Authentication
There are a number of reasons to avoid using personal organisational authentication when creating a dataset, but one is good enough: if you develop something and your account is disabled, things break. To avoid this it’s best practice to use an authentication token, which Jira is happy to provide. Ideally this is linked to a service account for the same reason.
The easiest way to inject this token into our new data connection is via the PowerQuery editor. Clicking on the source we just created you should see something in the formula bar along the lines of:
= Json.Document(Web.Contents(“https://<site URL>/rest/api/2/search?jql=”))
We can place our Authentication token into the headers of this request by modifying it to the following:
= Json.Document(Web.Contents(“https://<site URL>/rest/api/2/search?jql=”, [Headers=[Authorization=”Basic <authentication token>”]]))
This should now allow us to change the Data Source credential type to ‘Anonymous’ (File -> Options and Settings -> Data source settings -> Edit Permissions -> Credentials -> Edit… -> select Anonymous and save) while still being able to refresh the data. At least, locally…
Challenge #3: Refreshing the Data Source in Power BI Service
We now have some data that we can play with, expand on, chop, change, move, and model. Most of the time when we get to this point it’s time to start building measures and calculated columns, sanitise things, publish to the Power BI Service and away we go. In this particular instance however, we’re not quite there yet.
The reason for this is that if we published the Dataset as it stands, we would not be able to refresh it (nor schedule a refresh) in the Service. This is because the Power BI Service will throw an error if we try to refresh a dynamic data sources. What this basically means is that the Service must be able to validate that the query runs successfully without needing authorisation before applying the authorisation token. Microsoft documentation of this issue is available here.
This means that a small manipulation is required to the M code to break the URL up, allowing for the first part of the URI to be verified as runnable by the Service before adding our direct linking to the information we’re trying to get. This is done as follows:
In our ‘source’ step, we need to modify the formula (via the formula bar) to separate the Web.Contents() string into sections – the first (main) section that can be accessed without needing authorisation, and a RelativePath section that takes us the rest of the way. The format is as follows (pseudocode):
Web.Contents(“https://<publicly accessible URL>”, [Headers[Authorization=”Basic <authentication token>”], RelativePath=”/<path to be appended to public url>”])
In our Jira example, our RelativePath is “/rest/api/2/search?jql=”. By modifying the source step for our dataset and publishing to the Service, we are now able to refresh (and schedule refreshes on) our data! Well, at least the first 50 results.
P.S. On the first publishing and refreshing this data source you will run into an authentication issue in the service indicating that you need to provide credentials. This is fixed by navigating to the settings of the data source, selecting ‘Anonymous’ as our Authentication method, and opting to skip test connection.
Challenge #3: Overcoming Pagination
The default behaviour of a lot of cloud services is to serve paginated results from their APIs (hence the 50 results above). This is probably wise from the point of view of the services, as it prevents poorly (or lazily) designed queries from consuming unnecessary bandwidth (and potentially cost). However, for our applications we really need more than one page of data, so we need to find some way to overcome this.
In some instances, we are able to modify the page size up to a more reasonable number (say, 100,000 results). However, the Jira (and most larger cloud-based SaaS providors) developers saw this coming and hard limited the number of results returned by the service to a maximum of 100. For our application that means we have to take a bit more of a programmatic approach, as we definitely want more than that.
To best understand the next bit it is important that we have a bit better understanding of PowerQuery M. At its core it is a functional programming language, which has a bunch of features that are really not explored fully due to the immense drag and drop (or no-code) capabilities contained in the PowerQuery engine. These programmatic characteristics have a multitude of uses that are rarely utilised, and I would encourage more BI Developers dive under the hood to see what potential improvements can be made to their data wrangling. For this application we want to utilise these programmatic characteristics by way of writing a loop function that returns all of our results.
There are a few good blog posts that explain how we can create functions in PowerQuery M (including specific code examples, which I won’t go over here) to solve problems, including some that cover looping and pagination. The general methodology is to create a lovely query that utilises parameters to pass information to a function which loops through the pages and requests all the information from the API, which it then concatenates back into one dataset. There is however a problem with this approach, which is that refreshing functions in the Power BI Service is not currently supported.
There’s one final piece to this puzzle, which is hidden a bit in plain sight; the invocation of custom functions as a query step. Instead of creating a single almighty query that does everything we need, in order to have functionality in the service we need to break it into two steps; firstly generate all of the parameter information (i.e. number of pages, number of results per page, etc.), and then pass that information to a function that then retrieves the data from the API. The best way (I’ve found) to do this is by use of a custom column. Again we can modify the M code to pass more information (parameters) in to the function if we need to, which will help us jam more functionality in and supersede the ‘normal’ use of a custom column.
Congratulations, you can now use REST APIs as a Data source in Power BI
To recap, the strategy we have employed has been formed in three parts, enforcing a number of best practices that will hopefully be obvious as to why they needed to be used;
- Circumvent the use of personal accounts in preference of organisational tokens
- Modify our query to allow for successful refreshing of our data source in Power BI
- Use the programming functionality of PowerQuery M to creatively overcome stagnation in pagination by way of a custom function and custom column.
This strategy is an extremely effective way to simply, quickly, and cheaply move cloud-based platform data into Power BI, without necessitating a large-scale architectural solution. By no means should this be used as a replacement for such a solution if the use case dictates, but for small applications or interim measures on critical data it is extremely effective. Between the engagement for which this method was developed and the writing of this blog I have also confirmed this works for all tested RESTful APIs, so the example here utilising Jira is just an example and not a limitation.
Limitations and Considerations
There are a couple of limitations to the use of this technique that are noteworthy – primarily around visibility, frequency, and cost.
This technique does require some specialised knowledge to develop, modify, and upkeep. Whereas a more conventional data engineering solution may have commonly-held language such as Python or Scala at its core, using PowerQuery M to this level may limit the pool of developers that are able to maintain the solution. This is especially important in a consulting context – some consultancies may see this as job security but at exposé we want to empower the businesses we work with, not hamstring them.
This is not really applicable in the context of microbatch or live data refresh periods, which is true of nearly all RESTful API solutions. Due to the number of back and forward communications required to retrieve the data from the platform, and the type of retrieval going on, there are far fewer optimisations that can be made to reduce query refresh time. This is not to say it can’t be done, but there is definitely a lower limit that is largely dependent on the implementation of the API. For significant datasets this will be reflected in the refresh time.
Some platforms charge on a per-API-call basis. Due to the pagination we can be making a significant number of calls to the API, so it is important to understand the costing structure and make sure the business is aware as such; no-one wants to be stuck with an unexpected large cost. An important query optimisation with respect to this is making sure that our page sizes are as large as they can be. In our Jira example, the default page size is 50 and maximum is 100, meaning that we would need twice as many API calls (and potentially twice the cost) if we left the page sizing as default. An additional optimisation that can be made is limiting the frequency of refreshes, which will impact cost similarly.