Author: Etienne Oosthuysen
Data enablement – can Azure Purview help?
For several years now, I have been evangelising the need for data enablement in organisations. In a nutshell, this means getting data to consumers faster, whether they are the final consumers of reports or people who go on to create data solutions, such as data engineers, data scientists and business-focussed data modellers. It also means accepting that data transformations, data modelling and new data solutions (models, reports, dashboards, etc.) will occur as part of a highly agile, rapid development process and will be done by multiple data workers, some of whom are external to ICT.
Technology has now reached the point on the maturity curve where this enablement will accelerate and become the norm, replacing old-school, linear data workloads performed only by central BI/ICT teams. Core to these technologies are data lakes for storage at scale (the ‘lake’ of your ‘lake house’), Synapse or Databricks for on-demand transformations and the virtual data warehouse layer (the ‘house’ of your ‘lake house’), Power BI for business models and visualisations (the ‘front door’ to the whole thing) and, of course, resources to move data into and around the ecosystem.
BUT all this enablement now demands a massive rethink of governance, both in terms of methodology and technology. Long, laborious, theory-heavy data governance methodologies simply won’t keep up with the rapid growth of the internal data market and the many workers across the organisation who take part in data-related activities. An alternative, much more pragmatic methodology is required, and it must be supported by technology that possesses two crucial things: (1) artificial intelligence to simplify and accelerate the process of data cataloguing and classification, and (2) crowd sourcing, so that users across the business can quickly add to the collective knowledge of the data assets of the business. And it is in the technology space where Azure had a massive blind spot.
Introducing Azure Purview.
The word Purview simply means ‘range of vision’, and when it comes to data, the greater the range of this vision and the clearer the objects you see, the better. Will Azure Purview live up to the definition of its more generic namesake, and will it cover the blind spot I mentioned?
The current generally available version is the first of multiple planned modules for Purview, i.e., the Data Catalog module. This first module supports AI-based cataloguing and classification of data across your data estate, curation, and data insights. In addition, users will be able to use and maintain business glossaries, define expressions to classify data based on patterns beyond the out-of-the-box classifications (let’s call these bring-your-own (BYO) expressions), see ownership and custodianship, view lineage, etc.
This will be of immediate benefit to anyone seeking pragmatic data governance. Its 100+ out-of-the-box scanning rules immediately provide a heap of knowledge about the data in your data estate, something that would otherwise have required resource-intensive and error-prone human activity. It also enables data workers to augment or override the AI scanning in a crowd-sourcing ecosystem, or to BYO scanning rules.
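To make the BYO idea a little more concrete, here is a minimal Python sketch of how a pattern-based classification rule might work in principle. It is not Purview's API or implementation: the `PatternRule` class, the `classify_column` function, the example rules and the 75% threshold are all hypothetical names and values invented for illustration. The assumption is simply that a rule pairs a regular expression with a minimum share of sampled values that must match before a column gets the label.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternRule:
    """A hypothetical bring-your-own classification rule (not a Purview type)."""
    name: str               # label to attach to a matching column
    pattern: str            # regular expression a value must fully match
    min_match_ratio: float  # share of non-empty values that must match


def classify_column(values, rules):
    """Return the names of all rules matched by enough of the sampled values."""
    # Ignore missing and blank values so they do not dilute the match ratio.
    sample = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not sample:
        return []
    labels = []
    for rule in rules:
        regex = re.compile(rule.pattern)
        hits = sum(1 for v in sample if regex.fullmatch(v))
        if hits / len(sample) >= rule.min_match_ratio:
            labels.append(rule.name)
    return labels


# Illustrative rules only: the employee ID format is invented for this example.
RULES = [
    PatternRule("Employee ID", r"EMP-\d{6}", 0.75),
    PatternRule("Four-digit postcode", r"\d{4}", 0.75),
]

if __name__ == "__main__":
    column = ["EMP-004217", "EMP-110032", "EMP-000001", "n/a", "EMP-778899"]
    print(classify_column(column, RULES))
```

The threshold is what keeps this pragmatic: a handful of dirty values such as "n/a" does not stop a column from being classified, while a column that only occasionally matches the pattern is left alone for a human to review.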
In a recent road test, we dumped a whole load of data into Azure Data Lake and set Purview scanning loose over it to do its AI and built-in classifier magic. The results looked pretty good and go a long way towards fulfilling the requirement I mentioned earlier for pragmatic and accelerated data governance.
In the next blog, I will go through a detailed review and highlights of Azure Purview. Stay tuned!