Author: Etienne Oosthuysen
(And a party down the lakehouse)
Cloud databases let enterprises avoid large capital expenditures, can be provisioned quickly, and can provide performance at scale. But data workloads continue to change, fast, which means conventional databases alone (including those running in data warehouse configurations) can no longer cope with this fast-changing demand.
Exposé have over the past few years written and spoken extensively about why conventional data warehousing is no longer fit for purpose. The future data warehouse must at least:
- Be able to cope with data ranging from relational through to unstructured.
- Be able to host data ingested in batches (e.g. daily) as well as real-time streams, and everything in between.
- Be able to host data in its raw form, at scale, and at low cost populated by extract and load (EL) or data streams.
- Provide the mechanisms to curate, validate and transform the data (i.e. the “T” of ELT).
- Be able to scale up to meet increasing demand, and back down during times of low demand.
- Be able to integrate seamlessly into modern workloads that rely on the DW; these include AI, visualisations, governance and data sharing.
What is Azure Synapse Analytics?
Say hello to Azure Synapse Analytics now in public preview – https://aka.ms/Synapse_Insights4All
Microsoft describes Azure Synapse Analytics as a “limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data…, using either serverless on-demand compute or provisioned resources—at scale.” https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/
It is the new version of Azure SQL DW and gives Microsoft a stronger competitor platform against AWS Redshift, Google BigQuery and Snowflake. For background, a comparison between the three competitors can be found at https://gigaom.com/report/data-warehouse-cloud-benchmark/ – note that it was done for Synapse’s predecessor, Azure SQL DW.
No, what is it really?
Microsoft’s description of a “limitless analytics service that brings together enterprise data warehousing and Big Data analytics” can be translated as two siblings that historically hosted two different types of data: highly relational data (the ‘enterprise data warehousing’ or SQL workloads) and everything else, including semi-structured and unstructured data (the ‘big data’ workloads in data lakes). Synapse unifies the two in a workspace that allows users to query and use both SQL/relational and big data with the languages they are comfortable with (SQL, Python, .NET, Java, Scala and R). It breaks down the barriers between the DW and the data lake. Now that is HUGE (don’t act like you’re not impressed). Imagine that…a “lakehouse”.
This is shown conceptually in the image below.
Okay, so it’s SQL and Spark, wrapped into a clever unified limitless compute workspace? No, it’s a bit more than that.
Firstly, it includes Data Integration. It not only unifies the differing data types (i.e. relational and big data), but also provides the means to ingest and orchestrate this data using Azure Data Factory, which has become so pervasive in the market and is now available natively inside Synapse (where it is called Data Integration). Does this mean batch data loading only? No, you can of course load your real-time data streams into your data lake using some kind of IoT Hub/Event Hub/Stream Analytics configuration, or achieve low-latency data feeds into your data lake using Logic Apps or Power Automate.
Secondly, it not only integrates with Power BI, it actually includes Power BI as part of Synapse. In fact, interactive Power BI reports and semantic models can be developed within the Azure Synapse Studio. Imagine the ability to quickly ingest both structured and unstructured data into your data lake, either move the data into SQL (the data warehouse) or leave it in raw form in the lake. Then you have the ability to explore the data using a serverless SQL environment whether the data resides in the data lake or in the data warehouse, and potentially do this all in a Direct Query mode. This not only reduces the Power BI model footprint and hands the grunt over to Synapse, but also allows for much more real time reporting over your data. https://azure.microsoft.com/en-au/resources/power-bi-professionals-guide-to-azure-synapse-analytics/
Thirdly, it integrates seamlessly with Azure Machine Learning for those who need to use data from the unified platform for predictive analytics or deep learning and share results back into the platform for wider reuse. It also integrates with Azure Data Share for seamless and secure data sharing with other users of Azure.
Ah okay, so…
It unifies the DW and the data lake (real time, latent, and data of any type) and it also brings Data Integration and Data Visualisation into that unified platform. It then seamlessly integrates with Machine Learning and Data Share. So, it’s SQL, Spark, ADF and Power BI all at the same party, or ahem…lakehouse 😊 where you can ingest, explore, prepare, train, manage, and visualise data through a single pane of glass. Yes, we are bursting with excitement too!
Let’s get technical
Let’s look at some of the technical aspects of Synapse:
- Users can query data using either serverless on-demand compute or provisioned resources.
- Serverless on-demand compute (technically this is called SQL on demand) allows you to pay per query and use T-SQL to query data from your data lake in Azure rather than provision resources ahead of time. The cost for this is noted as approximately $8.90 per TB processed – https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/. This feature is still in Preview.
- Provisioned resources (technically this is called SQL Pools) follow the incumbent SQL DW data warehouse unit (DWU) regime, which allows the user to provision resources based on workload estimates and then scale up or down, or pause, within minutes – see https://azure.microsoft.com/en-us/pricing/details/synapse-analytics/. This feature is in General Availability.
- Azure Synapse Studio supports a user’s ability to ingest, explore, analyse and visualise data using a single sleek user interface, which is sort of a mix between the Azure Data Factory and Databricks UIs.
- Users can explore the data using a refreshed version of Azure Data Explorer.
- Users can transform data using both T-SQL (data engineers) and Spark Notebooks (data scientists).
- On the security front, there is threat detection, transparent data encryption, always-on encryption, fine-grained access control via column-level and native row-level security, as well as dynamic data masking to automatically protect sensitive data in real-time.
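To make the pay-per-query model concrete, here is a minimal Python sketch of the arithmetic, assuming a flat rate of roughly $8.90 per TB processed as quoted above. The function name, the use of binary terabytes and the absence of any minimum charge or rounding are our assumptions for illustration only, not Microsoft's billing rules, so please check the pricing page for the actual terms.

```python
# Rough cost estimate for a single serverless (SQL on demand) query.
# Assumption: a flat rate of about USD 8.90 per TB processed, as quoted in the text.
PRICE_PER_TB_USD = 8.90

# Assumption: 1 TB = 1024^4 bytes; the real billing unit may differ.
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_processed: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated cost in USD for the given volume of data processed.

    Hypothetical helper: ignores any minimum charge or rounding rules.
    """
    if bytes_processed < 0:
        raise ValueError("bytes_processed must not be negative")
    return bytes_processed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    scanned = 250 * 1024 ** 3  # a query that processes 250 GB
    print(f"Estimated cost: ${estimate_query_cost(scanned):.2f}")
```

Under these assumptions, a query that processes 250 GB costs a little over $2, which is why limiting the data each query has to read matters in a pay-per-query model.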
Please see this important fact sheet and a list of capabilities in General Availability vs those in Preview – https://azure.microsoft.com/en-us/services/synapse-analytics/#overview
Also, please see our essential cheat sheet for the Synapse SQL on-demand test drive, where we put that exciting new service through its paces and help you get up and running, quickly – https://exposedata.com.au/azure-synapse-analytics-the-essential-spark-cheat-sheet/
What are the business benefits?
They are numerous, but in our humble opinion, and it must be noted that we do have extensive experience in data warehouses and modern data platforms, these are:
- The lakehouse that unifies Spark and SQL engines, PLUS the ability to query them through a single pane of glass, is something the industry has been asking for, for a long time, as it breaks down data silos. As a result, it also breaks down skill silos, as those familiar with SQL can continue using SQL and those that prefer Python, Scala, Spark SQL, or .NET can do so as well…all from the same analytics service.
- The new serverless on-demand compute model allows users to use T-SQL to execute serverless queries over their data lake and pay for what they use. This, coupled with the Provisioned Resources model, gives customers multiple ways to analyse data so they can choose the most cost-effective option for each use case.
- Security features including column-level security, native row-level security, dynamic data masking, and data discovery and classification are all included at no additional cost to customers.
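As a rough illustration of choosing the most cost-effective option per use case, the sketch below compares a month of pay-per-query usage with a month of provisioned compute. It is a simplification under stated assumptions: the serverless rate is the approximate $8.90 per TB mentioned in this article, while the provisioned hourly rate is a purely hypothetical placeholder (real DWU pricing depends on the service level and region), and the names used are ours.

```python
from dataclasses import dataclass

# Assumption: approximate serverless rate quoted in this article.
SERVERLESS_PRICE_PER_TB_USD = 8.90

# HYPOTHETICAL placeholder, not a real price: look up the hourly rate
# for your chosen DWU level on the pricing page.
HYPOTHETICAL_PROVISIONED_RATE_PER_HOUR_USD = 1.50


@dataclass(frozen=True)
class CostComparison:
    serverless_usd: float
    provisioned_usd: float

    @property
    def cheaper(self) -> str:
        """Name of the cheaper option (serverless wins ties)."""
        if self.serverless_usd <= self.provisioned_usd:
            return "serverless"
        return "provisioned"


def compare_monthly_cost(
    tb_processed: float,
    provisioned_hours: float,
    price_per_tb: float = SERVERLESS_PRICE_PER_TB_USD,
    hourly_rate: float = HYPOTHETICAL_PROVISIONED_RATE_PER_HOUR_USD,
) -> CostComparison:
    """Compare pay-per-query cost with the cost of running (unpaused) provisioned compute."""
    return CostComparison(
        serverless_usd=tb_processed * price_per_tb,
        provisioned_usd=provisioned_hours * hourly_rate,
    )


if __name__ == "__main__":
    # Occasional exploration: 10 TB processed vs. a pool running 100 hours.
    light = compare_monthly_cost(tb_processed=10, provisioned_hours=100)
    # Heavy reporting: 100 TB processed vs. the same 100 hours.
    heavy = compare_monthly_cost(tb_processed=100, provisioned_hours=100)
    print("light workload:", light.cheaper)
    print("heavy workload:", heavy.cheaper)
```

With these placeholder numbers the light, exploratory workload is cheaper on the serverless model, while the heavy workload favours provisioned resources. The point is the shape of the trade-off, not the specific figures.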
How can Exposé help?
We are Australia’s premium data analytics company and have, since our inception, made sure we fully understand changes in the data analytics market so that we can continue to tailor our best-of-breed architectures and solutions to our customers’ benefit. Below are some of the highlights of our journey:
- We were the first consultancy to coin the phrase “friends don’t let friends build old school and expensive data warehouses” – we were passionate about finding solutions that were truly modern and delivered the best ROI.
- We were one of the first consultancies in Australia to understand the value that Databricks could play in modern data workloads, championed it, and facilitated one of the most high-profile solutions, which has won our client multiple awards.
- We were selected as the runner-up for the Power BI Global Partner of the Year 2019 for the big data smart analytics solution we created for our customer, which embraced many leading-edge Azure services and culminated in a Power BI analytical and monitoring solution.
- We went on to create a modular and industry agnostic Digital Twin product, built on Azure big data services, and bringing together Power BI & gaming engines in an immersive user experience that seamlessly ties into existing customer Azure investments.
It is this passion, our focus on R&D and you the customer, which makes us a good partner in your Synapse journey.