When a data warehouse can work with unstructured data and a data lake can run analytics, how do you decide which to use? It depends on how often you need to answer new questions with data.
Traditionally, a data warehouse collects all the structured data from around your business so you can integrate it into a single data model, run analytics and get business intelligence out — whether that’s for developing new products or marketing existing services to customers. That used to be called ‘big data’, but all enterprises now have large amounts of data coming from sources like ecommerce sites, IoT devices and sensors, so a modern data warehouse needs to handle structured, unstructured and streaming data and offer real-time analytics as well as BI and reporting.
Businesses are increasingly doing that in the cloud for higher speed and lower cost. More and more of that data may be in the cloud already, as well as the services you want to use that data with, points out Azure corporate vice-president Julia White. “Increasingly as data is sitting in and moving to the cloud, whether it’s from SaaS applications or applications just moving to the cloud; the operational data is there and customers are asking ‘why would I take my operational data and offload it from cloud to on-premises just to do my analytics?’ It just doesn’t make sense.” (There’s still plenty of data on-premises and there will be more as edge computing grows, but many customers move some or all of that data to the cloud anyway, White says, depending on compliance issues.)
SEE: Microsoft Power BI: Getting started with data visualization (free PDF) (TechRepublic)
Every enterprise is looking into AI, “and they very quickly realise that analytics is the foundation of that,” White notes. “They start asking ‘what’s the state of my analytics and my data warehouse?’, and it’s often not good enough.”
The popularity of Power BI is also pushing more Microsoft customers to cloud analytics. “When they’ve got these powerful data visualisations, they start questioning their analytics capabilities — ‘I want to know what’s going on behind my data visualisation: I love Power BI and I wish my analytics were more interesting’,” says White.
More sophisticated customers are looking to analyse their own Office Graph data (which you can copy to Azure Data Lake using Azure Data Factory) or take advantage of the Open Data Initiative (ODI) between Microsoft, Adobe and SAP (which is built on Azure Data Lake and will eventually integrate data from many more software vendors). “Azure Data Lake is very tightly coupled with Azure Data Warehouse and customers are using Azure Data Warehouse to get more insights and build the modern data warehouse on top of it,” White says.
Which data service?
Microsoft has a range of cloud services that all look a little bit like a data warehouse, the most obvious being Azure SQL Data Warehouse (or 'DW' as Microsoft often calls it), but there's also Azure Data Factory, Azure Data Lake, Azure Databricks, Power BI and Azure Machine Learning, plus more packaged services like the AI sales tools in Dynamics 365.
The way to make sense of them is to look not just at the tools they offer, but also which users they're serving and how they work together. That's because, often, the data an enterprise has is fragmented across multiple data stores and the first step of creating a modern data warehouse is to integrate all those silos. The more of those different data stores that are on Azure, the easier the connections are going to be — which is one reason Microsoft offers so many different data services. The other, White says, is that customers aren't looking for a single tool that can do everything: "There's a set of nuanced choices and you're really going to pick and choose, and optimise what you use for your own scenarios."
Azure DW is for data engineers working with curated data. That might be data from a SQL Server database, but it might also be data that came from a pipeline built by those data engineers using Databricks or Spark and .NET to prepare data from a source like Azure HDInsight.
Azure Data Factory is another service for data engineers, covering data ingestion, transformation and orchestration. Think of it as a cloud-scale ETL tool: you can use it through a drag-and-drop interface (under the covers, that's actually Logic Apps), or with the Python, Java or .NET SDKs if you prefer to write code for the transformations and manage the steps of the pipeline yourself, pushing data through Databricks or HDInsight, into Azure Data Lake or out to Power BI.
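To make the pattern concrete, here is a minimal local sketch of the ingest-transform-load sequence that a Data Factory pipeline orchestrates. This is plain illustrative Python, not the Azure SDK; the activity names and record shapes are invented, and the in-memory list stands in for a warehouse sink. What Data Factory adds on top of a sequence like this is scheduling, monitoring and retries.

```python
# Illustrative sketch of an ingest -> transform -> load pipeline.
# Not the Azure Data Factory SDK: the record shapes are invented and a
# Python list stands in for the warehouse sink.
import json

def ingest() -> list[dict]:
    """Pretend to pull raw records from a source store (here, inline JSON)."""
    raw = '[{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]'
    return json.loads(raw)

def transform(records: list[dict]) -> list[dict]:
    """Clean up types so the records fit the curated warehouse schema."""
    return [{"id": r["id"], "amount": float(r["amount"])} for r in records]

def load(records: list[dict], sink: list) -> None:
    """Append curated rows to the sink (a stand-in for Azure DW)."""
    sink.extend(records)

# The "pipeline" is just the activities run in order.
warehouse: list[dict] = []
rows = ingest()
rows = transform(rows)
load(rows, warehouse)
print(warehouse)  # two rows with numeric amounts
```

Whether you build this with the drag-and-drop designer or the SDK, the shape is the same: each activity feeds the next, and the pipeline definition is what gets scheduled and monitored.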
Power BI can also do data transformation using Dataflows (also code free), but that’s intended to be a self-service feature for business analysts. Data engineers or full-time BI analysts might make the semantic models those business users work with, and Microsoft is adding more integration with Azure DW to Power BI.
Power BI users can add AI to their visualisations and reports. Some of that might be using Microsoft’s pre-built Cognitive Services for things like image recognition and sentiment analysis. But they might also be using custom AI models that data engineers have built for them in the Azure Machine Learning service, using all that enterprise data.
A warehouse near the lake
The complexity of these scenarios is why the line between data warehouses and data lakes is starting to look a little muddy in the cloud. A traditional data warehouse lets you take data from multiple sources and use ETL transformation to put that data into a single schema and a single data model in software that’s designed to answer questions you plan to ask over and over again.
Those sources don’t have to be structured, relational data: the PolyBase and JSON support in SQL Server and Azure DW means you can connect data from non-relational stores like HDFS, Cosmos DB, MySQL and MongoDB as well as Oracle, Teradata and PostgreSQL. That means a data warehouse (or even a SQL Server) can look more like a data lake.
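What that JSON support buys you can be sketched locally: documents from a non-relational store get flattened into rows and queried with ordinary SQL. The snippet below uses Python's built-in sqlite3 as a stand-in for the warehouse (it is not PolyBase, and the document fields are invented), but the idea is the same one Azure DW applies at scale.

```python
# Sketch of querying non-relational documents with SQL, the idea behind
# JSON/PolyBase support. sqlite3 stands in for the warehouse engine;
# the document fields are invented.
import sqlite3

docs = [
    {"user": "ana", "country": "PT"},
    {"user": "ben", "country": "DE"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user TEXT, country TEXT)")
# Flatten each JSON-style document into a relational row before inserting.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(d["user"], d["country"]) for d in docs],
)

rows = conn.execute(
    "SELECT country, COUNT(*) FROM users GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 1), ('PT', 1)]
```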
Data lakes let you take multiple data stores, both structured and unstructured, ingest them and store them in either their native format or something close to that format, so you have multiple data models and multiple data schema and the flexibility to ask new questions from the same data. (The SQL variant used for Azure Data Lake queries is called U-SQL, not just because it’s the next version after T-SQL, but because you might need a U-boat to go down into your data lake and find out what’s hidden in the murky depths.)
SEE: Microsoft Power BI: Data analytics goes mainstream (Tech Pro Research)
When you have a question you’re going to ask repeatedly (like sales analytics or monitoring delivery times for a dashboard), you can create a data warehouse from the relevant portions of data. But if the question changes over time, or you need to ask new questions, you can go back to the data lake where you keep that original data and create another data warehouse to answer those questions.
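The lake-then-warehouse pattern above can be sketched in a few lines. In this hypothetical example (the event fields are invented), raw events stay in the "lake" untouched, and each repeated question gets its own curated aggregate derived from them; when a new question comes along, you re-derive from the same raw data rather than reloading from source systems.

```python
# Sketch of the lake-then-warehouse pattern: raw events are kept as-is,
# and each repeated question gets its own derived aggregate.
# The event fields here are invented for illustration.
from collections import defaultdict

# The "lake": raw events in their original form, so new questions can
# be asked of them later.
lake = [
    {"region": "EU", "product": "widget", "amount": 10.0},
    {"region": "US", "product": "widget", "amount": 25.0},
    {"region": "EU", "product": "gadget", "amount": 7.5},
]

def build_warehouse(events: list[dict], key: str) -> dict:
    """Derive a warehouse-style aggregate keyed on one dimension."""
    totals: dict = defaultdict(float)
    for e in events:
        totals[e[key]] += e["amount"]
    return dict(totals)

# The question you ask repeatedly: sales by region.
by_region = build_warehouse(lake, "region")
# A new question later needs no reload -- re-derive from the same raw data.
by_product = build_warehouse(lake, "product")
print(by_region)   # {'EU': 17.5, 'US': 25.0}
print(by_product)  # {'widget': 35.0, 'gadget': 7.5}
```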
The combination of the two is what Microsoft means by a modern data warehouse infrastructure. You can take all kinds of data from different places, work with it in the data lake for things like real-time analytics, or use machine learning to discover patterns that tell you what insights you can get from the data and combine it with the familiar data warehouse tools to answer those questions efficiently.
Microsoft doesn’t have a single service for all that. You can do different parts of it with the various Azure services, which means you can pick and choose the parts you need. But it also means you’ll need to have the data expertise to build your own specific system.