<img alt="" src="https://secure.hims1nice.com/151009.png" style="display:none;">
" style="background-color: #2c3e50;">


Analyzing Unstructured Data with PolyBase in SQL 2016

Posted by Lucas Feiock on Nov 2, 2016 10:10:42 AM
Lucas Feiock
Find me on:
PolyBase is a Microsoft technology that can query Hadoop clusters or Azure Blob Storage. This is achieved by using T-SQL statements to import and export data between relational and non-relational data sources. While PolyBase is similar to Hive's HiveQL, Teradata's SQL-H, Apache Drill, Apache Sqoop, or Cloudera Impala it is different. With PolyBase, SQL Server Developers can use existing skills to query Hadoop. The need for SQL Server Developers to learn Hadoop or Linux is not required.
SQL Server 2016 PolyBase*Source: https://msdn.microsoft.com/en-us/library/mt143171.aspx


In the beginning, PolyBase was released with Microsoft's Parallel Data Warehouse (PDW) appliance. The current Microsoft appliance is called Analytics Platform System (APS). These appliances use massively parallel processing (MPP) architectures to create high performance and scalability systems. They are built with multiple servers operating self-sufficiently and control over its memory and disk. This is called a shared-nothing architecture. These systems are more expensive than traditional database servers. PDW and APS have used PolyBase to query Hadoop and other distributed computing systems. This technology is now available with SQL Server 2016.


Connect with KiZAN on LinkedIn for current job postings and the latest Microsoft news.

Performance Improvements

PolyBase and the SQL Server query optimizer will push the computation to Hadoop to increase performance. It uses statistics on the external tables to determine when to leverage Hadoop. With PolyBase a T-SQL statement can be executed and join on multiple MapReduce jobs. This is the benefit over previous solutions as the query engine handles the MapReduce translation. There is another feature of SQL Server 2016 to overcome bottlenecks by using scale-out groups. PolyBase Groups assign one SQL Server as a Head node to run the PolyBase engine and multiple other SQL Server instances to run as Compute nodes. Massive data sets can be queried efficiently by splitting up the workloads to multiple instances. With SQL Server 2016 PolyBase Groups is an Enterprise edition feature, and only Compute node is available for Standard edition.


Big Data Extract, Transform, and Load

Using PolyBase and Columnstore Indexes within a Data Warehouse increase analytic processing by having the proper technologies do what they do best. Store massive data sets in on-premise Hadoop or Azure HDInsight. Execute PolyBase queries to process the data. Those results can then be inserted into a table with a Columnstore index. This analytic ETL solution provides distributed computation and a gain of 10x query performance for reporting.


PolyBase and Azure Blob Storage

Azure Storage is a significant cost savings solution for large datasets. Azure Blob Storage is specialized for handling structured and unstructured data as blobs. Azure Blob storage is cost-effective (see additional pricing details). These storage containers can be scaled up as needed to 500 TB. PolyBase is also able to access Azure Blob storage. This is a great solution for large batch analysis of log files, images, large text documents, social data, or device data with one query. If high throughput is required for accessing Azure Blob storage, one solution is to use Azure ExpressRoute. This will create a private connection between Azure datacenters and your on-premise sites.

Run a PolyBase example query

SQL Server 2016 PolyBase requires Java SE Run Time version 7.51 or higher (64-bit). TCP/IP connectivity must be enabled for communicating remotely to Azure. The following code demonstrates how to set up a connection to a publicly available Azure Blob storage.

SQL Server 2016 with PolyBase enables enterprise-ready data analytics. The combination of tested and proven SQL Server engine with the distributed computing power of Hadoop will allow enterprise access to analyzing all types of data from within SQL Server. Existing SQL skill sets can be used, and the need to learn new technology is not required.

Additional links to get started with PolyBase can be found below.

SQL Server 2016 Developer Edition

PolyBase installation

Get started with PolyBase

Azure SQL Data Warehouse and PolyBase


Have thoughts on this subject? Please comment below.

Posted by Lucas Feiock

Lucas is our Senior Business Intelligence Consultant and has a background in data analytics and data warehousing. He has over fourteen years of experience in Information Technology, including web/application development, help desk support, data analysis, reporting, modeling, and machine learning. He has worked in multiple industries and multiple cross-business teams. Lucas is a frequent speaker at professional conferences and technology events, covering topics such as Azure Databricks, Geospatial Analytics, Azure Analysis Services, Power BI, Azure Machine Learning, and Azure Synapse Analytics. He has a wealth of experience in troubleshooting issues, performance, implementing best practices, system integration, and dashboard reporting. Experience with the following programming languages: SQL, C#, JavaScript, Python, R, Scala, Java, PowerShell, Bash, DAX, MDX

Topics: SQL Server