SQL Server 2016 and PolyBase

The next release of SQL Server, SQL Server 2016 is continuing with a convention which was employed in previous releases, which I call the Cadillac release system.  At General Motors, in the past new features were first offered on their most luxurious brand, Cadillac, and if these features prove successful, they are rolled out to Buick and the rest of the product lines.  Microsoft does the same thing.  Their ‘Cadillac’  is the PDW [Parallel Data Warehouse], Microsoft’s Data Appliance. One notable example of this release model was the addition of column store indexes to SQL Server. Column store indexes were first available on the PDW, or APS as is was known then, and Microsoft later added column store indexes to SQL Server 2012. Later that same year, at SQL PASS Summit 2012, I heard about a really neat feature available in the PDW, PolyBase. The recording I heard is available here, where Dr. David DeWitt of Microsoft explained PolyBase in great detail. I have been waiting to hear that PolyBase was going to be released to SQL Server ever since.  On May the Fourth, 2015, Microsoft announced the preview release of SQL Server 2016. Listed in the release announcement was the feature I’d been waiting for, PolyBase.

Sqoop Limitations

PolyBase provides the ability to integrate a Hadoop cluster with SQL Server, which will allow you to query the data in a Hadoop Cluster from SQL Server. While the Apache environment provided the Sqoop HadoopSqoopapplication to integrate Hadoop with other relational databases, it wasn’t really enough. With Sqoop, the data is actually moved from the Hadoop cluster into SQL Server, or the relational database of your choice. This is problematic because you needed to know before you ran Sqoop that you had enough room within your database to hold all the data. I remembered this the hard way when I ran out of space playing with Sqoop and SQL Server. From a performance perspective, this kind of data transfer is also, shall we say, far from optimal. Another way to look at Sqoop is that it provides the Hadoop answer to SSIS. After all Sqoop is performing a data move, just like SSIS code. The caveat is SSIS is generally faster than Sqoop, and provides a greater feature set too.

Polybase – Hadoop Integration with SQL Server

Unlike Sqoop, PolyBase does not load data into SQL Server. Instead it provides SQL Server with the ability to query Hadoop while leaving the data in the HDFS clusters. Since Hadoop is schema-on-read, within SQL server you generate the schema to apply to your data stored in Hadoop. After the table schema is known, PolyBase provides the ability to then query data outside of SQL Server from within SQL Server. Using PolyBase it is possible to integrate data from two completely different file systems, providing freedom to store the data in either place. No longer will people start automatically equating retrieving data in Hadoop with MapReduce. With PolyBase all of the SQL knowledge accumulated by millions of people becomes a useful tool which provides the ability to retrieve valuable information from Hadoop with SQL. This is a very exciting development which I think will encourage more Hadoop adoption and better yet, integration with existing data. I am really looking forward SQL Server 2016.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

7 comments on “SQL Server 2016 and PolyBase

    1. Ginger Grant

      Jay –
      Good for you on getting started on SQL Server 2016 already. In order to use PolyBase, you will of course need Hadoop instance for you to connect to, assuming it is on Linux, you will need to configure SQL Server to use this way
      EXEC sp_configure ‘hadoop connectivity’, 5
      GO
      RECONFIGURE
      GO
      You’ll need to ensure the user used to access the Hadoop cluster is listed in the Hadoop configuration file, which you can find here C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config
      Hadoop access will require you to restart your SQL Server Service. After that you can create the external data sources and access the data like you would any other table.
      Regards,
      Ginger

  1. DB

    Hi,

    i am trying to understand if the PolyBase is the MS take on the concept of Data Lake, and if it is and my understanding is correct, can PolyBase connect to any other sources i.e. Excel doc? Thanks in advance.

    1. Ginger Grant

      DB —

      Generally speaking when people talk about a data lake, they talk about a single repository for all of their data, both structured and not structured so it can be used for analysis. Since PolyBase can allow SQL Server users to access their data stored within an Hadoop HDFS file structure, the data contained within HDFS could be a data lake, but their is no requirement that the Haddop HDFS file store actually will be a data lake. For example the HDFS cluster could merely be where all of the log file records are stored, while the rest of the data is stored in a structured fashion within SQL Server. Because data stored in an HDFS file store is written generally speaking three times, and more often than not on a Linux operating system, there needed to be a tool to access the data stored within HDFS. Sqoop is one way to do that, and now PolyBase is another. In answer to your question, that means PolyBase is specifically designed to connect to HDFS, and no other. If you are interested in joining Excel data to SQL Server Data, the functionality is available using Power Pivot features from within Excel.

      I hope this makes it more clear what PolyBase can and cannot do. Thanks so much for your question as I love to get feedback from people reading my blog.

      Regards,

      Ginger

  2. Pingback: Microsoft SQL Server 2016 – Lista de Novidades – Parte II « Junior Galvão - MVP - SQL Server

  3. Dave Mason

    RE: Polybase – Bringing Together Hadoop and SQL Server 2016 (presentation on March 22nd)

    Hi Ginger,
    Hadoop/Polybase on SQL totally reminds me of using OPENROWSET w/ a format file to “query” files outside of SQL. Am I close?

    Does the external file have to reside in/on Hadoop?

    Dave (@BeginTry)

    1. Ginger Grant

      Dave —

      Polybase is a lot like OPENROWSET, but there are some differences. In answer to your question the external file does have to reside in an HDFS file structure as that is what polybase is designed to do, translate data from a different file structure. The HDFS file structure is different than storing a file on a server, as it contains a unique record layout with redundantly stored data. Not only that, but in the demonstration I did today, the file was stored not only in a different format, but on a different operating system as my HDFS cluster was running in Linux. For companies who have embraced storing their data in Hadoop, now SQL server becomes a way of viewing that data, without needing to use any of the hadoop toolset, like hue or ambari, as users can just continue to use their SQL skills.

      Hopefully that made sense, and thanks so much for your feedback.

      Regards,
      Ginger

Leave a Reply to Ginger Grant Cancel reply