PolyBase – Another Method for Creating a Stretch Database in SQL Server 2016

PolyBase, which was released with SQL Server 2016, provides another method to access live data either locally or in the cloud, very similar to the SQL Server Stretch Database feature. PolyBase also provides a more cost-effective way to keep cold data available, streamlines on-premises data maintenance, and keeps data secure even during migration. PolyBase differs from Stretch Database in a few ways: the SQL you write is different, queries are noticeably slower, and it is a lot less expensive. The cost is significantly less because storing data in an Azure blob store starts at about 1 cent per month, while Stretch Database starts at $2.50 per hour. In this post, I will show how to take data that was archived because of its age (in this case, data created in 2012) and store it in an Azure Blob Storage file that will be available via PolyBase whenever I need it.

Implementing a PolyBase Stretch Database

PolyBase is an optional feature of SQL Server 2016, and the instance feature PolyBase Query Service for External Data needs to be selected as part of the installation process. Two services are installed with the feature, SQL Server PolyBase Engine and SQL Server PolyBase Data Movement. Both of these services must be running and TCP/IP must be enabled for PolyBase to work. Either check for those services or run the query SELECT SERVERPROPERTY ('IsPolybaseInstalled') AS IsPolybaseInstalled; which will return a 1 when PolyBase has been installed. The next step is to tell SQL Server what the external source is by configuring which Hadoop connectivity option will be used.


EXEC sp_configure 'hadoop connectivity', 7;
GO
RECONFIGURE;

To use Azure Blob Storage, there are three different options, 1, 4 and 7, which include the ability to access blob storage. These options also allow various other Hadoop engines to be accessed as well. I have selected 7, which also allows me to use Hortonworks on Linux. For more information on the connectivity options, check out this link. A database master key is required to secure the connection as well. The next step is to let SQL Server know exactly where the blob storage data is stored, which creates an entry under External Data Sources. PolyBase also needs to know how the data is formatted, which will be stored in an External File Format. By definition, Hadoop data has no schema, so a schema is going to have to be created and stored in an external table. The location for each of these items is shown within SQL Server Management Studio as shown here.

This code will create a secure key using a password I made up

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@ssW0rdPolybase'

PolyBase will need the Azure Blob Storage access key to be able to make a connection.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCred
WITH IDENTITY = 'user', Secret = 'lEGL66LiK2KE2U0WEb435PH15BwFOInrHqQ1AJigjoRVfEOrOge+TLbBNu861cqbC+a26io92o5sw/b+OTD/C4w=='; --Note: Replace with your Blob Storage key


CREATE EXTERNAL DATA SOURCE AzureBlobStorage with (
TYPE = HADOOP,
LOCATION ='wasbs://gingeriottest@gingertestiotblobstore.blob.core.windows.net', --This is the location of the folder inside blob storage containing my data
CREDENTIAL = AzureStorageCred
);

The data is stored in blob storage as a CSV, so I will need to create that file format

CREATE EXTERNAL FILE FORMAT CSV WITH (
FORMAT_TYPE = DELIMITEDTEXT
,FORMAT_OPTIONS (FIELD_TERMINATOR =',', DATE_FORMAT='MM/dd/yyyy') )

Lastly, a table definition must be created for the file so that it can be accessed as a table

CREATE External TABLE OntimePerformance2012sampleall(
[DimAirlineKey] int,
[DimOriginAirportKey] int,
[DimArrivalAirportKey] int,
[DimCancellationReasonKey] int,
[DimDelayLengthKey] int,
[DimDepartureBlockKey] int,
[DimArrivalBlockKey] int,
[DimDistanceGroupKey] int,
[FlightDateKey] int,
[FlightNumber] varchar(50),
[ScheduleDepartureTime] int,
[ActualDepartureTime] int,
[DepartureDelayInMinutes] int,
[TaxiOutTime] int,
[TaxiInTime] int,
[ScheduleArrivalTime] int,
[ActualArrivalTime] int,
[ArrivalDelayInMinutes] int,
[ScheduleElapsedTime] int,
[ActualElapsedTime] int ,
[DistanceInMiles] int,
[CarrierDelayInMinutes] int,
[WeatherDelayInMinutes] int,
[NASDelayInMinutes] int,
[SecurityDelayInMinutes] int
)

WITH
(   LOCATION = '/OntimePerformance2012.csv',
DATA_SOURCE = AzureBlobStorage,
FILE_FORMAT = CSV
)


All of the external tables created can be found in the catalog view sys.external_tables, or by looking at sys.tables, where the new column is_external is set to 1. Once the setup of PolyBase is complete, I can query the external and internal tables all at the same time, like I do here.


SELECT count(*)
, r.cancellationReason
, d.CalendarYear
FROM dbo.CancellationReason r
JOIN dbo.OntimePerformance2012sampleall c
ON c.dimcancellationReasonKey = r.CancellationReasonKey
JOIN dbo.DimDate d
ON c.flightDateKey = d.DateKey
GROUP BY r.cancellationReason, d.CalendarYear
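
As a quick sanity check, the catalog views mentioned above can be queried directly. This is just a convenience check based on the views named earlier; it is not a required part of the setup.

-- External tables created through PolyBase appear in sys.external_tables
SELECT name, location FROM sys.external_tables;

-- They are also flagged in sys.tables by the new is_external column
SELECT name, is_external FROM sys.tables WHERE is_external = 1;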

PolyBase Table Considerations

It is not possible to edit the data through PolyBase, as the external data is read-only. Also, since the data is being pointed to externally, it cannot be indexed, and querying it is not terribly quick. All of the data will be brought down and then filtered, unless PolyBase is configured to improve performance, which I will discuss in a future blog post.

PolyBase and Azure Blob Storage provide the ability to access data stored offsite in a secure location in a cost-effective manner. If you saw my presentation on this topic at PASS Summit or have questions about how to make it work in your environment, please contact me on Twitter or write a comment on my blog.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2016 Polybase Setup

After reading my last post you should be all ready to install SQL Server 2016 to be able to use Hadoop.  I went through all of these steps in my webinar, but I thought it might also be a good idea to include them here so you don’t have to watch the video.

Polybase Install Feature

When installing SQL Server, in the Feature Selection list shown below, PolyBase Query Service for External Data must be selected.

To check whether PolyBase has been successfully installed, go to Control Panel -> Administrative Tools -> Services. There are two services added for PolyBase, SQL Server PolyBase Data Movement and SQL Server PolyBase Engine, as well as a bunch of other new ones for SQL Server 2016. The PolyBase services and SQL Server will need to be restarted in a later step. When starting SQL Server Management Studio for SQL Server 2016, it is hard to spot the differences between it and previous versions of SQL Server, as even the icon looks exactly the same. Once you create a database, you will notice a slight difference under the Tables folder. When PolyBase is installed there are two new folders, External Tables and External Resources. Tables that point to data accessed via PolyBase appear under the External Tables folder, and External Resources contains the references to the external resources, the HDFS cluster where the data is stored, as well as the file formats of the underlying data.

Configuring Polybase on SQL Server

SQL Server needs to be configured to use PolyBase by using the sp_configure command

EXEC sp_configure 'hadoop connectivity', 5;
GO
RECONFIGURE; 

The number indicates the kind of Hadoop connectivity desired. If you were using HDInsight's Azure Blob storage or Hortonworks on Windows, the number listed would be 4. The number 5 indicates connectivity to Hortonworks Data Platform on Linux. Go ahead and exit out of SQL Server, as you will be restarting it soon.
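
If you later need to confirm which connectivity option was set, the current value can be read from the catalog. This is a minimal convenience check, not a required step in the setup:

-- Show the configured value and the value currently in use for PolyBase Hadoop connectivity
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';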

Configuration File Modification

Open up your favorite text editor to modify the Hadoop.config file.  The default location is

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

Keep in mind, if you installed SQL Server with a named instance, the path will include the name of that instance. The config file contains a default user name for Hadoop. The value, which is initially set to pdw_user, is probably a holdover from pre-2016 days when PolyBase was only available on Microsoft's Big Data Appliance, the Analytics Platform System [APS], which was previously called Parallel Data Warehouse [PDW]. Hortonworks' default user is hue, so you will want to modify the file so that HadoopUserName is set to hue. I circled it below in a clip of the Hadoop.config file.

Once the changes to the Hadoop.config file are saved, the last thing which must be done to get PolyBase to work is to restart the two new PolyBase services and SQL Server 2016.

Setting up the External Data Source and External Data Files

At this point, we can now tell SQL Server where the Hadoop files are located. Open SQL Server Management Studio [SSMS] and create a new query to create an external data source, which is going to appear in the new External Data Sources folder.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)

After running this command and refreshing, the source HDP2 will appear under the External Resources -> External Data Sources folder
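
If the folder does not show it right away, the catalog view sys.external_data_sources confirms the data source was created; a quick check (a minimal sketch, nothing specific to this example) looks like this:

-- Confirm the external data source exists and points at the sandbox
SELECT name, type_desc, location
FROM sys.external_data_sources;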

SQL Server needs to know what the underlying file format is for the data stored in Hadoop. Assuming the data is stored in a tab-delimited format, this command will tell SQL Server how to read the tab-delimited data and the format of the dates. This will allow PolyBase to read a sample file which comes with Hortonworks, the file SAMPLE_07.

CREATE EXTERNAL FILE FORMAT TSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '\t',
DATE_FORMAT = 'MM/dd/yyyy'
)
)

SQL Server needs a schema to read the data, as it doesn't know where the fields are without it. The following code creates a table which will appear under the External Tables folder in SSMS, and tells SQL Server where the data lives and to use the file format which we just added.

CREATE EXTERNAL TABLE Sample_07
(
code nvarchar(255),
description nvarchar(255),
total_emp int,
salary nvarchar(255)
)
WITH
(
LOCATION = '/apps/hive/warehouse/sample_07',
DATA_SOURCE = HDP2,
FILE_FORMAT = TSV,
REJECT_TYPE = value,
REJECT_VALUE=0
)

After this step is complete, you can use the new table to join data stored inside the HDFS cluster to data in SQL Server.
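
As a quick smoke test before joining to your own tables, the external table can be queried like any other table. The query below is only a sketch against the Sample_07 columns defined above; PolyBase fetches the rows from Hadoop at query time.

-- Query the external table directly; the data stays in the HDFS cluster
SELECT TOP 10 code, description, total_emp, salary
FROM Sample_07
ORDER BY total_emp DESC;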

I hope this helps you get started with using polybase with SQL Server 2016.  Let me know what you think by posting a comment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

DIY – Getting started with Hadoop and Virtual Machines

In my recent presentation on PolyBase in SQL Server 2016, I demonstrated how to get started using Hadoop with SQL Server 2016. In this post I will show you everything you need to do to run the demonstration yourself. The first step is to create a virtual machine so that you can run a Linux instance for Hadoop. As I know that installing a virtual machine can be intimidating, this post explains what you need to do, and how to fix a problem you may run into when running a virtual machine.

Creating Your Own Virtual Machine

Previously, spinning up a virtual machine meant purchasing software. No more, as there is now an open source application. In the example shown here, the Linux operating system will be installed; you can put any operating system you want on your virtual machine, provided of course you have a license for it. If you don't feel comfortable installing non-released versions of code like SQL Server 2016 on your PC, a virtual machine is a great way to test it out. You will need to provide your own operating system, but there are trial versions you can use for limited periods of time as well. Oracle VM VirtualBox is an open source virtual machine application, and you can download it here. This software is needed prior to installing the Hortonworks Sandbox. Obviously Hortonworks is not the only version of Hadoop available; Cloudera has a Hadoop VM too, which you can download as well. Personally I am not a huge fan of the Cloudera Manager, which is why I prefer Hortonworks, but either will work with PolyBase.

Troubleshooting Why the Virtual Machine Won’t Run

If you are using a Windows operating system, chances are it may be running Hyper-V. Hyper-V is Microsoft's virtual machine platform. For you to be able to use Hyper-V, you will need to have a virtual machine file saved in the VHD format so you can load the image in Hyper-V. Hortonworks provides two versions of their sandbox, one in VMware format and one in VirtualBox format. For this reason, Hyper-V is not an option as a virtual machine for Hortonworks. And if Hyper-V is running, you can't run another VM application.

The boot configuration data store could not be opened. Access is denied

The error message received when Hyper-V is running is pretty cryptic. What is "The boot configuration data store could not be opened. Access is denied" supposed to mean anyway? When wandering around the internets trying to find an answer to this question, there is a lot of <sarcasm>wonderful advice</sarcasm> which states that the only thing to do is to replace your computer entirely because the BIOS won't support virtual machines, even after you check the BIOS and find out that yes, your BIOS does support virtualization. Don't be deterred. The error can be fixed without new hardware. To resolve this error, Hyper-V needs to be turned completely off. In case you were wondering, stopping the Hyper-V services won't fix this. Instead Hyper-V must be disabled at the command prompt, and then a reboot is required. Make sure you run the command prompt as administrator, because the command won't run if you don't. To run the command prompt as administrator, search for CMD. When the result comes back, right-click on the cmd entry and select Run as administrator. To turn off Hyper-V, run this command:

bcdedit /set hypervisorlaunchtype off

Exit out of the command prompt. Remember the error won't be resolved until after rebooting. After the reboot, the boot configuration message won't appear when running the VM, and you can successfully get started running the Hortonworks Sandbox.
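
If you want Hyper-V back later, for example once you are done with the Hortonworks Sandbox, the same setting can be flipped the other way. Run this from an administrator command prompt and reboot again; it simply reverses the change above.

bcdedit /set hypervisorlaunchtype auto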

TL;DR – Links

If you just want to get started running a VM and PolyBase, here are the links needed to make that happen.

Oracle Virtual Box

Hortonworks Sandbox

Cloudera Quickstart

SQL Server 2016 RC1

Microsoft Server 2016 Preview

Once the VM is running with Hadoop, install SQL Server 2016 so that you can follow along with my next post, where I talk about how to use PolyBase. If you want to be notified when that post is available, please subscribe to my blog and you will find out automatically.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

2016 Releases of SQL Server and Excel

It's been 2016 for over a month now, so hopefully you are not still changing your 5 into a 6. Chances are you haven't changed your software to reflect the new year yet. Office 2016 came out in September of 2015, and SQL Server 2016 hasn't been released yet. It's hard to keep up with all of the version changes that have come out, especially when you throw in Power BI, which has something new every month. If you know you are going to be upgrading to one or both of these versions, or want to learn more about SQL Server or Excel so you can decide if it is worth the upgrade effort, this week I might be able to help out. I am going to be talking about the 2016 version of SQL Server on Wednesday, February 10 at the Arizona SQL Server User Group meeting and then talk about the 2016 version of Excel on Thursday at the Excel BI SQL PASS Virtual Chapter. If you are not in Arizona right now, you are missing out, as we are having Department of Tourism weather of 80 degrees. You can get back to me in August when I am melting in the 115 degree heat.

Polybase in SQL Server 2016

Since there are many new features to talk about in SQL Server 2016, I picked PolyBase. As big data matures, many places are looking to keep their structured data right where it is and create an HDFS cluster to store other data. PolyBase allows SQL Server 2016 users to look at both in one place.

Excel 2016

It's been a while since September 22, 2015, the date Office 2016 was released, but I still know very few people who have upgraded. I've been to a few clients that hope to upgrade to Excel 2013 this year. In this session, I will show where things got moved and renamed, what's new and what is on the deprecated list. If you don't have 2016 installed yet, or if you do and wonder where Power Query went, please join me to hear all about it. Generally speaking, the Virtual Chapters are posted on YouTube, and when they are I will have a link available. Unfortunately for those who attended my last Excel BI Virtual Chapter meeting, due to technical difficulties that recording is not available, but hopefully this time everything will work. When the recording is available I will make sure a link is available on my blog for those who can't make it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

What is the difference between Machine Learning and Data Mining?

An Example of Machine Learning: Google’s Self-Driving Car

Oftentimes when I give a talk about machine learning, I get a question about what is data mining and what is machine learning, which got me thinking about the differences. Data mining has been implemented as a tool in databases for a while; SSIS even has a data mining task to run prediction queries against an SSAS data source. Machine learning is commonly represented by Google's self-driving car. After reading the article I linked about Google's car, or studying the two disciplines, one can come to the understanding that they are not all that different. Both require the analysis of massive amounts of data to come to a conclusion. Google uses that information in the car to tell it to stop or go. In data mining, the software is used to identify patterns in data, which are used to classify the data into groups.

Data Mining is a subset of Machine Learning

There are four general categorizations of machine learning: anomaly detection, clustering, classification, and regression. To determine the results, algorithms are run against data to find the patterns that the data contains. In data mining the algorithms tend to be more limited than in machine learning. In essence all data mining is machine learning, but not all machine learning is data mining.

Goals of Machine Learning

There are some people who will argue that there is no difference between the two disciplines, as the algorithms, such as Naïve Bayes or decision trees, are common to both, as is the process of finding the answers. While I understand the argument, I tend to disagree. Machine learning is designed to give computers the ability to learn without specifically being programmed to do so, by extrapolating from the large amounts of data which have been fed to it to come up with results which fit the pattern. The goal of machine learning is what differentiates it from data mining, as it is designed to find meaning in the data based upon patterns identified in the process.

Deriving Meaning from the Data

As more and more data is gathered, the goal of turning data into information is being widely pursued. The tools to do this have greatly improved as well. Like Lotus 1-2-3, the tools that were initially used to create machine learning experiments bear little resemblance to the tools available today. As the science behind the study of data continues to improve, more and more people are taking advantage of new tools such as Azure Machine Learning to use data to answer all sorts of questions, from which customers are likely to leave (aka customer churn) to whether it is time to shut down a machine for maintenance. Whatever you choose to call it, it's a fascinating topic, and one I plan on spending more time pursuing.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2016 and PolyBase

The next release of SQL Server, SQL Server 2016, is continuing with a convention which was employed in previous releases, which I call the Cadillac release system. At General Motors, in the past, new features were first offered on their most luxurious brand, Cadillac, and if those features proved successful, they were rolled out to Buick and the rest of the product lines. Microsoft does the same thing. Their 'Cadillac' is the PDW [Parallel Data Warehouse], Microsoft's data appliance. One notable example of this release model was the addition of column store indexes to SQL Server. Column store indexes were first available on the PDW, or the APS as it is known now, and Microsoft later added column store indexes to SQL Server 2012. Later that same year, at SQL PASS Summit 2012, I heard about a really neat feature available in the PDW, PolyBase. The recording I heard is available here, where Dr. David DeWitt of Microsoft explained PolyBase in great detail. I have been waiting to hear that PolyBase was going to be released to SQL Server ever since. On May the Fourth, 2015, Microsoft announced the preview release of SQL Server 2016. Listed in the release announcement was the feature I'd been waiting for, PolyBase.

Sqoop Limitations

PolyBase provides the ability to integrate a Hadoop cluster with SQL Server, which will allow you to query the data in a Hadoop cluster from SQL Server. While the Apache environment provides the Sqoop application to integrate Hadoop with relational databases, it wasn't really enough. With Sqoop, the data is actually moved from the Hadoop cluster into SQL Server, or the relational database of your choice. This is problematic because you need to know, before you run Sqoop, that you have enough room within your database to hold all the data. I learned this the hard way when I ran out of space playing with Sqoop and SQL Server. From a performance perspective, this kind of data transfer is also, shall we say, far from optimal. Another way to look at Sqoop is that it provides the Hadoop answer to SSIS. After all, Sqoop is performing a data move, just like SSIS code. The caveat is that SSIS is generally faster than Sqoop, and provides a greater feature set too.

Polybase – Hadoop Integration with SQL Server

Unlike Sqoop, PolyBase does not load data into SQL Server. Instead it provides SQL Server with the ability to query Hadoop while leaving the data in the HDFS cluster. Since Hadoop is schema-on-read, within SQL Server you generate the schema to apply to your data stored in Hadoop. After the table schema is known, PolyBase provides the ability to query data outside of SQL Server from within SQL Server. Using PolyBase it is possible to integrate data from two completely different file systems, providing the freedom to store the data in either place. No longer will people automatically equate retrieving data from Hadoop with MapReduce. With PolyBase, all of the SQL knowledge accumulated by millions of people becomes a useful tool which provides the ability to retrieve valuable information from Hadoop with SQL. This is a very exciting development which I think will encourage more Hadoop adoption and, better yet, integration with existing data. I am really looking forward to SQL Server 2016.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure Data Lake: Why you might want one

On April 29, 2015, Microsoft announced they were offering a new product, Azure Data Lake. For those of us who know what a data lake is, one might have thought that having a new data lake product was perhaps redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let's look at what a data lake is. I think the best definition of a data lake that I read recently was here. Here's the TL;DR version: "A 'data lake' is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed." OK, so here's the question: one can spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. Since you can already create a data lake, why did Microsoft go and create a new product?

Hardware Optimization and the Data Lake

If you look at Microsoft's most recent Azure releases, you'll see they are releasing products designed to operate together. Service Bus, Event Hubs, Stream Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like vehicle GPS messages or other types of real-time status messages. In reading the product release for Azure Data Lake, they highlight its ability to store and, more importantly, retrieve this kind of data. The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake. If you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too. Data lakes are designed to store all data, but unlike a database or an operational data store, data lakes are designed to have the schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make accessing the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing computing power designed for massively parallel processing to provide the data when needed, which would be on the reading and analysis of the data, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which uses a standard hardware model for storage and access. By tailoring the hardware to meet the needs of the specific type of data stored, in theory this will greatly improve performance, which will increase the adoption of not only the Azure Data Lake, but the tools to analyze and collect the data too. It's going to be interesting to see how the marketplace responds, as this could really push massive amounts of data to the Azure cloud. Time will tell.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


Azure ML, SSIS and the Modern Data Warehouse

Recently I was afforded the opportunity to speak at several different events, all of which I thoroughly enjoyed. I was able to speak on Azure Machine Learning first at the Arizona SQL Server Users Group meeting. I really appreciate all who attended, as we had quite a crowd. Since the meeting is held practically on Arizona State University's Tempe campus, it was great to see a number of students attending, most likely due to Ram's continued marketing efforts on meetup.com. After talking to him about it, I was impressed at his success at improving attendance by promoting the event on Meetup, and wonder if many SQL Server User Groups have experienced the same benefits. If you have, please let me know. Thanks, Joe, for taking a picture of the event too.

Modern Data Warehousing Precon

The second event where I had the opportunity to talk about technology was the precon at SQL Saturday in Huntington Beach, where I spoke about Modern Data Warehousing. It was a real honor to be selected for this event, and I really enjoyed interacting with all of the attendees. Special thanks to Alan Faulkner for his assistance. We discussed the changing data environment, including cloud-based storage, analytics, Hadoop, handling ever-increasing amounts of data from different sources, and increasing demands from users, and we reviewed technology solutions that demonstrate ways to resolve these issues in attendees' environments.

Talking and More Importantly Listening

The following day was SQL Saturday #389 in Huntington Beach. Thanks to Andrew, Laurie, Thomas and the rest of the volunteers for making this a great event, as I know a little bit about the work that goes into planning and pulling off an event like this. My sessions on Azure ML (Predicting the future with Machine Learning) and Top 10 SSIS Tuning Tricks were both selected, and I had a great turnout at both. To follow up on a question I received during my SSIS session, Balanced Data Distributor was first released as a new SSIS transform for SQL Server 2008 and 2008 R2, so you can use it for versions prior to SQL Server 2012. I've posted more information about it here. I also got a chance to meet a real live data scientist, the first time that has happened. Not only did I get a chance to speak but a chance to listen. I really enjoyed the sessions from Steve Hughes on Building a Modern Data Warehouse and Analytics Solution in Azure, from Kevin Kline, and from Julie Koesmarno on Interactive & Actionable Data Visualisation With Power View. As always it's wonderful to get a chance to visit in person with the people whose technical expertise I read. In addition to listening to technical jokes which people outside of the SQL community would not find humorous, it's great to discuss technology with other practitioners. Thanks to Mr. Smith for asking me a question to which I didn't know the answer, which now I feel compelled to go find. I'll be investigating the scalability of Azure ML and R so that I will have an answer for him the next time I see him. I really enjoy the challenge of not only investigating and applying new technology but figuring out how to explain what I've learned. I look forward to the opportunity to present again, and when I do I'll be sure to update this site so hopefully I get a chance to meet the people who read this.
Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


What is a Modern Data Warehouse?

As I was honored enough to be selected to give a precon on the Internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven't given much thought to the changes in technology which have happened over the last few years. The major feature upgrades to SQL Server in 2012 and 2014 have meant that they can use column store indexes, which make things faster, and perhaps better high availability. While those things are certainly valuable improvements, there is a lot more that you can do to derive value from your data, and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2010, Borders Group Inc. was allowed by the Federal Trade Commission to sell their customer information to Barnes and Noble as part of their bankruptcy sale of their assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

While a lot of the time it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. Also, it would take a long time to recreate the environment, and no one is patient enough to wait for that. Alternatively, you could expand into areas of new technology as your data grows. Perhaps this means you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far-off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are most familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world no longer operates only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Stream Analytics, and Machine Learning, this isn't as hard to do as it might sound. We'll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but also into the data building up about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and of course SQL Saturday #389 – Huntington Beach which is the following day.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Upcoming and Up and Coming Topics

It's funny the different meanings words have when you put them in a different order, a point which anyone who has imitated the dialect of Yoda can tell you. I find words fascinating, as they are not static but have meanings which change over time. For example, the Iron Maiden meant something totally different before there were electric guitars. Thinking of words and things changing, as one year closes and another year begins, I start to evaluate past and future topics. Earlier this year, I held an informal poll on Twitter to find out how long people tend to talk on the same topic. The answers were quite varied. Some people keep on talking about the same topic as long as there seems to be interest in hearing about it; that way you can get to be a really good speaker on that topic. Another feels obligated to create a new topic each time out to give himself a challenge. The answer that I personally related to was to keep on talking about the topic until you are tired of hearing about it, which takes about a year.

SQL Saturday Albuquerque

My first upcoming engagement for 2015 will be at SQL Saturday Albuquerque, where I will be talking about SSIS. I generally talk about things I am interested in or presently working on, and having worked on a lot of ETL recently, I thought that it would be an interesting topic which most people would find helpful. As a consultant, I see a lot of code and wonder why parts of it were written that way. One big reason is someone thought the design was a good one. Since that is a subjective decision, I thought it might be helpful to back up design decisions with facts so that people would be able to employ good logic for their design decisions.

Technology changes and their Impact on Data Development

Another topic which really interests me is the impact that new technologies are having on the database world. With the increased implementation of Hadoop and the cloud, things are really changing in the way data is being both stored and used. Predictive analytics, machine learning, cloud implementations, and interactive data visualizations are changing what people expect from the way their data is stored and used. Expectations for data professionals are increasing as the business looks away from the HiPPO (the highest paid person's opinion) and towards the knowledge found in the data they have gathered or integrated from public sources.

Modern Data Warehouse

I have the pleasure of assisting in a day-long session on Architecting the Modern Data Warehouse. During this one-day session we will be showing how to use new technology such as HDInsight and Machine Learning to implement a modern data warehouse. Instead of just talking about new technologies, we will be putting them to use to show how they can be used today. I'm really looking forward to it.

If you are able to attend any of these or any upcoming sessions, please stop by and introduce yourself as I would love to meet readers of my blog in person.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur