Microsoft® HDInsight Preview for Windows: How to use Sqoop to load data into HDFS from SQL Server?

In this post, we’ll see how to use Sqoop to load data into HDFS from SQL Server?

With that, here are the steps:

1. You have the Microsoft® HDInsight Preview for Windows Installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on windows 7

2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Before we begin, decide on three things:

3a: Username and Password that Sqoop would use to login to the SQL Server database. If you create a new username and pasword, test it via SSMS before you proceed.

3b. select the table that you want to load into HDFS

In my case, it’s this table:

sql table to be loaded into hadoop hdfs from sql server3c: The target directory in HDFS. in my case I want it to be /user/data/sqoopstudent1

You can create by command: hadoop fs -mkdir /user/data/sqoopstudent1

[to learn about how to create directory, read: How to create a directory in Hadoop File System? ]

4. Now Let’s start the Hadoop Command Line (can you see the Icon on the Desktop? Yes? Great! Open that!)

5. Navigate to: c:\Hadoop\sqoop-1.4.2\bin>

*This path may change in future, but navigate to the bin folder under the SQOOP_HOME.

6. Run dir command to see various files under this directory.

sqoop list files under the HOMe directory import export

Also you can run sqoop help for more information on the command that we are about to run.

sqoop list of commands help

7. Now here’s the command to Load data from SQL Server to HDFS:

c:\Hadoop\sqoop-1.4.2\bin>sqoop import –connect “jdbc:sqlserver://localhost;dat
abase=UniversityDB;username=sqoop;password=**********” –table student –tar
get-dir /user/data/sqoopstudent1 -m 1

sqoop command to load data from sql server to hadoop file system

8. After successfully running the above command, let’s browse the file in HDFS!

sqoop see the content of the file

That’s about it for this post!

Thanks

Thanks Aviad Ezra who answered my question on this MSDN thread: An error while trying to use Sqoop on HDInsight to import data from SQL server to HDFS

Conclusion:

In this post, we saw how to load data into Hadoop from SQL Server using Sqoop (SQL Hadoop)

Related Articles:

How to clean data using Regular Expressions in Data Quality Services?

In this blog – post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality services to clean records. For the purpose of the demo, I’ll show how to clean perform a preliminary cleaning procedure on email records.

Before we begin, just a primer on Regular Expressions:

“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: WikiPedia Regular Expressions

Let’s take an example:

using Regular Expressions you can:

  • check if a text starts with A or a
  • check if a text contains the word “data”
  • check if a text follows the following pattern:  something@something.something.
  • among other things..

Now, Note the power of Regular Expressions to check for a pattern like something@something.com – now can we not use it to perform a basic cleaning on email records? I think, we can! and so, let’s try that:

for the purpose of the blog post, let me quickly show how you can create a Domain in a Knowledge Base that uses Regular Expressions to clean email records:

1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain

2) Switch to Domain Rules Tab

3) create a new Domain domain rule which uses the Regular Expression: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:

1 sql server data quality cleaning email regular expressions

4) You can use this Knowledge base to clean email records in Data Quality Records But for now,  let’s test our Regular Expression >  click Test Domain Rule > Enter few records > Test them!

2 sql server data quality cleaning email regular expressions test

Note: This is just a basic demo to show Regular Expressions in action. For cleaning Email records, there are other options like using third-party data sets or writing an advanced regular expression. The RegEx I shared was just meant for quick demo but if you want to read more about Regular Expressions used to clean emails then I will strongly recommend you to read email regular expressions that I got to know from via the discussion on a Beyond Relational Tip that I had posted.

Conclusion:

In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services

Related articles

How to detect unrealistic or invalid values using Data Quality Services?

When you see a price of an item as -10 (negative 10) – you know it’s not right! It’s BAD data!

When you see some person’s height as 120 inches (10ft!) – you know that’s just not true! Again, BAD data!

It’s not uncommon to see such values in REAL data sets. In this post, we’ll see how SQL Server 2012′s Data Quality Services would help us clean unrealistic a.k.a invalid values in our datasets and bring us a step closer to a CLEAN data!

Situation:

You’re an analyst for professional basketball team Dallas Mavericks and You need a Player Roster of your Basketball players having CLEAN data for analysis purposes. Someone recently passed you a player roster, great! you have data!

Problem:

Some of the Data about Height of players just doesn’t look right! A player with named Dominique Jones had height of just 5 inches.

dallas mavericks player names vs height for data cleaning project

That’s just not right. So for now, you may clean this data-set manually by looking up right values but wouldn’t it be great if you had a Data Cleaning solution that flags unrealistic values for your review every time you get such data-sets? So the analyst decided to build a solution in Data Quality Services – Here are the steps:

Steps to create a Knowledge Base in DQS and clean the invalid values:

Now, First let’s create the Knowledge Base:

You decide to create a rule that checks for player heights that’s less than 5ft and greater than 7ft 6inches for your review. So here’s what the Knowledge Base would look like:

domain management knowledge base domain rules data sqlNow after the Knowledge base (KB) has been published, it’s time to use this KB for a Data Quality Project.

Note: This KB can be re-used for creating KB’s for other teams too.

Here’s the result of running the results on our Data – Set:

3 data cleaning of invalid height values sql server data qualityNote that it says that height has 3 invalid records. Let’s correct them!

In the next tab, you can see the invalid records:

iteratively clean invalid data sql server unrealistic data quality

You can correct this invalid values one by one and then approve them:

specify the valid values approve data quality sql

After that in the next steps, you can see the results – I have highlighted records that we cleaned!

corrected data set clean sql server data quality

And here’s our clean data-set:

dallas mavericks player names vs height cleaned dataset dec 12 2012

Conclusion:

Yay! An analyst of a professional basketball team was able to quickly clean a data-set. And he can re-use the knowledge base he had built to clean other data-sets too in the future! Isn’t that nice?

In this post, we saw how to clean unrealistic or invalid records in a data set by using domain rules in SQL Server 2012 Data Quality Services.

Azure PASS VC Next meeting: Kung Fu Migration to Windows Azure SQL Database

Azure PASS VC’s next meeting:

Kung Fu Migration to Windows Azure SQL Database

Speaker: Scott Klein, Technical Evangelist Microsoft

Summary: As cloud computing becomes more popular and cloud-based solutions the norm rather than the fringe, the need to efficiently migrate your database is crucial. This demo-filled session will discuss the tips and tricks, methods and strategies for migrating your on-premises SQL Server databases to Windows Azure SQL Database, AKA SQL Azure. Focusing primarily on SQL Server Data Tools and the DAC Framework, this session will focus on how these tools can make you a kung-fu migration master.

About Scott: Scott Klein is a Corporate Technical Evangelist for Microsoft focusing on Windows Azure SQL Database (AKA SQL Azure) and related cloud-ready data services. His entire career has been built around SQL Server, working with SQL Server since the 4.2 days. Prior to Microsoft he was a SQL Server MVP for several years, then followed that up by being one of the first 4 SQL Azure MVPs. Scott is the author of over ½ dozen books for both WROX and APress, including Pro SQL Azure. He can be found talking about Windows Azure SQL Database and database scalability and performance at events large and small wherever he can get people to listen, such as SQL Saturday events, local SQL Server user groups, and TechEd.

Details at http://azure.sqlpass.org/

Download the calendar file: http://www.sqlpass.org/iCal.aspx?EventID=245

How to Join Azure PASS VC’s?

If you want to stay updated on meeting announcements, please consider registering on PASS’s website and Joining our VC:

If you do not have a SQLPASS account:

a. Go to http://www.sqlpass.org/RegisterforSQLPASS.aspx

b. Fill up the required information and register

Now, After successful login/registration – Go to http://www.sqlpass.org/MYPASS.aspx

a. switch to MyChapters section

b. Now under virtual chapters, you would see a list of virtual chapters. Join the one’s you are interested in!

my PASS my Chapter Azure VC

I look forward to seeing you at next Azure PASS VC’s meeting!

How to import data from Excel file to SQL Server:

In this post, we would see how to import data from Excel file to SQL Server using Import and Export Wizard, we would import data from four excel worksheets to SQL server. Here are the steps:

1) Right click the Target Database > Tasks > Import Data to open the SQL Server Import and export wizard

import data ssms sql server import export

2) select the Data Source. In this case, Excel file. You would select the worksheets to fetch data from later, for now – select the excel file as shown below:

import data ssms sql server import export

3) choose the Target Database:

choose the target database import and export systems

4) Select the tables (in our case- worksheets) or specify a query. For the purpose of this demo, we would go with “copy data from one or more table or views” option:

 specify table copy or query

5) Select the worksheets from the Excel file and also specify the tables in the target systems. If the tables in the target database are not already created, then would be created by this wizard:

sql server map source and targetsql server map source and target

6) You have the option of running this wizard immediately or you have the option of creating a SSIS package and then run it later. For the purpose of this post, we would select Run Immediately

7) Review the information and click Finish

sql server review the complete wizard

8) After successful execution:

8 sql server import export execution was successful

Conclusion:

In this blog post, we saw how to import data from Excel file to SQL server.

 

Guest Blog: How we use Fuzzy Lookup add-in in our company to solve data inconsistency problems:

This is a Guest Blog from Mantresh Jain.

About Mantresh Jain:

Mantresh Jain is a C Level Executive at SMB in manufacturing domain based out of India. He has bachelor’s degree from a business school. And he holds a special interest in how businesses can leverage newest Information Technology Tools for optimizing business processes. He is working on a company-wide ERP implementation and is a single point of contact for the implementation process. He spends his free time on computer games of all kinds! Link with him here: http://www.linkedin.com/pub/mantresh-jain/43/562/749

 How did they discover Fuzzy Look-up add-in for Excel (A write-up by Paras)?

Some months ago, Mantresh approached me to see if I knew any tool that would help him deal with “messy” data. On Further questions, I learned that

-          Messy data = lots of duplicates

-          Uses SQL Server Express & do NOT have plans to upgrade to SQL Server versions that include Data Quality Services and/or Master Data Services. Remember the context here: They are a small and medium size business.

-          Do use Excel – a lot!

-          Do not have folks w/ “SQL” knowledge

With this requirements, I asked him to see if an add-in for excel called “Fuzzy Lookup” meets their need. After trying it out: here’s Mantresh’s experience of using Fuzzy Lookup add-in for Excel in their organization:

Summary:

In my company we are implementing ERP software. I faced a problem of Data migration from two fox Pro based software’s to SQL (for ERP)

More Details:

Two fox Pro Software’s worked independently form each other. And as a result each of them had their Separate Databases.

Lets Call them FX1 and FX2.

Now I wanted to import Account Master Data from them to SQL, Here are the fields in our Account Master data:

Name, Address, Bank Details, Phone Number among other fields

Problem

Both systems had issues of data Duplication and Data Inconsistency

To give you an example, I faced following problems:

1) FX1 had around 3500 entries and FX2 had 2400 entries

Now in FX1 out of 3500 around 2000 were same as FX2

Also FX2 had around 2000 entries same as FX1

Now i wanted to import only unique Account Master gathered by “combining” the two systems to SQL.

Example:

FX1 has “VMS Industires” while FX2 has “V.M.S Industries”

Solution

Fuzzy Look up add-in for Excel.

Step 1) Import data from both databases to excel

Step 2) Using Fuzzy Look up to find data matching to each other based on variable conditions that we select.

Step 3) It reorganizes data as

FX1 entry 1st matching FX2 Entry
2nd Matching FX2 entry

This is how we  find Duplicate entries and then clean our data-set

Benefit

If not for Fuzzy Look I would have had to manually match each entry to each other which would have taken estimated 60 to 100 Man Hours but with Fuzzy Look-up, we did the job in 24 Man Hours Only.

———————-

Conclusion by Paras:

Thanks Mantresh for sharing your experience!

And here’s a related post:
How to clean similar textual data in Excel via Fuzzy lookup add-in?

Data Quality Servies: What does a locked Knowledge Base indicate?

In this blog-post, we would see what does it mean to lock a knowledge base in Data Quality Services? So the lock on the Knowledge Base indicates that there are unsaved changes in the Knowledge base when you or someone else was working on it.

In the Data Quality Client, Here’s how a lock on the Knowledge Base looks:data quality services knowledge base lock

And here are few points for a locked knowledge base:

1) If you did not lock the Knowledge Base then you can open it in read-only only

2) if you locked the Knowledge base, you can open and edit it. The Knowledge base would be opened in the state that it was closed in.

3) A user working on the Knowledge base can unlock it by publishing it or by unlocking the knowledge base.

4) By positioning the cursor on the knowledge base – you can see who locked it:

who user lock the knowledge base data quality

Conclusion:

In this blog post, we saw what does a lock on a knowledge base in Data Quality Services mean?

How to add a column in an imported Table while developing SQL Server 2012 Analysis services Tabular Model

Do not get intimidated by the long title – it’s not as complex as it sounds! So what am i talking about? Recently, I was developing SQL Server 2012 analysis services Tabular Model (referred to as ‘model’ from now on) and during development I realized that I need to import a column for a table. Let me give an example of what I am talking about:

Consider In the data source there’s a table named ‘Date’ with 15 columns, But for the model I needed only few columns – So I imported only 5 (out of 15) columns from this table:

DateKey
EnglishMonthName
MonthNumberOfyear
CalendarQuarter
CalendarYear

Now During development I figured I need one more column which I had not imported before. Let’s call it FullDateAlternateKey

So How do you add the column ‘FullDateAlternateKey’ in your model during development?

Turns out, It is very simple. This is how you do it:

1. Select the Table in your model designer
2. Go to Table properties.
(If you are not able to see the properties window. Go to View > Properties Windows OR Just select the table and press F4)
3. This is how a properties windows for a Table looks like

properties windows table sql server 2012 tabular mode analyis services

4. Now here click on ‘..’ button after you select the ‘Source Data’ field from the Basic Properties
Source Data edit table sql server 2012 tabular mode analyis services
5. You’ll see a ‘Edit Table Properties’ dialog box. Here you check the column that you wish you import to the model. In my case, I checked the box for the column ‘FullDateAlternateKey’:analysis services check the box for column to be imported 6. Click on OK.

Allow it few seconds and you would see the column in your table! Didn’t I tell you – It’s very simple!

That’s about it for this post. So next time, You need to import a column while developing the model, do not worry – it’s just couple of clicks. But, Even though I was able to add a column later, I had to look at the relationships – create one relationship – make sure other things were right. In short – It was more work. So it’s better that you select the columns that you want in the first place. But If the need arises, you can always add columns (and even remove them) via Table properties. I hope it was helpful!

 

SQL Azure: Indexes are very helpful but they cost $Money$

This blog post is about SQL Azure and not SQL server. And it’s also not about importance of indexes. We know that indexes are important in the world of SQL server as well as in the world of SQL Azure. But in the world of SQL Azure, there is an important concern – when you create indexes, they would occupy “Space” and it will increase the database size. So what’s the point here?  The point is that the pricing model of SQL Azure is database size driven. So bigger the database size, bigger would be your bill. It’s that simple. So to figure out the cost of indexes, we need to know the size of indexes and, We do have a TSQL query to know the size of each index in a database/table.

So here is a TSQL Query that would list indexes in a given database along with its size in KB.

SELECT
SI.name, SUM(PS.reserved_page_count) * 8.0 as "Size in KB"
FROM sys.indexes AS SI
JOIN sys.dm_db_partition_stats AS PS
ON SI.object_id = PS.object_id
AND PS.index_id = SI.index_id
GROUP BY
SI.name
ORDER BY
2 DESC

I ran the above query by logging into the SQL Azure database ‘AdventureWorksLTAZ2008R2′ (which is a sample database for SQL Azure) as a service administrator and here’s the result:

SQL Azure Findding size of IndexesYou can also list indexes along with its size for a particular table in a database. Here’s the TSQL code (Replace SalesLT.product name with your table name):

SELECT
SI.name, SUM(PS.reserved_page_count) * 8.0 as "Size in KB"
FROM sys.indexes AS SI
JOIN sys.dm_db_partition_stats AS PS
ON SI.object_id = PS.object_id
AND PS.index_id = SI.index_id
WHERE SI.object_id = object_id('SalesLT.product')
GROUP BY
SI.name
ORDER BY
2 DESC
GO

Now once you know the size of the index – the next task is to calculate the cost.
The cost of SQL Azure is dependent on the size of a database and here is the pricing model: https://www.windowsazure.com/en-us/pricing/details/#database

And one can definitely embed the pricing logic in the TSQL Query we just saw. check this article for reference: http://blogs.msdn.com/b/sqlazure/archive/2010/08/19/10051969.aspx – Just note that the pricing model is slightly different now. The pricing model is same for database size upto 50 GB. But when you above 50 GB (upto MaxSize 150 GB) there is NO additional cost. so essentially. 50 GB db = $499.95 per month and 150 GB = $499.95 per month. Update your TSQL queries accordingly!

So, Make sure that you use indexes that you have created And Drop indexes that are not used. you need to evaluate performance gain vs cost for your scenario.

Note:

  • I am not implying that Indexes on SQL server are “free”. They consume resources in the world of SQL server too. This post just happen to focus on SQL Azure side of it.
  • The Pricing Model of SQL Azure may change in future.
  • You also need to evaluate performance impact of creating indexes on create, update and delete queries

Summary – SQL server to SQL Azure migration << Paras Doshi

In this blog post, I aim to summarize database migration options that I have been blogging about for past few weeks. Choosing right tool is a key decision when you decide to migrate SQL server to SQL Azure – And I hope this blog post can help you decide which is the best tool to be picked for your scenario:
 
Here is the summary:
 
Scenario/Tool Generate script wizard SSIS BCP DAC v2.0
Migrate SQL schema YES YES   YES
Migrate Data YES YES YES YES
Migrate large Data   YES YES YES

 

Blog Posts featuring tools that help migrate SQL server database to SQL Azure:

BCP:

Migrating SQL server Data to SQL Azure using BCP << Paras Doshi

SSIS:

How to use SQL server Integration services (SSIS) to migrate data from SQL server to SQL Azure << Paras Doshi

Generate script wizard:

How to use Generate script wizard to migrate SQL server Database to SQL Azure << Paras Doshi

Dac’s v2.0 (SQL Azure Import Export CTP):

[ SQL Azure Labs ] Exploring SQL Azure Import/Export Feature (CTP)

 

Related posts:

Migrating TSQL file and SQL server 2005 / 2008 Database to SQL Azure using SQL Azure Migration wizard << Paras Doshi

“Migrating Data into Microsoft’s Data Platform – SQL Azure” << SolidQ Journal Article

 

CrossPosted From: Summary – SQL server to SQL Azure migration << Paras Doshi

 

Can you please consider posting your valuable feedback/comments/suggestion in comments section? Thanks!