Snowflake

Snowflake Migration using StreamSets

#OnTapToday, using @StreamSets to migrate data from your source database to Snowflake. In this post, I will walk through the process of configuring StreamSets to migrate data from my on prem DB to Snowflake.

WHY

You need to get some data from a source database to the Snowflake data warehouse. There were a few options at my disposal. I could have:

StreamSets operates on a Hybrid OpenSource Model (HOSS). StreamSets Data Collector is open source, released under the Apache 2.0 license. The remainder of the platform source code is proprietary and not available to customers

In this case, I chose StreamSets.

SETUP

The setup was actually quite complex on my Macbook as there were some hacks necessary on the OS to allow adjustments to the java heap size. Rumor has it, this is just an issue with Mac OS, and other linux OSes are fine. Fortunately, all this can be found with some google searches.

Here are the installation instructions:

  • Download, install, and run Data CollectorRemember on a Mac there are some additional OS configurations necessary to get around the heap size issue. I’ll try to locate the link that helped me out and update the post.
    • Once you have the Data Collector installed properly, you need to launch it with the following command:
      • bin/streamsets dc

  • Launch the Data Collector, via the URL presented after the service starts and click on the Package Manager icon.

 

  • In the search box, enter “Snowflake”, select Snowflake Enterprise Library 1.1.0 from the list, click Install icon and accept Terms of Service*. Once the installation completes, follow instructions to restart Data Collector.

  • After the Data Collector restarts, we can create the data migration pipeline.

Since this a simple multi-table migration, we can build this pipeline by connecting a single Origin and Destinations. In this case, we will use the JDBC Multitable Consumer origin and the Snowflake Replicate destination.

  • From the Data Collector GUI, select Origin in the upper right corner and enter JDBC in the search box. Double click the JDBC Multitable Consumer icon to place it in the workspace. In this example, to make things simpler, I will be connecting to an already existing Oracle database running in my basement.

  • Next select Destinations from the same dropdown in the upper right and search for Snowflake. Again double click the Snowflake icon to place it on the workspace.

  • Finally connect the JDBC Origin and the Snowflake Destination by clicking and dragging a connector between the two.

As you can see, there are a number of tab associated with the two phases of this migration pipeline. Complete these with the appropriate values.

JDBC Multitable Consumer:

JDBC

Tables:

Leave all other tabs with any defaults

Snowflake

Snowflake Connections Info

Snowflake

Staging

Leave all other tabs with any defaults

 

Now let’s switch over and set up the Snowflake data warehouse. If you do not have an account, you can visit this link to start a free trial. Once you have successfully logged in, simply copy and paste this following set of into the worksheet.

We are now ready to run the StreamSets job. You will notice that we did not create any tables in the Snowflake data warehouse. One of the options we select in the StreamSets configuration was to automatically create the tables we are migrating.

 

Migrate

Back on the StreamSets Data Collector interface, click Start in the upper right.

The process will begin migrating the data from the source database, in this case, Oracle to Snowflake

 

I happen to know that there are 193 records across the tables that match the criteria (Table names start with DB_). Can you find where we did that??

After a few seconds, the tables have be created and the data has been migrated.  Let’s check it out.

A review of the History tab in Snowflake shows the process StreamSets executes to:

  1. Create the required file format
  2. Create the table
  3. Use the put command to load the data into the internal stage
  4. Executes the copy command to populate the target table

 

 

In the Snowflake worksheet execute the following:

This should give you row counts for each table that was migrated.

 

Here is the same statement run from SQL Developer. All counts match, so the migration was successful.

This was a very simple example of using  a StreamSets to migrate multiple tables from a source database to Snowflake.

Enjoy!dbaontap

 

 

 

 

This site uses Akismet to reduce spam. Learn how your comment data is processed.