#OnTapToday, using @StreamSets to migrate data from your source database to Snowflake. In this post, I will walk through the process of configuring StreamSets to migrate data from my on-prem database to Snowflake.
WHY
You need to get some data from a source database into the Snowflake data warehouse. There were a few options at my disposal. I could have:
- done a simple export of the data to CSV and used SnowSQL to load the data to a stage and ingest it that way (see the sketch after this list)
- taken a look at two of my previous posts, Deploy node-oracledb with Official Instant Client Image and Connect to Snowflake with Node.js Driver, to figure out some cool programmatic way to get this data over, or
- looked at something different, such as StreamSets
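For reference, the first option (CSV plus SnowSQL) would have looked roughly like this. It is only a minimal sketch run from SnowSQL; the file name, stage name, and target table are illustrative and not part of the StreamSets flow described below.
create or replace stage manual_stage;                            -- internal stage to hold the exported file
put file:///tmp/hr_export.csv @manual_stage auto_compress=true;  -- upload the CSV from the client machine
copy into db_regions
  from @manual_stage/hr_export.csv.gz
  file_format = (type = csv skip_header = 1);                    -- load the staged file into an existing table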
StreamSets operates on a hybrid open-source model (HOSS): StreamSets Data Collector is open source, released under the Apache 2.0 license, while the remainder of the platform's source code is proprietary and not available to customers.
In this case, I chose StreamSets.
SETUP
The setup was actually quite involved on my MacBook, as some OS-level hacks were necessary to adjust the Java heap size. Rumor has it this is only an issue on macOS and that Linux installs are fine. Fortunately, the workarounds can be found with a few Google searches.
Here are the installation instructions:
- Download, install, and run Data Collector. Remember, on a Mac there are some additional OS configurations necessary to get around the heap size issue. I’ll try to locate the link that helped me out and update the post.
- Launch the Data Collector via the URL presented after the service starts and click the Package Manager icon.
- In the search box, enter “Snowflake”, select Snowflake Enterprise Library 1.1.0 from the list, click the Install icon, and accept the Terms of Service. Once the installation completes, follow the instructions to restart Data Collector.
- After the Data Collector restarts, we can create the data migration pipeline.
Since this is a simple multi-table migration, we can build the pipeline by connecting a single origin to a single destination. In this case, we will use the JDBC Multitable Consumer origin and the Snowflake destination.
- From the Data Collector GUI, select Origins in the upper right corner and enter JDBC in the search box. Double-click the JDBC Multitable Consumer icon to place it in the workspace. In this example, to make things simpler, I will be connecting to an existing Oracle database running in my basement.
- Next, select Destinations from the same dropdown in the upper right and search for Snowflake. Again, double-click the Snowflake icon to place it on the workspace.
- Finally connect the JDBC Origin and the Snowflake Destination by clicking and dragging a connector between the two.
As you can see, there are a number of tabs associated with the two stages of this migration pipeline. Complete these with the appropriate values.
JDBC Multitable Consumer:
JDBC
- JDBC Connection String: jdbc:oracle:thin:user/password@10.1.1.45:1521/snowflake_pdb
- Use Credentials:
- Queries Per Second: 10
- Number of Threads: 1
- Per Batch Strategy: Switch Tables
- Max Batch Size (Records): 1000
- Batches from Result Set: -1
- Result Set Cache Size: -1
- Max Clob Size (Characters): 1000
- Max Blob Size (Bytes): 1000
- Number of Retries on SQL Error: 0
- Data Time Zone: +00:00 UTC (UTC)
- No-more-data Event Generation Delay (seconds): 0
- Quote Character: NONE
- Convert Timestamp To String:
- Fetch Size: 1000
- Additional JDBC Configuration Properties:
Tables:
- Schema: SNOWFLAKE
- Table Name Pattern: DB_%
- Table Exclusion Pattern:
- Schema Exclusion Pattern:
- Override Offset Columns:
- Initial Offset:
- Enable Non-Incremental Load:
- Multithreaded Partition Processing Mode: Off
- Offset Column Conditions:
Leave all other tabs with any defaults
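The Schema and Table Name Pattern above are what select the source tables. If you want to preview which Oracle tables will match before running the pipeline, a quick check against the data dictionary looks something like this (a sketch; the owner value simply mirrors the schema configured above):
select table_name
from all_tables
where owner = 'SNOWFLAKE'      -- the schema configured in the Tables tab
  and table_name like 'DB_%'   -- the same LIKE pattern as the Table Name Pattern setting
order by table_name;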
Snowflake
Snowflake Connection Info
- Snowflake Region: AWS - US West (default)
- Account: •••••••••
- User: •••••••
- Password: •••••••••
- Connection Pool Size: 1
Snowflake
- Warehouse: DB_TESTWH
- Database: STREAMSETS_DEMO
- Schema: SSDEMO
- Table: ${record:attribute('jdbc.tables')}
- Use Snowpipe:
- Upper Case Schema & Field Names: X
- Data Drift Enabled: X
- Table Auto Create: X
- Create New Columns as VARCHAR: X
Staging
- Stage Location: Snowflake Internal Stages
- Stage Database: STREAMSETS_DEMO
- Stage Schema: SSDEMO
- Snowflake Stage Name: ssdemo_stage
- Purge Stage File After Ingesting: X
- Local File Prefix: sdc
- Compressed File: X
Leave all other tabs with any defaults
Now let’s switch over and set up the Snowflake data warehouse. If you do not have an account, you can visit this link to start a free trial. Once you have successfully logged in, simply copy and paste the following set of SQL statements into the worksheet.
use role sysadmin;                            -- a role to create the necessary database objects
create or replace database streamsets_demo;  -- create a database for the StreamSets demo
use database streamsets_demo;
create or replace schema ssdemo;              -- create a schema for the StreamSets demo
use schema ssdemo;
use warehouse DB_TESTWH;                      -- use a warehouse, in this case an XS
create or replace stage ssdemo_stage;         -- create an internal stage to land the data from Oracle prior to loading the tables
-- STOP HERE. The following lines will be run after the migration.
-- Get row counts from the migrated tables
select 'DB_COUNTRY', count(*) from DB_COUNTRY
union
select 'DB_DEPARTMENTS', count(*) from DB_DEPARTMENTS
union
select 'DB_EMPLOYEES', count(*) from DB_EMPLOYEES
union
select 'DB_JOBS', count(*) from DB_JOBS
union
select 'DB_JOB_HISTORY', count(*) from DB_JOB_HISTORY
union
select 'DB_REGIONS', count(*) from DB_REGIONS;
-- This demo is based in part on the Oracle HR schema. Clean up when finished:
DROP TABLE DB_COUNTRY;
DROP TABLE DB_DEPARTMENTS;
DROP TABLE DB_EMPLOYEES;
DROP TABLE DB_JOBS;
DROP TABLE DB_JOB_HISTORY;
DROP TABLE DB_REGIONS;
We are now ready to run the StreamSets job. You will notice that we did not create any tables in the Snowflake data warehouse. One of the options we selected in the StreamSets configuration was to automatically create the tables we are migrating.
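For illustration, with Table Auto Create, Data Drift Enabled, and Create New Columns as VARCHAR switched on, the destination issues DDL along these lines for each incoming table. This is only a sketch; the column names are assumed from the HR-style source schema, and the exact DDL StreamSets generates may differ.
create table if not exists STREAMSETS_DEMO.SSDEMO.DB_REGIONS (
  REGION_ID   varchar,   -- every new column is created as VARCHAR per the configuration above
  REGION_NAME varchar
);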
MIGRATE
Back on the StreamSets Data Collector interface, click Start in the upper right.
The process will begin migrating the data from the source database, in this case Oracle, to Snowflake.
I happen to know that there are 193 records across the tables that match the criteria (table names starting with DB_). Can you find where we did that?
After a few seconds, the tables have been created and the data has been migrated. Let’s check it out.
A review of the History tab in Snowflake shows the process StreamSets executes to:
- Create the required file format
- Create the table
- Use the PUT command to load the data into the internal stage
- Execute the COPY command to populate the target table
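For example, the file format, stage, and load steps look roughly like the following. This is an illustrative sketch only; the format name, staged file name, and exact options StreamSets generates will differ.
create file format if not exists SSDEMO.SDC_CSV_FORMAT type = csv field_optionally_enclosed_by = '"';  -- hypothetical format name
put file:///tmp/sdc-generated-file.csv @ssdemo_stage;   -- local file name uses the "sdc" prefix from the Staging tab
copy into DB_EMPLOYEES
  from @ssdemo_stage
  file_format = (format_name = 'SSDEMO.SDC_CSV_FORMAT')
  purge = true;                                         -- matches "Purge Stage File After Ingesting"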
In the Snowflake worksheet execute the following:
select 'DB_COUNTRY', count(*) from DB_COUNTRY
union
select 'DB_DEPARTMENTS', count(*) from DB_DEPARTMENTS
union
select 'DB_EMPLOYEES', count(*) from DB_EMPLOYEES
union
select 'DB_JOBS', count(*) from DB_JOBS
union
select 'DB_JOB_HISTORY', count(*) from DB_JOB_HISTORY
union
select 'DB_REGIONS', count(*) from DB_REGIONS;
This should give you row counts for each table that was migrated.
Here is the same statement run from SQL Developer. All counts match, so the migration was successful.
This was a very simple example of using StreamSets to migrate multiple tables from a source database to Snowflake.