Data Warehouse migration from Teradata to Google BigQuery

Hil Liao
4 min read · May 22, 2023

Recently, I did a few database migration test runs to refresh my memory of the large-scale cloud migration projects I worked on in the past, which included data warehouse migrations. Google Cloud's Database Migration Service has become quite mature: it took me about 2 hours to set up a migration from self-hosted PostgreSQL to Cloud SQL for PostgreSQL. I also configured Datastream to capture CDC from the same self-hosted PostgreSQL on KVM. I believe Database Migration Service (DMS) and Datastream use similar technology under the hood, but by default DMS does not replicate DDL changes such as adding a column or a table, while Datastream does.

Skipping the details of traditional RDBMS migrations, I want to write about migrating a data warehouse that's common among enterprises to Google Cloud. Teradata is only a mid-cap company with about 8,500 employees. I was surprised to find their data warehouse product so antiquated. I talked to their recruiting team in November 2016 about a software engineering job; I challenged the validity of some of their interview questions, and the interviewer came across as arrogant and self-important. At the time, I only knew the Teradata employees I met were distasteful; I did not know their products were even worse.

Get Set Up

The 1st step was already hard, just like their product's usability. You'd need to download a trial version of the Teradata data warehouse called Vantage Express 17.2. I had to create an account to download the binary installation files. The closest download types were VMware .vmdk files and VirtualBox .ova files. I tried converting them to KVM's .qcow2 format with the qemu-img command, resulting in 3 .qcow2 files for a KVM instance. The end result was that the KVM instance could start, but watch pdestate -a never reached the started state. I had to fall back to the VirtualBox installation method, which has been known for terrible performance. I followed the link to install on a KVM instance with 4 vCPUs and 4 GB of RAM. The VirtualBox VM had 3 vCPUs and 3 GB of RAM and was based on SuSE Linux, which I believe is unpopular and poorly supported. There was a Start Teradata button on the desktop to launch the database in case its state was stuck in DOWN after a reboot.
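For reference, the conversion I attempted looked roughly like this. The .vmdk file names below are placeholders, not the exact names Teradata ships:

```shell
# Convert each VMware disk image to qcow2 for KVM.
# Vantage Express ships several .vmdk disks; file names here are placeholders.
for vmdk in disk1.vmdk disk2.vmdk disk3.vmdk; do
  qemu-img convert -f vmdk -O qcow2 "$vmdk" "${vmdk%.vmdk}.qcow2"
done
```

The conversion itself succeeded; it was the converted guest that never brought the database up.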

Teradata in Virtualbox on KVM

I configured KVM to use a network bridge (see Step 3: Create network bridge (br0)) to make the KVM instance visible on the LAN. I then had to configure VirtualBox to bridge to the KVM instance's network interface to make Teradata visible on the LAN. See the image above for the network adapter configuration, where enp1s0 is the network interface on the KVM instance. Record the IP of the VirtualBox Teradata VM via the command ip a. You can ssh directly to it from other hosts on the LAN to test.
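The bridge setup from the linked guide can be sketched with plain ip commands. The interface name enp1s0 is from my machine, and the guide uses its own tooling, so treat this as an illustration only:

```shell
# Create a bridge, enslave the physical NIC, and bring the bridge up.
# Running this over ssh will drop your connection; use a local console.
sudo ip link add name br0 type bridge
sudo ip link set enp1s0 master br0
sudo ip link set br0 up
# The bridge, not the NIC, should now carry the host's IP (via DHCP or static config).
```

Without persisting this in your network manager's config, the bridge disappears on reboot.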

Somehow Teradata Studio Express failed to start on the VirtualBox VM. I tried installing it on the KVM instance, but Teradata only provides rpm packages, and I was running Ubuntu 23.04, so it couldn't install. I already had a Windows 10 KVM instance ready, so I installed Teradata Studio Express there.

This shows that other hosts on the LAN can connect to the Teradata VirtualBox VM's IP, which makes the steps in Teradata to BigQuery migration viable. The next section covers the migration steps using the BigQuery Data Transfer Service.

Start the migration

When I created the data transfer config, I set the schedule to on-demand. One catch: you need to start the transfer manually after launching the migration agent. Otherwise, you'd see empty output after entering the default password dbc when the migration agent starts. I chose a JDBC connection instead of tbuild to extract data.
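For context, the --configuration-file the agent reads is JSON. The skeleton below is a hedged sketch of the kind of fields involved; the key names and values here are illustrative placeholders (the agent can generate a real file for you, and the transfer-configuration id comes from the transfer config created in the console):

```json
{
  "transfer-configuration": {
    "project-id": "my-project",
    "location": "us",
    "id": "my-transfer-config-id"
  },
  "source-type": "teradata",
  "teradata-config": {
    "connection": {
      "host": "192.168.1.50",
      "port": 1025
    },
    "local-processing-space": "/home/hil/teradata-extract",
    "max-local-storage": "50GB",
    "use-tpt": false
  }
}
```

Port 1025 is Teradata's default listener, and use-tpt set to false corresponds to the JDBC extraction path I chose over tbuild.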

java -cp /mnt/1tb/software/teradata/TeradataToolsAndUtilitiesBase__ubuntu_x8664.17.20.14.00/terajdbc4.jar\
:/mnt/1tb/software/teradata/mirroring-agent.jar com.google.cloud.bigquery.dms.Agent \
--configuration-file=/home/hil/Documents/teradata-migration-agent-config.json
Enter username: dbc
Enter password:

After entering the password, I saw nothing for a minute. Then I manually started the transfer by clicking the RUN TRANSFER NOW button. The gray status turned green on the transfer run, as in the screenshot below:

Once you start the transfer, the command shows output like the following:

Waiting for transfer run..Agent ID: 6156169f-51ff-49d9-a92d-b1826dc64e28
Listing tables in [HR] databases - DONE
Extracting table 'HR.orders' - DONE
Extracting table 'HR.Employees' - DONE
Waiting for transfer run..

The last line means the agent is waiting for the next transfer run to start. You can press Ctrl+C to break out of it if you don't want it to keep running. Verify that the dataset chosen in the configuration has the tables orders and Employees. That's a wrap.
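To double-check from the command line, something like this should work; the dataset name is a placeholder for whatever you picked in the transfer config:

```shell
# List the migrated tables and spot-check a row count.
bq ls my_teradata_dataset
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS n FROM `my_teradata_dataset.Employees`'
```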
