Enabling Change Data Capture (CDC) with Solace and StreamSets

Home > Blog > For Developers

There’s quite a bit of buzz about change data capture (CDC) these days! I have been in numerous client meetings where CDC has come up lately, way more than before. So, what is CDC, why might you be interested in it, and what does it have to do with event-driven architecture? I’ll explain!

Introducing Change Data Capture (CDC)

A popular use case for CDC is replicating databases across datacenters. This is becoming increasingly common as companies look to adopt cloud and a hybrid-cloud environment. In this way CDC is closely linked to event-driven architecture (EDA) which is why CDC is becoming more common as businesses realize the importance of event-enabling their architecture to meet customer demand and improve overall customer experience with EDA. It is in their interest to ensure data is captured, analyzed, and acted upon in real-time. Gone are the days of batch processing where valuable data would sit for hours or days and lose its value over time. In this brave new world, data is analyzed in real-time while it is most valuable. A flight passenger has no use for an alert notifying them of a flight delay 5 hours later.

Coming back to CDC: It is a design pattern used to identify changes in data so that actions can be taken on those changes in real time. Databases are everywhere and we are all used to storing plenty of data in databases where it rests. We frequently query this data as well as watch it change over time. Records get inserted, deleted and updated. And while the new information provides very valuable information, there is additional value in tracking changes to these database objects in real-time.

Imagine a Customers table which tracks all of the active customers who are currently using a software company’s products. As more and more customers purchase more products over time, new rows are inserted and this table grows. Occasionally, customer information needs to be updated, such as their contact information, which leads to rows being updated in the Customers table.

While this is all very valuable information, it is still static and shows you a point-in-time view of the. You can query the table to get the contact information for a customer, but it doesn’t tell you when and how this information has changed over time. This is where CDC comes in. Its goal is to event-enable databases by turning all databases changes into events. Bingo!

With events, we can respond to them in real time and build downstream pipeline that can react to these events. For example, as soon as a new customer is added, we would like to send a thank-you email to them. This can be done via CDC. A new row is inserted for the new customer which is turned into an event, published to an event broker, subscribed to by some number of downstream processes, one of which is responsible for sending a thank-you email.

Let’s see what this looks like with an architecture diagram!

In this diagram, we have different databases where our records might be stored. We can event-enable these databases using CDC connectors provided by StreamSets. StreamSets is an enterprise data integration platform with multiple CDC connectors to databases such as Microsoft SQL Server, Oracle, MySQL, and PostgreSQL. StreamSets will generate events and publish them to Solace’s PubSub+ Event Broker.

PubSub+ is an enterprise grade event broker widely deployed by companies across industries to event-enable their systems. It supports open APIs and multiple protocols which makes it extremely easy to integrate with other technologies without any additional proxies. Once an event is published to a PubSub+ event broker, it can be independently consumed by multiple consumers. You can have a microservice using Java API to process the event, enrich it and then write it to Salesforce. Events can also be pushed out to downstream services such as AWS’s API Gateway via REST webhooks (all natively built into Solace PubSub+) and then consumed by additional AWS services such as Lambda. Finally, you can leverage Apache Spark for analytics.

Setting up the Components for CDC + EDA

Now that we know how the different components work, let’s test it out! I will show how you can event-enable your Microsoft SQL Server leveraging Streamsets’ CDC connector and Solace PubSub+ Event Broker.

For this demo, I chose StreamSets as my CDC source, but this is one of several options for CDC such as Qlik Replicate, Oracle Golden Gate, Striim, and Debezium, to name a few.

Microsoft SQL Server

I set up a dev instance of MS SQL Server on an AWS EC2 instance via docker. I followed instructions from here. Basically, set up Docker and run the following commands:

docker pull mcr.microsoft.com/mssql/server:2019-latest

docker run -d --name example_sql_server -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Strong.Pwd-123' -e MSSQL_AGENT_ENABLED=True -p 1433:1433 mcr.microsoft.com/mssql/server:2019-latest

Once you have the docker container running, you will need to install MS SQL CLI. There are different ones available for the OS you are using. For an EC2 instance, you can install the CLI using this command:

yum install mssql-tools unixODBC-devel

Once it’s installed, you can log in:

sqlcmd -U sa -P Strong.Pwd-123 -H <hostname>

Next, we will create a database called demo and a table called Persons.

1> create database demo; 
2> go 
1> CREATE TABLE Persons ( 
2>.   PersonID int, 
3>    LastName varchar(255), 
4>    FirstName varchar(255), 
5>    Address varchar(255), 
6>    City varchar(255) 
7> );

Now that the db and table have been created, we need to enable them for CDC so that MS SQL Server knows we want changes to be tracked. You have to enable CDC at database level and then at table level.

# CDC for DB

USE demo
GO
EXEC sys.sp_cdc_enable_db
GO

# CDC for Persons table
USE demo
GO
EXEC sys.sp_cdc_enable_table @source_schema = N'dbo',
@source_name = N'Peresons',
@role_name = NULL,
@filegroup_name = NULL,
@supports_net_changes = 0
GO

That’s it for MS SQL for now. Next, we move on to StreamSets.

StreamSets

There are different ways to deploy StreamSets. For this demo, I am using their free-tier SaaS offering which you can sign up for here. There is a lot of tutorials available to learn how you can build a pipeline using different input sources and destinations. You can even use components to transform payloads. For our demo, though, we will keep it simple and publish the event as it is to Solace PubSub+.

Click on Set Up >>Deployments and create a new Deployment with the following additional libraries: JMS, and SQL Server 2019 Big Data Cluster Enterprise Library. Follow rest of the instructions to get to the docker installation instructions. For example:

docker run -d -e STREAMSETS_DEPLOYMENT_SCH_URL=https://na01.hub.streamsets.com -e STREAMSETS_DEPLOYMENT_ID=b62044de-16b8-42e5-9d68-c2ed632e86b0:7bc55376-2098-11ec-a5fe-ffa35ee1d0b0 -e STREAMSETS_DEPLOYMENT_TOKEN=eyJ0eXAiOiJKV1QiLCJhbGciOiJub25lIn0.eyJzIjoiYTJjYmI4YjAzMTJhNTFmZWIxNGEzMTNjMzhkOWE2ZDg5YzljZDk0MzYzOTM5ZGFkZjAzNjFiYmVmNDQ3MDFlMjFlMDU4MjJjZGEwNjI0ZWQ0MTg2Y2M5ODA1YjhhN2M0NjVmOTEzOWIyMDBmMmJiMmQ2MGEwN2FlNmIwMmM2YzciLCJ2IjoxLCJpc3MiOiJuYTAxIiwianRpIjoiN2Q4ZmJiYTEtYWUyYS00NDhlLWIxNTYtZWU3ZWIwZmI4MTA2IiwibyI6IjdiYzU1Mzc2LTIwOTgtMTFlYy1hNWZlLWZmYTM1ZWUxZDBiMCJ9. streamsets/datacollector:4.1.0

Run the docker command on your local machine.

Once you have it running, go back to StreamSets and click on Set Up >> Engines. You will see an engine running here:

Click on that engine and click on External Resources on the following page. This is where you can upload additional libraries/resources for your data collector. We will need to upload Solace’s JMS library here to be able to use StreamSet’s JMS Producer.

Download Solace’s JMS API from their downloads page, unzip it, and navigate to lib directory. You will find the necessary jars there.

Go back to StreamSets, click on External Libraries and click on the + sign.

Select JMS (streamsets-datacollector-jms-lib) from the dropdown menu next to Stage Library.

Next, upload all the jars that you just downloaded.

You will be asked to restart the engine/collector for changes to take effect.

Now it’s time to build a pipeline. Go to Build and click on Pipelines. Create a new pipeline. You can then add components to it. Here is what your pipeline needs to look like:

It has two components: SQL Server CDC Client and JMS Producer.

Configure the SQL Server CDC Client with the following properties (keep others as default):

JDBC:
Connection: None
JDBC Connection String: jdbc:sqlserver://<host>:<port>;database=demo
Fetch Size: 255
Max Batch Size (Records): 255

CDC:
Capture Instance Name: %

Credentials:
Username: sa
Password: Strong.Pwd-123

Next, configure the JMS Producer:

JMS: 
Connection: None 
JMS Initial Context Factory: com.solacesystems.jndi.SolJNDIInitialContextFactory 
JNDI Connection Factory: /jms/cf/default 
JMS Provider URL: tcp://<pubsub_broker_host>:55555 
JMS Destination Name: cdc/demo/sqlserver 
JMS Destination Type: Topic 

Additional JMS property: 
Solace_VPN_NAME: <pubsub_vpn_name> 

Credentials: 
Username: <pubsub_username> 
Password: <pubsub_password> 

Data Format: 
Data Format: JSON 
JSON Content: Multiple JSON Objects

Now that our pipeline is ready, it is time to set up our Solace PubSub+ Event Broker.

Solace PubSub+ Event Broker

Just like StreamSets, there are numerous ways to deploy PubSub+ Event Broker. Pick the one that’s the easiest for you. I prefer to spin up a free instance using my trial account on PubSub+ Cloud. You can sign up for an account here.

Once you have a service running, go to the “Try Me!” tab and connect to the Subscriber app. Here we will subscribe to the topic that our JMS Producer was configured to publish events to in the previous step: cdc/demo/sqlserver.

PubSub+ supports dynamic hierarchical topics which allow subscribers to filter events using wildcards. For example, I might have multiple CDC sources (Oracle, MS SQL, etc.) publishing to different topics such as cdc/demo/. I might have one downstream subscriber subscribing to cdc/*/sqlserver and another subscribing to cdc/*/oracle. You can learn more about Solace topics here.

Demo Time!

Phew, now that we have everything set up, it’s demo time! Let’s run our StreamSets pipeline by clicking on Test Run on the top-right side and then clicking on Start Pipeline.

Now, go back to your SQL Server instance and run a SQL statement to insert a row.

Insert into Persons Values ('007', 'Bond', 'James', 'Main St', 'London')

(Side note: I cannot wait to watch the latest Bond movie No Time To Die!)

Soon after running that statement, you will see some data in the Realtime Summary section of StreamSets.

It shows that there was 1 Input and 1 Output which means our change (inserting row) was picked up by our pipeline! Awesome!

Now, let’s go to our PubSub+ broker and see if our Subscriber picked up a message. And, of course, it did!

2021-09-29 18:03:20:847 [Topic cdc/demo/sqlserver]
Delivery Mode: Direct
Sender Timestamp: 2021-09-29 18:03:20:840
Priority: 4
Destination: [Topic cdc/demo/sqlserver]
AppMessageID: ID:172.17.0.3936317c32cd41500:7
SendTimestamp: 1632953000840 (Wed Sep 29 2021 18:03:20 GMT-0400 (Eastern Daylight Time))

Class Of Service: COS1
DeliveryMode: DIRECT
Message Id: 2

XML:	                  len=88	
7b 22 50 65 72 73 6f 6e   49 44 22 3a 37 2c 22 4c   {"PersonID":7,"L
61 73 74 4e 61 6d 65 22   3a 22 42 6f 6e 64 22 2c   astName":"Bond",
22 46 69 72 73 74 4e 61   6d 65 22 3a 22 4a 61 6d   "FirstName":"Jam
65 73 22 2c 22 41 64 64   72 65 73 73 22 3a 22 4d   es","Address":"M
61 69 6e 20 53 74 22 2c   22 43 69 74 79 22 3a 22   ain.St","City":"
4c 6f 6e 64 6f 6e 22 7d                             London"}

As you can see in the payload, it picked up the new row that was inserted into the database. You can use this event as a trigger for numerous downstream processes. One such process can be one that sends a Welcome email to new customers. I am sure James Bond would love that!

Conclusion

I hope I’ve helped you understand how you can leverage StreamSets and Solace PubSub+ to build a CDC pipeline that event-enables your databases! If you have any questions or comments about this blog post or topic, post them to the Solace Developer Community.

Explore other posts from category: For Developers

Himanshu Gupta

As one of Solace's solutions architects, Himanshu is an expert in many areas of event-driven architecture, and specializes in the design of systems that capture, store and analyze market data in the capital markets and financial services sectors. This expertise and specialization is based on years of experience working at both buy- and sell-side firms as a tick data developer where he worked with popular time series databases kdb+ and OneTick to store and analyze real-time and historical financial market data across asset classes.

In addition to writing blog posts for Solace, Himanshu publishes two blogs of his own: enlist[q] focused on time series data analysis, and abitdeployed.com which is about general technology and latest trends. He has also published code samples at GitHub and kdb+ tutorials on YouTube!

Himanshu holds a bachelors of science degree in electrical engineering from City College at City University of New York. When he's not designing real-time market data systems, he enjoys watching movies, writing, investing and tinkering with the latest technologies.

See all posts written by Himanshu Gupta

Join Our Developer Community

Join the Solace Developer Community to discuss and share PubSub+ API hints, new features, useful integrations, demos, and sample code!

JOIN THE DISCUSSION