CDCSDK Server Beta
Yugabyte CDCSDK Server is an open source project that provides a streaming platform for change data capture from YugabyteDB. The server is based on Debezium and uses the debezium-yugabytedb-connector to capture change events. It supports a YugabyteDB instance as the source and the following sinks:
- Kafka
- HTTP REST Endpoint
- AWS S3
Basic Architecture
Engine
A Debezium Engine implementation is the unit of work. It implements a pipeline consisting of a source, a sink, and simple transforms. The only supported source is YugabyteDB. The source is assigned a set of tablets that are polled at a configurable interval. An engine's workflow is as follows:
- Connect to the CDCSDK stream.
- Get a list of tables and filter based on the include list.
- Get and record a list of tablets.
- Poll tablets in sequence every polling interval.
Server
The CDCSDK server hosts a Debezium Engine. The implementation is based on the Debezium Server. It uses the Quarkus framework and extensions to provide a server shell, metrics, and alerts. By default, a server runs one Engine implementation in a thread. A server can also run in multi-threaded mode, in which multiple engines each run in their own thread. The server splits tablets into groups in a deterministic manner, and each group of tablets is assigned to an Engine.
Quick Start
Create a CDCSDK stream in YugabyteDB
Use yb-admin to create a CDC stream. If successful, the operation returns the stream ID; note the ID, as it is used in later steps.
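For example, the following invocation creates a stream and prints its ID. This is a sketch that assumes a local YB-Master at 127.0.0.1:7100 and the yugabyte YSQL database; substitute your own master addresses and namespace:
./bin/yb-admin --master_addresses 127.0.0.1:7100 create_change_data_stream ysql.yugabyte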
CDC Stream ID: d540f5e4890c4d3b812933cbfd703ed3
Download and run CDCSDK Server
Download CDCSDK Server from the GitHub project Releases page. Each release includes a tar.gz archive of CDCSDK Server.
The archive has the following layout:
cdcsdk-server
|-- conf
|-- debezium-server-<CDCSDK-VERSION>.Final-runner.jar
|-- lib
|-- run.sh
Unpack and run the application
export CDCSDK_VERSION=<x.y.z>
wget https://github.com/yugabyte/cdcsdk-server/releases/download/v${CDCSDK_VERSION}/cdcsdk-server-dist-${CDCSDK_VERSION}.tar.gz
tar xvf cdcsdk-server-dist-${CDCSDK_VERSION}.tar.gz
cd cdcsdk-server
# Configure the application. More details in the next section
touch conf/application.properties
# Run the application
./run.sh
Configuration
The main configuration file is conf/application.properties, which includes the following sections:
- cdcsdk.source is for configuring the source connector.
- cdcsdk.sink is for the sink system configuration.
- cdcsdk.transforms is for the configuration of message transformations.
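For illustration, a minimal application.properties sketch combining the sections might look like the following. The values shown (host, port, sink type, transform) come from the examples later on this page; note that the transform property itself is cdcsdk.server.transforms, as listed under Server configuration:
# Source connector settings (cdcsdk.source.* prefix)
cdcsdk.source.database.hostname=127.0.0.1
cdcsdk.source.database.port=5433
# Sink settings (cdcsdk.sink.* prefix)
cdcsdk.sink.type=kafka
# Transform settings
cdcsdk.server.transforms=FLATTEN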
Configure using environment variables
Using environment variables for configuration can be useful when running in containers. The rule of thumb is to convert the keys to upper case and replace each period (.) with an underscore (_). For example, change cdcsdk.source.database.port to CDCSDK_SOURCE_DATABASE_PORT.
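As a sketch, the connection settings from the sample source configuration later on this page could be supplied as environment variables instead:
export CDCSDK_SOURCE_DATABASE_HOSTNAME=127.0.0.1
export CDCSDK_SOURCE_DATABASE_PORT=5433
export CDCSDK_SOURCE_DATABASE_USER=yugabyte
export CDCSDK_SOURCE_DATABASE_PASSWORD=yugabyte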
Server configuration
Property | Default | Description |
---|---|---|
cdcsdk.server.transforms | | Transformations to apply. |
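For example, to apply the FLATTEN transform described under Record structure below:
cdcsdk.server.transforms=FLATTEN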
Additional configuration:
Property | Default | Description |
---|---|---|
quarkus.http.port | 8080 | The port on which CDCSDK Server exposes the MicroProfile Health endpoint and other status information. |
quarkus.log.level | INFO | The default log level for every log category. |
quarkus.log.console.json | true | Determines whether to enable the JSON console formatting extension, which disables "normal" console formatting. |
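For example, to serve the health endpoint on a different port, raise log verbosity, and switch to plain console logging (the values here are illustrative, not recommendations):
quarkus.http.port=9090
quarkus.log.level=DEBUG
quarkus.log.console.json=false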
Source configuration
The cdcsdk.source properties are the Debezium connector's configuration options with cdcsdk.source added as a prefix; any of the connector's configuration properties can be set this way. For a complete list of the Debezium configurations, see Debezium Connector for YugabyteDB configurations.
Sample configuration for a YugabyteDB instance running on a local machine:
cdcsdk.source.connector.class=io.debezium.connector.yugabytedb.YugabyteDBConnector
cdcsdk.source.database.hostname=127.0.0.1
cdcsdk.source.database.port=5433
cdcsdk.source.database.user=yugabyte
cdcsdk.source.database.password=yugabyte
cdcsdk.source.database.dbname=yugabyte
cdcsdk.source.database.server.name=dbserver1
cdcsdk.source.database.streamid=de362081fa864e94b35fcac6005e7cd9
cdcsdk.source.table.include.list=public.test
cdcsdk.source.database.master.addresses=127.0.0.1:7100
cdcsdk.source.snapshot.mode=never
Apache Kafka
The Kafka sink adapter supports pass-through configuration: all properties with the cdcsdk.sink.kafka.producer. prefix are passed to the Kafka producer with the prefix removed. At least the bootstrap.servers, key.serializer, and value.serializer properties must be provided. The topic is set by CDCSDK Server.
Example Configuration:
cdcsdk.sink.type=kafka
cdcsdk.sink.kafka.producer.bootstrap.servers=<BOOTSTRAP-SERVERS>
cdcsdk.sink.kafka.producer.key.serializer=org.apache.kafka.common.serialization.StringSerializer
cdcsdk.sink.kafka.producer.value.serializer=org.apache.kafka.common.serialization.StringSerializer
Confluent Cloud
A Confluent Cloud deployment of Kafka also requires SASL/SSL configuration. Example configuration:
cdcsdk.sink.type=kafka
cdcsdk.sink.kafka.producer.bootstrap.servers=<BOOTSTRAP-SERVERS>
cdcsdk.sink.kafka.producer.key.serializer=org.apache.kafka.common.serialization.StringSerializer
cdcsdk.sink.kafka.producer.value.serializer=org.apache.kafka.common.serialization.StringSerializer
cdcsdk.sink.kafka.security.protocol=SASL_SSL
cdcsdk.sink.kafka.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='USERNAME' password='PASSWORD';
cdcsdk.sink.kafka.sasl.mechanism=PLAIN
cdcsdk.sink.kafka.producer.security.protocol=SASL_SSL
cdcsdk.sink.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='USERNAME' password='PASSWORD';
cdcsdk.sink.kafka.producer.sasl.mechanism=PLAIN
cdcsdk.sink.kafka.producer.ssl.endpoint.identification.algorithm=https
cdcsdk.sink.kafka.client.dns.lookup=use_all_dns_ips
cdcsdk.sink.kafka.session.timeout.ms=45000
cdcsdk.sink.kafka.acks=all
HTTP Client
The HTTP client streams changes to any HTTP server for additional processing, with the goal of making Debezium act as a native event source.
Property | Default | Description |
---|---|---|
cdcsdk.sink.type | | Must be set to http. |
cdcsdk.sink.http.url | | The HTTP server URL to stream events to. This can also be set by defining the K_SINK environment variable, which is used by the Knative source framework. |
cdcsdk.sink.http.timeout.ms | 60000 | The number of milliseconds to wait for a response from the server before timing out. |
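A minimal HTTP sink configuration sketch follows; the endpoint URL is a placeholder for your own server:
cdcsdk.sink.type=http
cdcsdk.sink.http.url=http://localhost:8000/events
cdcsdk.sink.http.timeout.ms=60000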
Amazon S3
The Amazon S3 Sink streams changes to an AWS S3 bucket. Only inserts are supported. The available configuration options are:
Property | Default | Description |
---|---|---|
cdcsdk.sink.type | | Must be set to s3. |
cdcsdk.sink.s3.bucket.name | | Name of the S3 bucket. |
cdcsdk.sink.s3.region | | Region of the S3 bucket. |
cdcsdk.sink.s3.basedir | | Base directory or path where the data has to be stored. |
cdcsdk.sink.s3.pattern | | Pattern to generate paths (sub-directory and filename) for data files. |
cdcsdk.sink.s3.flush.sizeMB | 200 | Trigger data file rollover on file size. |
cdcsdk.sink.s3.flush.records | 10000 | Trigger data file rollover on number of records. |
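A sample S3 sink configuration, using the hourly partitioning pattern shown under Mapping records to S3 objects; the bucket name, region, and base directory are placeholders:
cdcsdk.sink.type=s3
cdcsdk.sink.s3.bucket.name=my-cdc-bucket
cdcsdk.sink.s3.region=us-west-2
cdcsdk.sink.s3.basedir=cdc/
cdcsdk.sink.s3.pattern={YEAR}-{MONTH}-{DAY}-{HOUR}/data-{UUID}.jsonl
cdcsdk.sink.s3.flush.sizeMB=200
cdcsdk.sink.s3.flush.records=10000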
Note
The Amazon S3 Sink supports a single table at a time. Specifically, cdcsdk.source.table.include.list should contain only one table at a time. If multiple tables need to be exported to Amazon S3, set up multiple CDCSDK servers that read from the same CDC stream ID but write to different S3 locations, as in the sketch below.
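For example, two servers could export two tables from the same stream with configurations like the following; the table names and base directories are illustrative, and the stream ID is the one from the sample source configuration:
# Server 1: conf/application.properties
cdcsdk.source.database.streamid=de362081fa864e94b35fcac6005e7cd9
cdcsdk.source.table.include.list=public.orders
cdcsdk.sink.s3.basedir=cdc/orders/

# Server 2: conf/application.properties
cdcsdk.source.database.streamid=de362081fa864e94b35fcac6005e7cd9
cdcsdk.source.table.include.list=public.customers
cdcsdk.sink.s3.basedir=cdc/customers/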
Mapping records to S3 objects
The Amazon S3 Sink only supports create events in the CDC stream. It writes the payload.after fields to a file in S3. The filename in S3 is generated as ${cdcsdk.sink.s3.basedir}/${cdcsdk.sink.s3.pattern}. The pattern can contain placeholders to customize the filenames, as follows:
- {YEAR}: Year in which the sync wrote the output data.
- {MONTH}: Month in which the sync wrote the output data.
- {DAY}: Day on which the sync wrote the output data.
- {HOUR}: Hour in which the sync wrote the output data.
- {MINUTE}: Minute in which the sync wrote the output data.
- {SECOND}: Second in which the sync wrote the output data.
- {MILLISECOND}: Millisecond in which the sync wrote the output data.
- {EPOCH}: Milliseconds since epoch at which the sync wrote the output data.
- {UUID}: Random UUID string.
For example, the following pattern can be used to create hourly partitions with multiple files, each of which is no greater than 200MB:
{YEAR}-{MONTH}-{DAY}-{HOUR}/data-{UUID}.jsonl
IAM Policy
The AWS user account accessing the S3 bucket must have the following permissions:
- ListAllMyBuckets
- ListBucket
- GetBucketLocation
- PutObject
- GetObject
- AbortMultipartUpload
- ListMultipartUploadParts
- ListBucketMultipartUploads
Copy the following JSON to create the IAM policy for the user account. Change <bucket-name> to a real bucket name. For more information, see Create and attach a policy to an IAM user.
Note: This is the IAM policy for the user account and not a bucket policy.
{
"Version":"2012-10-17",
"Statement":[
{
"Effect":"Allow",
"Action":[
"s3:ListAllMyBuckets"
],
"Resource":"arn:aws:s3:::*"
},
{
"Effect":"Allow",
"Action":[
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource":"arn:aws:s3:::<bucket-name>"
},
{
"Effect":"Allow",
"Action":[
"s3:PutObject",
"s3:GetObject",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts",
"s3:ListBucketMultipartUploads"
],
"Resource":"arn:aws:s3:::<bucket-name>/*"
}
]
}
Record structure
By default, the YugabyteDB connector generates a complex JSON record with key and value information, including the payload. A sophisticated sink can use this information to generate appropriate commands in the receiving system.
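For reference, the following is a sketch of the default (unflattened) record value for an insert, following the standard Debezium change-event envelope; field contents are abbreviated:
{
  "schema": { ... },
  "payload": {
    "before": null,
    "after": { "id": ..., "first_name": ..., "last_name": ..., "email": ... },
    "source": { ... },
    "op": "c",
    "ts_ms": ...
  }
}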
Simple sinks expect simple key/value JSON objects, where the key is the column name and the value is the contents of the column. For simple sinks, set cdcsdk.server.transforms=FLATTEN. With this configuration, the server emits only the payload as simple JSON, in the following format:
{
"id":...,
"first_name":...,
"last_name":...,
"email":...
}
Operations
Topology
- A universe can have multiple namespaces.
- Each namespace can have multiple CDCSDK streams.
- Each CDCSDK stream can have multiple servers associated with it; the default is 1. The group of servers associated with a stream is called a ServerSet.
Networking
A CDCSDK Server requires access to open ports in YugabyteDB, so it must run in the same VPC (or a peered VPC) as the YugabyteDB cluster. The server also requires network access to the sink in the case of Kafka or an HTTP REST endpoint, and the appropriate credentials for writing to AWS S3.
Health checks
CDCSDK Server exposes a simple health check REST API. Currently, the health checks only ensure that the server is up and running.
Running the health checks
The following REST endpoints are exposed:
- /q/health/live: The application is up and running.
- /q/health/ready: The application is ready to serve requests.
All of the health REST endpoints return a simple JSON object with two fields:
- status: The overall result of all the health check procedures.
- checks: An array of individual checks.
The general status of the health check is computed as a logical AND of all the declared health check procedures.
curl http://localhost:8080/q/health/live
{
"status": "UP",
"checks": [
{
"name": "debezium",
"status": "UP"
}
]
}
curl http://localhost:8080/q/health/ready
{
"status": "UP",
"checks": [
]
}