Data Synchronization Patterns
Table of Contents
- Introduction
- Project Structure
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Dependency Analysis
- Performance Considerations
- Troubleshooting Guide
- Conclusion
- Appendices
Introduction
This document describes the data synchronization patterns used across external integrations in the BI project, focusing on an event-driven architecture powered by Apache Kafka. It covers:
- incremental synchronization strategies using timestamp-based filtering and change data capture
- idempotent processing to handle duplicate messages
- backpressure handling and batch processing optimizations
- message serialization and partitioning strategies
- consumer group management
- dead letter queue patterns for failed message processing
- monitoring approaches for synchronization latency, throughput, and data freshness
It also documents rollback and recovery procedures for synchronization failures.
Project Structure
The synchronization stack is built around reusable Kafka components in the shared library and specialized clients for downstream data loading (StarRocks Stream Load). The key areas are:
- Kafka producer/consumer abstractions and configuration
- Transactional producer support for atomic writes
- Retry and dead letter queue handling
- Multi-topic consumer orchestration
- StarRocks Stream Load integration for bulk ingestion
Core Components
- Kafka Producer: Provides synchronous and asynchronous messaging with batching, compression, and pluggable partitioning strategies. Supports JSON serialization helpers and header injection for cross-cutting concerns.
- Kafka Consumer: Implements continuous consumption with manual and automatic commit modes, batch consumption, graceful shutdown, and health checks. Includes integrated retry logic and dead letter queue forwarding.
- Transactional Producer: Enables idempotent and atomic writes across steps or topics, supporting exactly-once semantics where required (in combination with read-committed consumers).
- Retry and DLQ: Configurable exponential backoff with jitter, maximum retry limits, and automatic forwarding to a dead letter topic for poisoned messages.
- Multi-Topic Consumer: Manages multiple topic subscriptions under a single consumer group or per-topic groups, enabling scalable, isolated processing.
- StarRocks StreamLoad Clients: Serialize and bulk-load data into StarRocks via HTTP Stream Load, with robust response handling and error reporting.
Section sources
- [producer.go]
- [consumer.go]
- [consume.go]
- [producer_tx.go]
- [multi_consumer.go]
- [streamload.go]
Architecture Overview
The system uses Kafka as the central event bus to decouple external data sources from processing systems. Producers emit domain events (e.g., order changes, inventory updates) enriched with metadata. Consumers subscribe to topics, apply idempotent processing, and persist results to StarRocks via Stream Load. Transactions ensure atomicity for multi-step operations, while retries and DLQs handle transient failures and poison messages.
Detailed Component Analysis
Kafka Producer: Serialization, Partitioning, and Compression
- Serialization: Provides helpers for JSON payloads and raw byte keys/values, plus header support for cross-cutting attributes (e.g., correlation IDs, tenant info).
- Partitioning: Uses configurable balancers (default round-robin) to distribute messages across partitions.
- Compression: Supports Snappy/LZ4/Gzip to optimize network bandwidth and storage overhead.
- Batching: Tunable batch size, bytes, and timeout to balance latency and throughput.
- Async/Sync: Optional async mode for fire-and-forget scenarios; sync mode for acknowledgments.
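The serialization and header-injection side of the producer can be sketched without any broker wiring. The snippet below is a minimal stdlib-only model, not the project's actual producer: the `Message`/`Header` types and the `x-correlation-id` header name are illustrative stand-ins for whatever the shared library defines.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Header mirrors a Kafka record header (key/value pair).
type Header struct {
	Key   string
	Value []byte
}

// Message is a minimal stand-in for a produced Kafka record.
type Message struct {
	Key     []byte
	Value   []byte
	Headers []Header
}

// BuildJSONMessage serializes the payload as JSON and attaches a
// correlation-ID header so consumers can trace the event end to end.
func BuildJSONMessage(key string, payload any, correlationID string) (Message, error) {
	value, err := json.Marshal(payload)
	if err != nil {
		return Message{}, err
	}
	return Message{
		Key:   []byte(key),
		Value: value,
		Headers: []Header{
			{Key: "x-correlation-id", Value: []byte(correlationID)},
		},
	}, nil
}

func main() {
	msg, err := BuildJSONMessage("order-42",
		map[string]any{"order_id": 42, "status": "paid"}, "req-123")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(msg.Value), msg.Headers[0].Key)
}
```

Keying by entity ID (here the order ID) is what lets the balancer keep all events for one entity on one partition.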
Kafka Consumer: Manual/Auto Commit, Batch Consumption, and Graceful Shutdown
- Modes: Automatic commit (ReadMessage) vs manual commit (FetchMessage + CommitMessages); committing only after successful processing yields at-least-once delivery, which idempotent handlers turn into effectively-once processing.
- Batch consumption: Collects messages until size or timeout thresholds, then invokes a batch handler.
- Health and diagnostics: Ping connectivity, lag inspection, and stats collection.
- Graceful shutdown: Captures OS signals, cancels contexts, and closes consumers after in-flight work completes.
Transactional Producer: Idempotency and Exactly-Once Semantics
- Requires idempotent producer configuration and explicit transaction lifecycle management.
- Supports Begin/Commit/Abort transactions and executing functions within a transaction boundary.
- Ensures atomic writes across multiple steps or topics.
Retry and Dead Letter Queue: Resilience and Poison Message Handling
- Retry policy: Exponential backoff with jitter, configurable max attempts.
- DLQ forwarding: On max retries exceeded, messages are forwarded to a dedicated DLQ topic with original topic and error headers.
- Configurable per-consumer and per-batch consumption.
Multi-Topic Consumer: Scalable Orchestration
- Creates independent consumers per topic with shared or per-topic group IDs.
- Runs handlers concurrently and propagates the first error or context cancellation.
- Provides aggregated stats and graceful shutdown across all consumers.
StarRocks Stream Load Integration: Bulk Persistence
- Serializes payloads (JSON/CSV) and performs HTTP Stream Load to StarRocks.
- Handles response parsing, success/failure detection, and logging.
- Optimizes ingestion by avoiding small-file problems through larger batches and appropriate timeouts.
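Response handling is the subtle part of Stream Load: StarRocks returns HTTP 200 even for logical failures, so the JSON body's `Status` field must be inspected. The sketch below models only that check; the field subset and the sample body are abbreviated, and the real client also handles labels, redirects, and auth.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamLoadResult holds the subset of a StarRocks Stream Load
// response needed for success/failure detection.
type streamLoadResult struct {
	Status           string `json:"Status"`
	Message          string `json:"Message"`
	NumberLoadedRows int64  `json:"NumberLoadedRows"`
}

// checkStreamLoad parses the HTTP response body and reports whether
// the load succeeded, surfacing the server's message on failure.
func checkStreamLoad(body []byte) (streamLoadResult, error) {
	var res streamLoadResult
	if err := json.Unmarshal(body, &res); err != nil {
		return res, fmt.Errorf("unparseable stream load response: %w", err)
	}
	if res.Status != "Success" {
		return res, fmt.Errorf("stream load failed: %s: %s", res.Status, res.Message)
	}
	return res, nil
}

func main() {
	// Abbreviated sample of a successful response body.
	body := []byte(`{"Status":"Success","Message":"OK","NumberLoadedRows":1000}`)
	res, err := checkStreamLoad(body)
	fmt.Println(res.NumberLoadedRows, err)
}
```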
Dependency Analysis
The synchronization pipeline depends on:
- Kafka producer/consumer abstractions for event transport
- Retry/DLQ logic for resilience
- Transactional producer for atomic operations
- Stream Load clients for downstream persistence
- Environment-driven configuration for brokers, credentials, and tuning
Performance Considerations
- Batching and compression: Tune batch size/bytes/timeouts and choose compression (e.g., Snappy) to balance throughput and latency.
- Partitioning: Select partition keys that distribute load evenly (e.g., entity IDs) and align with consumer scaling needs.
- Consumer tuning: Adjust min/max bytes, max wait, and commit intervals to reduce tail latency and improve throughput.
- Backpressure: Use batch consumption and manual commits to control processing rate; leverage DLQ to isolate problematic messages.
- StarRocks ingestion: Increase batch sizes and set appropriate timeouts to avoid small-file problems and improve load performance.
Troubleshooting Guide
- Health checks: Use Ping to verify broker connectivity and metadata availability.
- Lag monitoring: Inspect consumer lag and reader stats to detect backlog accumulation.
- Graceful shutdown: Ensure consumers finish processing current messages before termination.
- DLQ and retries: Verify max retry counts and DLQ topic configuration; inspect error headers for root cause.
- Recovery: For Kafka, rely on consumer offsets and ISR guarantees; for StarRocks, validate Stream Load responses and re-ingest failed batches.
Conclusion
The BI project employs a robust, event-driven synchronization architecture centered on Kafka. By combining idempotent producers, manual commit consumption, batch processing, retry/backoff, and DLQ forwarding, it achieves reliable, scalable data movement from external systems into StarRocks. Proper partitioning, compression, and consumer tuning further optimize performance. Monitoring lag, throughput, and data freshness ensures operational visibility, while transactional producers and DLQs provide strong reliability guarantees.
Appendices
Incremental Synchronization Strategies
- Timestamp-based filtering: Emit events with timestamps and filter by last-synchronized watermark to process only changed records.
- Change Data Capture (CDC): Capture insert/update/delete events from source systems and propagate them as Kafka messages for downstream processing.
- Idempotent processing: Use business keys and idempotent sinks to tolerate duplicates from retries or reprocessing.
Message Serialization and Partitioning
- Serialization: Prefer JSON for human-readable events; use raw bytes for compact binary formats.
- Partitioning: Choose partition keys that ensure even distribution and temporal locality when needed.
- Headers: Attach metadata (e.g., tenant, correlation ID) via message headers for routing and observability.
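Key-based partitioning can be illustrated with a hash-mod sketch: the same key always lands on the same partition, preserving per-entity ordering. FNV-1a here is illustrative; the actual client's hash balancer may use a different hash, so this shows the principle rather than the exact partition assignment.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key to a partition with an FNV-1a hash,
// so all events for the same entity land on the same partition.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	// The same key always maps to the same partition.
	fmt.Println(partitionFor("order-42", 8) == partitionFor("order-42", 8))
	fmt.Println(partitionFor("order-42", 8), partitionFor("order-43", 8))
}
```

This is also why repartitioning a topic breaks ordering guarantees for in-flight keys: the modulus changes, so existing keys may map to new partitions.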
Consumer Group Management
- Consumer groups enable horizontal scaling; rebalance timeouts and heartbeats must be tuned for workload stability.
- Single-partition mode is useful for deterministic ordering but limits concurrency.
Dead Letter Queue and Retry Policies
- Configure max retries and backoff; forward poisoned messages to DLQ with error context.
- Monitor DLQ traffic and implement remediation workflows.
Monitoring and Observability
- Kafka: Track lag, consumer stats, and broker health; alert on rebalances and DLQ growth.
- StarRocks: Monitor Stream Load success rates, row counts, and timing metrics.
Rollback and Recovery Procedures
- Kafka: Restore from known-good offsets; verify partition health and ISR status before resuming consumption.
- StarRocks: Re-run failed Stream Load jobs; verify data consistency and re-index if needed.
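One way to make re-running failed Stream Load jobs safe is a deterministic label: StarRocks rejects a duplicate label within its retention window, so a retried batch cannot be loaded twice. The derivation below is a sketch of that idea, assuming batches are identified by topic, partition, and offset range; the `sync_` prefix and hashing scheme are illustrative.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// batchLabel derives a deterministic Stream Load label from the batch
// identity. A retry of the same batch reuses the same label, letting
// StarRocks deduplicate the load during recovery.
func batchLabel(topic string, partition int, firstOffset, lastOffset int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d|%d|%d",
		topic, partition, firstOffset, lastOffset)))
	return fmt.Sprintf("sync_%x", sum[:8])
}

func main() {
	a := batchLabel("orders", 3, 1000, 1999)
	b := batchLabel("orders", 3, 1000, 1999) // retry of the same batch
	fmt.Println(a == b, a)
}
```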