Data Loading and Optimization
Table of Contents
- Introduction
- Project Structure
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Dependency Analysis
- Performance Considerations
- Troubleshooting Guide
- Conclusion
- Appendices
Introduction
This document provides comprehensive guidance for StarRocks data loading and optimization within the BI Analysis Platform. It covers the StreamLoad API implementation, client configuration, batch processing strategies, error handling, and the end-to-end ingestion pipeline from external sources through Kafka to StarRocks. It also documents StreamLoad options (compression, timeouts, and transaction management), bulk loading strategies, partition-aware loading, incremental update patterns, monitoring and debugging techniques, performance benchmarks, capacity planning, data quality checks, validation rules, and rollback procedures for failed loads.
Project Structure
The data loading stack is composed of:
- StreamLoad client library under bi-common for StarRocks ingestion
- Application-specific adapters that construct and submit loads
- Kafka producers/consumers under bi-common for event-driven ingestion
- Protocol models for Kafka payloads
- Configuration sources for StarRocks and Kafka
Core Components
- StreamLoad Client: HTTP-based client that submits data to StarRocks FE via the StreamLoad endpoint. Handles authentication, redirects, labels, and response parsing.
- Load Options: Functional options to configure CSV/JSON formats, column mapping, separators, strict mode, filter ratios, partitions, timeouts, and JSON paths.
- Response Model: Structured response with status, row counts, timing metrics, and optional error URL.
- Application Adapters: Thin wrappers around the StreamLoad client to adapt domain models and handle logging and retries.
- Kafka Consumers: Robust consumers supporting single-partition and consumer-group modes, worker pools, graceful shutdown, and manual commit.
Key capabilities:
- CSV and JSON ingestion with flexible options
- Automatic label generation and transaction visibility
- Strict mode and max-filter-ratio for quality control
- Partition targeting for time-partitioned tables
- Comprehensive response handling and error reporting
Architecture Overview
Data flows end to end from external systems, through Kafka, into StarRocks: producers publish domain events to Kafka topics, consumers batch those events, and the StreamLoad client submits each batch to a StarRocks FE, which redirects the load to a BE for ingestion.
Detailed Component Analysis
StreamLoad Client
- Responsibilities:
- Construct HTTP requests with Basic auth and label
- Apply format and options via headers
- Handle redirects and BE proxy substitution
- Parse JSON responses into structured results
- Notable behaviors:
- Auto-generates label if not provided
- Supports CSV and JSON formats
- Sends Expect: 100-continue so large bodies are transmitted only after the server accepts the request
- Limits redirect depth to prevent loops
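The request-construction behaviors above can be sketched as a minimal helper. This is an illustration, not the bi-common client itself: the PUT method, the `/api/{db}/{table}/_stream_load` URL shape, Basic auth, and the label header follow the public StarRocks StreamLoad API, while the helper name and signature are assumptions.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"net/http"
	"time"
)

// buildStreamLoadRequest assembles a StarRocks StreamLoad request with
// Basic auth, a label, and the Expect: 100-continue handshake.
// Hypothetical helper; the real client also applies format/option headers.
func buildStreamLoadRequest(feHost string, fePort int, db, table, user, pass, label string, body []byte) (*http.Request, error) {
	url := fmt.Sprintf("http://%s:%d/api/%s/%s/_stream_load", feHost, fePort, db, table)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	auth := base64.StdEncoding.EncodeToString([]byte(user + ":" + pass))
	req.Header.Set("Authorization", "Basic "+auth)
	// Defer sending the body until the FE accepts (or redirects) the request.
	req.Header.Set("Expect", "100-continue")
	if label == "" {
		// Auto-generate a label when none is provided, as the client does.
		label = fmt.Sprintf("load_%d", time.Now().UnixNano())
	}
	req.Header.Set("label", label)
	return req, nil
}

func main() {
	req, _ := buildStreamLoadRequest("127.0.0.1", 8030, "demo_db", "demo_tbl", "root", "", "", []byte(`{"k":1}`))
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("Expect"))
}
```

In production the client also replaces the FE's redirect target with the configured BE proxy host/port before re-issuing the request, which is what makes local development behind NAT possible.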
Load Options and Configuration
- Options include:
- Label, columns, separators, row delimiter
- Max filter ratio, partitions, strict mode
- Timeout, strip outer array, JSON paths
- Configuration supports:
- Host/port, credentials, timeout, retry, BE proxy host/port
- Defaults applied when unspecified
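The functional-options pattern described above can be sketched as follows. The option names mirror the reference in the appendix, but the struct, defaults, and exact header keys shown here are illustrative assumptions rather than the bi-common implementation:

```go
package main

import "fmt"

// loadOptions collects StreamLoad header values; an Option mutates it.
// Illustrative sketch of the functional-options pattern, not the real API.
type loadOptions struct {
	headers map[string]string
}

type Option func(*loadOptions)

func WithLabel(label string) Option {
	return func(o *loadOptions) { o.headers["label"] = label }
}

func WithMaxFilterRatio(ratio float64) Option {
	return func(o *loadOptions) { o.headers["max_filter_ratio"] = fmt.Sprintf("%g", ratio) }
}

func WithStrictMode(strict bool) Option {
	return func(o *loadOptions) { o.headers["strict_mode"] = fmt.Sprintf("%t", strict) }
}

func WithTimeout(seconds int) Option {
	return func(o *loadOptions) { o.headers["timeout"] = fmt.Sprintf("%d", seconds) }
}

// applyOptions starts from defaults and applies caller overrides in order,
// matching the "defaults applied when unspecified" behavior above.
func applyOptions(opts ...Option) map[string]string {
	o := &loadOptions{headers: map[string]string{
		"format":  "json",
		"timeout": "600", // default, overridden by WithTimeout
	}}
	for _, opt := range opts {
		opt(o)
	}
	return o.headers
}

func main() {
	h := applyOptions(WithLabel("orders_20240101"), WithMaxFilterRatio(0.01), WithStrictMode(true))
	fmt.Println(h["label"], h["max_filter_ratio"], h["strict_mode"])
}
```

The pattern keeps the client call site readable while letting each option translate directly into one StreamLoad HTTP header.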
Application StreamLoad Adapters
- bi-api-jushuitan and bi-basic wrap the StreamLoad client to:
- Initialize client from datasource configuration
- Convert domain models to JSON
- Log successes/failures and handle empty batches
- Use strip-outer-array for JSON arrays
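A minimal sketch of the adapter pattern, assuming a hypothetical Order domain model; the real adapters in bi-api-jushuitan and bi-basic marshal their own types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Order is a stand-in domain model for illustration only.
type Order struct {
	ID     int64   `json:"id"`
	Amount float64 `json:"amount"`
}

// toStreamLoadBody marshals a batch as a top-level JSON array, which pairs
// with the strip_outer_array=true StreamLoad header so each element becomes
// a row. Empty batches return nil so callers can skip the load entirely,
// mirroring the adapters' empty-batch handling.
func toStreamLoadBody(orders []Order) ([]byte, error) {
	if len(orders) == 0 {
		return nil, nil
	}
	return json.Marshal(orders)
}

func main() {
	body, _ := toStreamLoadBody([]Order{{ID: 1, Amount: 9.5}})
	fmt.Println(string(body)) // [{"id":1,"amount":9.5}]
}
```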
Kafka Consumers and Multi-Topic Management
- Consumer supports:
- Single partition and consumer group modes
- Worker pools, graceful shutdown, manual commit
- Health checks, lag, and stats
- ConsumerManager:
- Registers handlers globally or per-instance
- Starts multiple consumers with worker counts
- Applies ConsumeConfig for retries and DLQ
- MultiTopicConsumer:
- Manages multiple topics with shared config
- Parallel consumption with coordinated shutdown
Retry and Backoff Policy
- Exponential backoff with jitter and capped maximum
- Configurable max retries and initial/backoff bounds
- Used to tune resilience of downstream processing
Dependency Analysis
- StreamLoad client depends on:
- HTTP transport with custom redirect handling
- Base64 encoding for Basic auth
- JSON parsing for responses
- Application adapters depend on:
- Datasource configuration (host, port, credentials)
- Logging helpers
- Optional BE proxy settings
- Kafka consumers depend on:
- segmentio/kafka-go
- Environment and configuration sources
- Optional SASL/TLS settings
Performance Considerations
- Batch sizing:
- Recommended batch size: 10 MB to 100 MB to reduce transaction overhead and improve throughput
- Timeout tuning:
- Increase StreamLoad timeout for large payloads
- Align consumer poll and commit intervals to reduce latency
- Compression:
- Prefer compact payload formats; when loading JSON arrays, set strip_outer_array so the top-level array is parsed as rows
- Use appropriate Kafka compression (snappy/lz4/zstd) at producer level
- Partitioning:
- Use partition-aware loading to target time-partitioned tables
- Ensure adequate partition counts for Kafka topics
- Strict mode and filter ratio:
- Enable strict mode for schema enforcement
- Set max_filter_ratio to tolerate controlled error rates
- Monitoring:
- Track LoadTimeMs, BeginTxnTimeMs, WriteDataTimeMs, and CommitAndPublishTimeMs
- Monitor compaction_score and lag to detect bottlenecks
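The batch-sizing guidance above can be applied by chunking rows into byte-bounded batches before submitting each one as a single StreamLoad transaction. A sketch (the byte limit comes from the recommendation above, not from any API constant; the function name is an assumption):

```go
package main

import "fmt"

// chunkByBytes splits serialized rows into batches whose total size stays
// under maxBytes, so each StreamLoad call carries one right-sized
// transaction. A single row larger than maxBytes becomes its own batch
// rather than being dropped.
func chunkByBytes(rows [][]byte, maxBytes int) [][][]byte {
	var batches [][][]byte
	var cur [][]byte
	size := 0
	for _, r := range rows {
		if size+len(r) > maxBytes && len(cur) > 0 {
			batches = append(batches, cur)
			cur, size = nil, 0
		}
		cur = append(cur, r)
		size += len(r)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	rows := [][]byte{make([]byte, 40), make([]byte, 40), make([]byte, 40)}
	fmt.Println(len(chunkByBytes(rows, 100))) // 2 batches: 80 bytes + 40 bytes
}
```

Bounding batches by bytes rather than row count keeps transaction overhead predictable even when row sizes vary widely across tables.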
Troubleshooting Guide
Common issues and remedies:
- Authentication failures:
- Verify username/password and Basic auth header construction
- Redirect loops or BE address resolution:
- Configure BE proxy host/port for local development
- Duplicate label errors:
- Ensure label uniqueness per import
- Large payload timeouts:
- Increase StreamLoad timeout and consider chunking
- JSON parsing errors:
- Use strip_outer_array for top-level arrays
- Validate JSONPaths when required
- Consumer lag accumulation:
- Increase worker pool size or adjust commit intervals
- Check broker connectivity and SASL/TLS settings
Operational checks:
- Use simulation tests to validate end-to-end ingestion
- Inspect response.ErrorURL for detailed failure diagnostics
- Confirm partition assignment and group rebalance behavior
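When diagnosing a failed or partially filtered load, the response body can be inspected directly. The field names below (Status, Message, NumberLoadedRows, NumberFilteredRows, ErrorURL) match the public StarRocks StreamLoad response format; the diagnose helper itself is an illustrative assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamLoadResponse mirrors the JSON body StarRocks returns from
// /_stream_load (field names follow the public response format).
type streamLoadResponse struct {
	Status             string `json:"Status"`
	Message            string `json:"Message"`
	NumberLoadedRows   int64  `json:"NumberLoadedRows"`
	NumberFilteredRows int64  `json:"NumberFilteredRows"`
	ErrorURL           string `json:"ErrorURL"`
}

// diagnose summarizes a load result and points at ErrorURL whenever rows
// were filtered or the load failed outright.
func diagnose(body []byte) (string, error) {
	var r streamLoadResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	if r.Status != "Success" {
		return fmt.Sprintf("load failed: %s (details: %s)", r.Message, r.ErrorURL), nil
	}
	if r.NumberFilteredRows > 0 {
		return fmt.Sprintf("loaded %d rows, filtered %d (details: %s)",
			r.NumberLoadedRows, r.NumberFilteredRows, r.ErrorURL), nil
	}
	return fmt.Sprintf("loaded %d rows", r.NumberLoadedRows), nil
}

func main() {
	body := []byte(`{"Status":"Fail","Message":"too many filtered rows","ErrorURL":"http://be:8040/err"}`)
	msg, _ := diagnose(body)
	fmt.Println(msg)
}
```

Fetching the ErrorURL returns the individual rejected rows with per-row reasons, which is usually the fastest route from "load failed" to the offending data.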
Conclusion
The BI platform leverages a robust StreamLoad client and Kafka-based ingestion pipeline to achieve efficient, scalable, and observable data loading into StarRocks. By applying recommended batch sizes, timeouts, partitioning, and strict quality controls, teams can optimize throughput while maintaining reliability. The provided adapters and consumers offer practical building blocks for incremental updates, monitoring, and operational resilience.
Appendices
StreamLoad Options Reference
- WithLabel(label): Set import label
- WithColumns(cols...): Column mapping
- WithColumnSeparator(sep): CSV column separator
- WithRowDelimiter(delim): Row delimiter
- WithMaxFilterRatio(ratio): Maximum filtered row ratio
- WithPartitions(partitions): Target partitions
- WithStrictMode(strict): Enable strict mode
- WithTimeout(seconds): Request timeout
- WithStripOuterArray(strip): Strip top-level JSON array
- WithJsonPaths(paths): JSON path mapping
Configuration Examples
- StarRocks datasource configuration (YAML):
- Host, port, username, password, pool settings, and log levels
- Kafka consumer configuration:
- Brokers, topic, group ID, worker counts, offsets, commit intervals, isolation level
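A sketch of what such configuration might look like in YAML. The key names and values below are illustrative assumptions: the actual keys are defined by the platform's configuration loader, not by this document.

```yaml
# Illustrative shape only; real key names come from the platform's config loader.
starrocks:
  host: starrocks-fe.internal
  port: 8030                 # FE HTTP port used by StreamLoad
  username: bi_loader
  password: ${STARROCKS_PASSWORD}
  timeout_seconds: 600
  retries: 3
  be_proxy_host: 127.0.0.1   # optional, for local development
  be_proxy_port: 8040

kafka:
  brokers: ["kafka-1:9092", "kafka-2:9092"]
  topic: bi.events
  group_id: bi-loader
  workers: 4
  start_offset: latest
  commit_interval_ms: 1000
  isolation_level: read_committed
```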
Data Quality and Validation
- Use strict mode to enforce schema correctness
- Allow controlled error rates via max_filter_ratio
- Validate JSON payloads and column mappings
- Monitor response metrics and error URLs
Rollback and Recovery Procedures
- For failed loads:
- Inspect response status and message
- Use ErrorURL for detailed diagnostics
- Reattempt with corrected data or reduced batch size
- For consumer issues:
- Adjust worker pool size and commit intervals
- Investigate lag and rebalance behavior
- Use manual commit mode for idempotent processing