Partitioning and Indexing Strategies
Table of Contents
- Introduction
- Project Structure
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Dependency Analysis
- Performance Considerations
- Troubleshooting Guide
- Conclusion
Introduction
This document explains StarRocks partitioning and indexing strategies used in the BI Analysis Platform. It focuses on:
- Hash partitioning and bucket allocation via DISTRIBUTED BY HASH for balanced query distribution
- Bitmap index usage for low-cardinality filtering columns such as status, deleted_at, and tenant_id
- Primary key and unique index patterns for data integrity
- Index selection criteria aligned with query patterns and trade-offs between maintenance overhead and query performance
- Partition sizing recommendations, bucket number optimization, and maintenance procedures
- Impact of partitioning on data loading performance and query execution plans
Project Structure
The repository contains multiple modules that define StarRocks schemas and operational guidelines:
- Database design rules and recommendations for StarRocks
- Concrete StarRocks table definitions across modules (tenant, system, chat analytics)
- Infrastructure notes on StarRocks usage and index strategies
- Tools for profiling StarRocks queries
Diagram sources
- [protocols-db.md]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [xcbi.sql]
- [table-templates.md]
- [Infrastructure.md]
- [db_client.py]
Section sources
- [protocols-db.md]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [xcbi.sql]
- [table-templates.md]
- [Infrastructure.md]
- [db_client.py]
Core Components
- Partitioning and distribution
- Hash distribution by high-cardinality keys ensures even data spread across backends
- Dynamic partitions enable automated monthly partition creation and retention policies
- Indexing
- Bitmap indexes on low-cardinality filters (status, deleted_at, tenant_id) accelerate predicate evaluation
- Persistent index enabled for primary-key tables to improve scan performance
- Query profiling
- Query dump and profile collection support performance diagnostics and plan inspection
Key implementation references:
- Hash distribution and dynamic partitioning in tenant and system modules
- Bitmap indexes on status and deleted_at fields
- Persistent index property on primary-key tables
- Query profiling utilities
Section sources
- [schema.sql (bi-tenant)]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [xcbi.sql]
- [table-templates.md]
- [protocols-db.md]
- [db_client.py]
Architecture Overview
The BI platform leverages StarRocks for analytical workloads with:
- Fact tables using duplicate key for event logs with dynamic partitions
- Dimension tables using primary key with persistent index and bitmap indexes
- Distributed hashing on high-cardinality keys to balance load
- Query profiling to validate execution plans and optimize further
Diagram sources
Detailed Component Analysis
Hash Partitioning and Bucket Allocation
- Distribution strategy
- Tables distribute data using hash on high-cardinality keys (e.g., tenant_id, user_id, log_id) to achieve uniform write/read distribution
- Bucket count set to moderate values to balance parallelism and metadata overhead
- Dynamic partitions
- Monthly partitioning with automatic creation and retention windows reduces manual maintenance and supports time-based pruning
- Recommendations
- Align bucket count with cluster size and expected row counts per tablet
- Prefer colocate join for frequently joined large tables by aligning distribution keys and bucket counts
Diagram sources
Section sources
Bitmap Index Implementation
- Purpose
- Accelerate filtering on low-cardinality columns (status, deleted_at, tenant_id) commonly used in WHERE clauses
- Placement
- Created on dimension and fact tables where these filters are frequent
- Trade-offs
- Bitmap indexes reduce scan time but increase storage and update overhead during writes
- Recommended for static or slowly changing attributes; avoid on very high-cardinality columns
Diagram sources
- [schema.sql (bi-tenant)]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [table-templates.md]
Section sources
- [schema.sql (bi-tenant)]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [table-templates.md]
- [protocols-db.md]
Primary Key Design Patterns and Unique Index Strategies
- Primary key tables
- Use primary key model for business entities requiring updatable/deletable records
- Define ORDER BY key as the primary key to leverage prefix index and sort efficiency
- Enable persistent index to improve read performance on scans
- Unique constraints
- Enforce uniqueness on identifiers (e.g., tenant_id, user_id) to maintain referential integrity
- Recommendations
- Keep primary key columns non-null and constrained
- Place frequently filtered columns early in sort key to maximize prefix index hit rate
Diagram sources
Section sources
Index Selection Criteria Based on Query Patterns
- Choose bitmap indexes for:
- Low-cardinality filters (status, is_deleted, tenant_id)
- Frequent equality or IN predicates
- Avoid bitmap indexes for:
- Very high-cardinality columns (e.g., IDs used solely for joins)
- Bloom filter considerations:
- Enable bloom filters on high-cardinality columns used in equality predicates to reduce probe cost
- Persistent index:
- Beneficial for primary-key tables with wide scans and frequent reads
Diagram sources
Section sources
Partition Sizing Recommendations and Bucket Number Optimization
- Partition sizing
- Use monthly partitions for time-series logs; adjust window size based on retention needs and ingestion volume
- Ensure partition pruning is effective by pushing time filters to leverage range partitions
- Bucket optimization
- Target 1–10 GB per tablet; compute buckets ≈ total_size / 5 GB
- For small tables (<100M rows), set buckets to 3–5 × number of BE nodes or use auto bucketing
- Avoid too few buckets (metadata pressure) or too many buckets (reduced parallelism)
- Co-location joins
- Align distribution keys and bucket counts for frequently joined large tables to eliminate network shuffle
Diagram sources
Section sources
Index Maintenance Procedures
- Monitoring
- Track write amplification caused by bitmap and persistent index updates
- Observe partition pruning effectiveness and bucket skew
- Tuning
- Adjust bucket counts and tablet sizes based on growth trends
- Rebuild or reorganize partitions periodically for skewed distributions
- Operational notes
- Use dynamic partitions to automate lifecycle management
- Apply retention policies to drop old partitions regularly
[No sources needed since this section provides general guidance]
Impact on Data Loading Performance and Query Execution Plans
- Loading
- Hash distribution balances write throughput; ensure adequate bucket count to prevent hotspots
- Dynamic partitions simplify retention and reduce administrative overhead
- Query plans
- Bitmap indexes reduce row groups scanned for low-cardinality filters
- Persistent index improves scan performance on primary-key tables
- Use query profiling to inspect execution plans and validate index usage
Diagram sources
Section sources
Dependency Analysis
- Table definitions depend on:
- Design rules for partitioning, distribution, and index selection
- Module-specific schemas that demonstrate practical usage
- Query profiling depends on:
- StarRocks client utilities to capture execution dumps and profiles
Diagram sources
- [protocols-db.md]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [xcbi.sql]
- [table-templates.md]
- [db_client.py]
Section sources
- [protocols-db.md]
- [schema.sql (bi-tenant)]
- [schema.sql (bi-sys)]
- [xcbi.sql]
- [table-templates.md]
- [db_client.py]
Performance Considerations
- Prefer monthly dynamic partitions for time-series logs to simplify lifecycle management
- Use bitmap indexes on low-cardinality filters; avoid on high-cardinality columns
- Enable persistent index for primary-key tables with frequent scans
- Monitor and tune bucket counts to maintain balanced parallelism and tablet sizes
- Leverage query profiling to validate plan efficiency and index usage
[No sources needed since this section provides general guidance]
Troubleshooting Guide
- Slow queries on low-cardinality filters
- Verify bitmap indexes exist and are being used
- Confirm partition pruning is active by pushing time filters
- Skewed distribution or hotspots
- Increase bucket count or adjust distribution key
- Reassess partition boundaries and retention windows
- Excessive write overhead
- Limit bitmap indexes on highly active columns
- Consider bloom filters for equality predicates on high-cardinality columns
- Query plan inspection
- Use profiler utilities to capture execution dumps and measure durations
Section sources
Conclusion
The BI Analysis Platform employs StarRocks partitioning and indexing best practices:
- Hash distribution with dynamic partitions for balanced, scalable ingestion and pruning
- Bitmap indexes on low-cardinality filters to accelerate common queries
- Persistent index and primary key modeling for integrity and read performance
- Practical tuning guidelines for bucket sizing, co-location joins, and maintenance These strategies collectively improve query performance while controlling maintenance overhead and ensuring predictable scaling.