Skip to content

Partitioning and Indexing Strategies

**Referenced Files in This Document** - [[protocols-db.md]](file/ui-web/.agent/rules/protocols-db.md) - [[schema.sql (bi-tenant)]](file/bi-tenant/docs/database/schema.sql) - [[schema.sql (bi-sys)]](file/bi-sys/docs/database/schema.sql) - [[xcbi.sql]](file/bi-chat/bi-chat/src/db/xcbi.sql) - [[table-templates.md]](file/bi-basic/.agent/skills/bi-database-design/references/table-templates.md) - [[Infrastructure.md]](file/ui-web-docs/pages/zh/xcbi-dev/.md) - [[db_client.py]](file/mcp-server-starrocks/src/mcp-server-starrocks/db-client.py)

Table of Contents

  1. Introduction
  2. Project Structure
  3. Core Components
  4. Architecture Overview
  5. Detailed Component Analysis
  6. Dependency Analysis
  7. Performance Considerations
  8. Troubleshooting Guide
  9. Conclusion

Introduction

This document explains StarRocks partitioning and indexing strategies used in the BI Analysis Platform. It focuses on:

  • Hash partitioning and bucket allocation via DISTRIBUTED BY HASH for balanced query distribution
  • Bitmap index usage for low-cardinality filtering columns such as status, deleted_at, and tenant_id
  • Primary key and unique index patterns for data integrity
  • Index selection criteria aligned with query patterns and trade-offs between maintenance overhead and query performance
  • Partition sizing recommendations, bucket number optimization, and maintenance procedures
  • Impact of partitioning on data loading performance and query execution plans

Project Structure

The repository contains multiple modules that define StarRocks schemas and operational guidelines:

  • Database design rules and recommendations for StarRocks
  • Concrete StarRocks table definitions across modules (tenant, system, chat analytics)
  • Infrastructure notes on StarRocks usage and index strategies
  • Tools for profiling StarRocks queries

Diagram sources

Section sources

Core Components

  • Partitioning and distribution
    • Hash distribution by high-cardinality keys ensures even data spread across backends
    • Dynamic partitions enable automated monthly partition creation and retention policies
  • Indexing
    • Bitmap indexes on low-cardinality filters (status, deleted_at, tenant_id) accelerate predicate evaluation
    • Persistent index enabled for primary-key tables to improve scan performance
  • Query profiling
    • Query dump and profile collection support performance diagnostics and plan inspection

Key implementation references:

  • Hash distribution and dynamic partitioning in tenant and system modules
  • Bitmap indexes on status and deleted_at fields
  • Persistent index property on primary-key tables
  • Query profiling utilities

Section sources

Architecture Overview

The BI platform leverages StarRocks for analytical workloads with:

  • Fact tables using duplicate key for event logs with dynamic partitions
  • Dimension tables using primary key with persistent index and bitmap indexes
  • Distributed hashing on high-cardinality keys to balance load
  • Query profiling to validate execution plans and optimize further

Diagram sources

Detailed Component Analysis

Hash Partitioning and Bucket Allocation

  • Distribution strategy
    • Tables distribute data using hash on high-cardinality keys (e.g., tenant_id, user_id, log_id) to achieve uniform write/read distribution
    • Bucket count set to moderate values to balance parallelism and metadata overhead
  • Dynamic partitions
    • Monthly partitioning with automatic creation and retention windows reduces manual maintenance and supports time-based pruning
  • Recommendations
    • Align bucket count with cluster size and expected row counts per tablet
    • Prefer colocate join for frequently joined large tables by aligning distribution keys and bucket counts

Diagram sources

Section sources

Bitmap Index Implementation

  • Purpose
    • Accelerate filtering on low-cardinality columns (status, deleted_at, tenant_id) commonly used in WHERE clauses
  • Placement
    • Created on dimension and fact tables where these filters are frequent
  • Trade-offs
    • Bitmap indexes reduce scan time but increase storage and update overhead during writes
    • Recommended for static or slowly changing attributes; avoid on very high-cardinality columns

Diagram sources

Section sources

Primary Key Design Patterns and Unique Index Strategies

  • Primary key tables
    • Use primary key model for business entities requiring updatable/deletable records
    • Define ORDER BY key as the primary key to leverage prefix index and sort efficiency
    • Enable persistent index to improve read performance on scans
  • Unique constraints
    • Enforce uniqueness on identifiers (e.g., tenant_id, user_id) to maintain referential integrity
  • Recommendations
    • Keep primary key columns non-null and constrained
    • Place frequently filtered columns early in sort key to maximize prefix index hit rate

Diagram sources

Section sources

Index Selection Criteria Based on Query Patterns

  • Choose bitmap indexes for:
    • Low-cardinality filters (status, is_deleted, tenant_id)
    • Frequent equality or IN predicates
  • Avoid bitmap indexes for:
    • Very high-cardinality columns (e.g., IDs used solely for joins)
  • Bloom filter considerations:
    • Enable bloom filters on high-cardinality columns used in equality predicates to reduce probe cost
  • Persistent index:
    • Beneficial for primary-key tables with wide scans and frequent reads

Diagram sources

Section sources

Partition Sizing Recommendations and Bucket Number Optimization

  • Partition sizing
    • Use monthly partitions for time-series logs; adjust window size based on retention needs and ingestion volume
    • Ensure partition pruning is effective by pushing time filters to leverage range partitions
  • Bucket optimization
    • Target 1–10 GB per tablet; compute buckets ≈ total_size / 5 GB
    • For small tables (<100M rows), set buckets to 3–5 × number of BE nodes or use auto bucketing
    • Avoid too few buckets (metadata pressure) or too many buckets (reduced parallelism)
  • Co-location joins
    • Align distribution keys and bucket counts for frequently joined large tables to eliminate network shuffle

Diagram sources

Section sources

Index Maintenance Procedures

  • Monitoring
    • Track write amplification caused by bitmap and persistent index updates
    • Observe partition pruning effectiveness and bucket skew
  • Tuning
    • Adjust bucket counts and tablet sizes based on growth trends
    • Rebuild or reorganize partitions periodically for skewed distributions
  • Operational notes
    • Use dynamic partitions to automate lifecycle management
    • Apply retention policies to drop old partitions regularly

[No sources needed since this section provides general guidance]

Impact on Data Loading Performance and Query Execution Plans

  • Loading
    • Hash distribution balances write throughput; ensure adequate bucket count to prevent hotspots
    • Dynamic partitions simplify retention and reduce administrative overhead
  • Query plans
    • Bitmap indexes reduce row groups scanned for low-cardinality filters
    • Persistent index improves scan performance on primary-key tables
    • Use query profiling to inspect execution plans and validate index usage

Diagram sources

Section sources

Dependency Analysis

  • Table definitions depend on:
    • Design rules for partitioning, distribution, and index selection
    • Module-specific schemas that demonstrate practical usage
  • Query profiling depends on:
    • StarRocks client utilities to capture execution dumps and profiles

Diagram sources

Section sources

Performance Considerations

  • Prefer monthly dynamic partitions for time-series logs to simplify lifecycle management
  • Use bitmap indexes on low-cardinality filters; avoid on high-cardinality columns
  • Enable persistent index for primary-key tables with frequent scans
  • Monitor and tune bucket counts to maintain balanced parallelism and tablet sizes
  • Leverage query profiling to validate plan efficiency and index usage

[No sources needed since this section provides general guidance]

Troubleshooting Guide

  • Slow queries on low-cardinality filters
    • Verify bitmap indexes exist and are being used
    • Confirm partition pruning is active by pushing time filters
  • Skewed distribution or hotspots
    • Increase bucket count or adjust distribution key
    • Reassess partition boundaries and retention windows
  • Excessive write overhead
    • Limit bitmap indexes on highly active columns
    • Consider bloom filters for equality predicates on high-cardinality columns
  • Query plan inspection
    • Use profiler utilities to capture execution dumps and measure durations

Section sources

Conclusion

The BI Analysis Platform employs StarRocks partitioning and indexing best practices:

  • Hash distribution with dynamic partitions for balanced, scalable ingestion and pruning
  • Bitmap indexes on low-cardinality filters to accelerate common queries
  • Persistent index and primary key modeling for integrity and read performance
  • Practical tuning guidelines for bucket sizing, co-location joins, and maintenance These strategies collectively improve query performance while controlling maintenance overhead and ensuring predictable scaling.