Data Governance and Monitoring

**Referenced Files in This Document**

  • [logger.go](file/bi-common/observability/logger/logger.go)
  • [metrics.go](file/bi-common/observability/metrics/metrics.go)
  • [common.pb.go](file/bi-common/conf/common.pb.go)
  • [k8s-cluster.json](file/bi-intra/charts/grafana/dashboards/k8s-cluster.json)
  • [_config.tpl](file/bi-intra/charts/grafana/templates/config.tpl)
  • [design.md](file/bi-analysis/docs/database/design.md)
  • [import_metrics.py](file/bi-chat/bi-chat/src/scripts/import-metrics.py)
  • [security-patterns.md](file/bi-basic/.agent/skills/bi-security/references/security-patterns.md)
  • [security-patterns.md](file/.agent/skills/bi-security/references/security-patterns.md)
  • [step-06-validation-design-check.md](file/bmad/bmb/workflows/workflow/steps-v/step-06-validation-design-check.md)
  • [step-02-file-structure.md](file/bmad/bmb/workflows/module/steps-v/step-02-file-structure.md)
  • [step-09-cohesive-review.md](file/bmad/bmb/workflows/workflow/steps-v/step-09-cohesive-review.md)
  • [conceptual.md](file/ui-web-docs/pages/en/openspec/concepts.md)
  • [commands.html](file/ui-web-docs/dist/zh/openspec/commands.html)
  • [db_client.py](file/mcp-server-starrocks/src/mcp-server-starrocks/db-client.py)

Table of Contents

  1. Introduction
  2. Project Structure
  3. Core Components
  4. Architecture Overview
  5. Detailed Component Analysis
  6. Dependency Analysis
  7. Performance Considerations
  8. Troubleshooting Guide
  9. Conclusion
  10. Appendices

Introduction

This document defines the data governance and monitoring framework for the BI Analysis Platform. It covers:

  • Data quality monitoring: validation rules, anomaly detection, and completeness checks
  • Logging architecture: structured logging, rotation, and centralized ingestion
  • Metrics collection: database performance, query execution times, and pipeline throughput
  • Audit trail: data changes, user actions, and system events
  • Data lineage and impact analysis for schema changes
  • Compliance reporting and retention/archival/backup procedures
  • Dashboards, alerting thresholds, and incident response procedures

Project Structure

The platform is composed of multiple microservices and shared libraries. Observability primitives are provided by the bi-common module, while domain-specific dashboards and alerting are managed via Helm charts in bi-intra. Data governance and validation workflows are captured in BMAD specifications.

Core Components

  • Structured logging with configurable outputs and rotation
  • Centralized logging via Aliyun SLS with fallback to local files
  • Prometheus metrics for HTTP requests, DB queries, cache, and business operations
  • Grafana dashboards and alerting configurations
  • Validation and governance workflows for quality assurance
  • Security audit logging patterns for sensitive operations
  • Data lineage ingestion for metrics and schema-aware impact analysis

Architecture Overview

The monitoring stack integrates logging, metrics, and dashboards across services. Logs are emitted in structured JSON or plain-text formats, rotated by size or on a daily schedule, and optionally mirrored to Aliyun SLS. Metrics are exposed via Prometheus and visualized in Grafana dashboards with threshold-based alerts.

Detailed Component Analysis

Logging Architecture

  • Configuration supports JSON or text format; stdout, file, or both outputs; and optional Aliyun SLS forwarding
  • File rotation is either size-based (via lumberjack) or daily, with cleanup hooks; a configuration sketch follows this list
  • The Aliyun logger is optional; initialization failures are logged but do not block local logging
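
A minimal sketch of that configuration model, pairing Go's standard log/slog with lumberjack for size-based rotation. The `Config` struct and `NewLogger` function here are illustrative stand-ins, not the actual API of logger.go:

```go
package logging

import (
	"io"
	"log/slog"
	"os"

	"gopkg.in/natefinch/lumberjack.v2"
)

// Config mirrors the rotation fields described above. The exact field
// names in logger.go may differ; this shape is illustrative.
type Config struct {
	Format     string // "json" or "text"
	Output     string // "stdout", "file", or "both"
	Path       string // log file path when Output includes "file"
	MaxSizeMB  int    // size-based rotation threshold
	MaxAgeDays int    // delete rotated files older than this
	MaxBackups int    // keep at most this many rotated files
	Compress   bool   // gzip rotated files
}

// NewLogger builds a structured logger with size-based rotation.
func NewLogger(cfg Config) *slog.Logger {
	var w io.Writer = os.Stdout
	if cfg.Output == "file" || cfg.Output == "both" {
		rotated := &lumberjack.Logger{
			Filename:   cfg.Path,
			MaxSize:    cfg.MaxSizeMB,
			MaxAge:     cfg.MaxAgeDays,
			MaxBackups: cfg.MaxBackups,
			Compress:   cfg.Compress,
		}
		if cfg.Output == "both" {
			w = io.MultiWriter(os.Stdout, rotated)
		} else {
			w = rotated
		}
	}
	if cfg.Format == "json" {
		return slog.New(slog.NewJSONHandler(w, nil))
	}
	return slog.New(slog.NewTextHandler(w, nil))
}
```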

Metrics Collection

  • HTTP request counts and durations by method/path
  • DB query counts, durations, and errors by table/operation
  • Cache hits/misses and operation durations by cache type
  • Business operation counters by module/operation/status (a registration sketch follows this list)
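
The sketch below registers such metrics with Prometheus's client_golang; the metric and label names are illustrative, as the real definitions live in metrics.go:

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total", Help: "HTTP requests by method and path."},
		[]string{"method", "path"},
	)
	httpDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
	dbErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "db_query_errors_total", Help: "DB query errors by table and operation."},
		[]string{"table", "operation"},
	)
	cacheHits = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "cache_hits_total", Help: "Cache hits by cache type."},
		[]string{"cache_type"},
	)
	bizOps = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "business_operations_total", Help: "Business operations by module, operation, and status."},
		[]string{"module", "operation", "status"},
	)
)

func init() {
	prometheus.MustRegister(httpRequests, httpDuration, dbErrors, cacheHits, bizOps)
}

// Handler exposes the scrape endpoint.
func Handler() http.Handler { return promhttp.Handler() }
```

Handlers observe into these vectors at request, query, and cache boundaries; Prometheus scrapes the endpoint served by `Handler()`.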

Data Quality Monitoring Framework

  • Validation rules: enforce the presence of required headers in CSV files, criteria in Markdown, and explicit pass/fail criteria in validation steps (a header-check sketch follows this list)
  • Anomaly detection: leverage metrics histograms for latency and error spikes; configure Grafana thresholds
  • Completeness checks: ensure validation data files exist and are referenced in workflow frontmatter
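
As an example of the first rule, a hedged sketch of a CSV header completeness check; `RequireHeaders` and its required-column list are hypothetical, since the actual rules are defined in the BMAD validation workflows:

```go
package quality

import (
	"encoding/csv"
	"fmt"
	"os"
)

// RequireHeaders verifies that a CSV file's first row contains every
// required column. The required set is supplied by the caller; the
// real validation workflows define their own column lists.
func RequireHeaders(path string, required []string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open %s: %w", path, err)
	}
	defer f.Close()

	header, err := csv.NewReader(f).Read()
	if err != nil {
		return fmt.Errorf("read header of %s: %w", path, err)
	}

	present := make(map[string]bool, len(header))
	for _, col := range header {
		present[col] = true
	}
	for _, col := range required {
		if !present[col] {
			return fmt.Errorf("%s: missing required column %q", path, col)
		}
	}
	return nil
}
```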

Audit Trail Implementation

  • The security audit log model captures event type, user identity, resource, action, result, risk level, and timestamp (a struct sketch follows this list)
  • Required audit categories include login/logout, permission changes, data export, password reset, sensitive data access, configuration changes, and account lock/unlock
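
A Go rendering of that model might look as follows; the field and constant names are illustrative, with the canonical structure defined in security-patterns.md:

```go
package audit

import "time"

// RiskLevel grades the sensitivity of an audited operation.
type RiskLevel string

const (
	RiskLow    RiskLevel = "low"
	RiskMedium RiskLevel = "medium"
	RiskHigh   RiskLevel = "high"
)

// Event mirrors the audit fields listed above; names are illustrative.
type Event struct {
	EventType string    `json:"event_type"` // e.g. "login", "data_export"
	UserID    string    `json:"user_id"`
	Resource  string    `json:"resource"`
	Action    string    `json:"action"`
	Result    string    `json:"result"` // "success" or "failure"
	Risk      RiskLevel `json:"risk_level"`
	Timestamp time.Time `json:"timestamp"`
}
```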

Data Lineage Tracking and Impact Analysis

  • Metrics are imported into a graph model with nodes for metrics, categories, remarks, and paths
  • When a path is provided (e.g., table.field), the script links the metric to the corresponding column node for lineage (a parsing sketch follows this list)
  • Schema evolution is tracked in domain documentation; lineage can then be used to assess the impact of schema changes
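
The import script itself is Python (import_metrics.py) and writes to a graph store; the Go sketch below illustrates only the path-handling step, deriving a metric-to-column edge from a table.field path. `Edge` and `LinkMetric` are hypothetical names:

```go
package lineage

import (
	"fmt"
	"strings"
)

// Edge links a metric node to a column node. This in-memory model is a
// simplification of the graph built during metric import.
type Edge struct {
	Metric string
	Table  string
	Column string
}

// LinkMetric resolves a "table.field" path into a lineage edge,
// mirroring the path-handling step described above.
func LinkMetric(metric, path string) (Edge, error) {
	parts := strings.SplitN(path, ".", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return Edge{}, fmt.Errorf("path %q is not of the form table.field", path)
	}
	return Edge{Metric: metric, Table: parts[0], Column: parts[1]}, nil
}
```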

Database Performance and Query Execution Monitoring

  • The StarRocks MCP server collects performance analysis inputs and measures query duration with profiling enabled
  • Errors are captured alongside timing and row counts for downstream alerting and analysis (a timing sketch follows this list)
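
A minimal sketch of such a measurement wrapper over database/sql, assuming a MySQL-protocol connection to StarRocks; `QueryStats` and `TimedQuery` are illustrative names, and enabling profiling (for example via StarRocks' enable_profile session variable) is assumed to be handled separately:

```go
package perf

import (
	"context"
	"database/sql"
	"time"
)

// QueryStats captures the timing, row count, and error outcome signals
// described above. Names here are illustrative.
type QueryStats struct {
	SQL      string
	Duration time.Duration
	Rows     int64
	Err      error
}

// TimedQuery runs a statement and records duration, rows, and errors
// so they can be exported for alerting and analysis.
func TimedQuery(ctx context.Context, db *sql.DB, query string) QueryStats {
	start := time.Now()
	rows, err := db.QueryContext(ctx, query)
	stats := QueryStats{SQL: query, Err: err}
	if err == nil {
		for rows.Next() {
			stats.Rows++
		}
		stats.Err = rows.Err()
		rows.Close()
	}
	stats.Duration = time.Since(start)
	return stats
}
```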

Compliance Reporting and Retention/Archival/Backup

  • The archiving process merges deltas into a clean state, while the archive retains the full change context for historical understanding
  • Bulk archive operations detect cross-change conflicts and resolve them via verification and implementation checks

Dependency Analysis

  • Logging depends on configuration schema for file rotation parameters and supports optional Aliyun forwarding
  • Metrics are exported to Prometheus; Grafana consumes them with dashboard-specific thresholds
  • Validation workflows depend on structured validation data presence and explicit criteria
  • Security audit patterns define the canonical log structure for compliance

Performance Considerations

  • Prefer size-based rotation with reasonable max age and backup counts; enable compression to reduce disk usage
  • Use daily rotation in high-volume environments to keep files partitioned by day and cleanup predictable
  • Keep JSON logs for machine parsing and text logs for human readability
  • Expose metrics with bounded label cardinality to avoid series explosion and scrape overhead (see the sketch after this list)
  • Monitor cache hit ratios and DB query latency histograms to identify bottlenecks early
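
As a sketch of the bucket and cardinality guidance, an illustrative (not actual) histogram definition:

```go
package perf

import "github.com/prometheus/client_golang/prometheus"

// ExponentialBuckets(0.005, 2, 12) spans 5 ms to roughly 10 s in twelve
// buckets, keeping the series count low. Labels stay bounded: never use
// raw SQL text or user IDs as label values, since each distinct value
// creates a new time series.
var queryLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "db_query_duration_seconds", // illustrative name
		Help:    "Query latency by table and operation.",
		Buckets: prometheus.ExponentialBuckets(0.005, 2, 12),
	},
	[]string{"table", "operation"},
)
```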

Troubleshooting Guide

  • Logging issues
    • Verify configuration fields for path, max size, max age, backups, and daily rotation
    • Check Aliyun logger initialization errors; local logging remains active if Aliyun fails
  • Metrics anomalies
    • Inspect HTTP and DB error counters; compare durations against histogram buckets
    • Confirm Prometheus scrape targets and label values
  • Validation failures
    • Ensure validation data files exist and meet structural requirements
    • Confirm pass/fail criteria and systematic check sequencing
  • Audit trail gaps
    • Confirm security audit log fields and risk levels are populated for required operations

Conclusion

The BI Analysis Platform implements a robust observability foundation with structured logging, configurable rotation, and centralized forwarding. Metrics capture critical performance signals, while Grafana dashboards and thresholds enable proactive monitoring. Governance workflows ensure validation rigor, and security audit patterns support compliance. Data lineage and schema documentation facilitate impact analysis for changes. Archival and bulk operations preserve context for audits and future reference.

Appendices

Monitoring Dashboards and Alerting Thresholds

  • The cluster dashboard JSON (k8s-cluster.json) defines unit conversions, thresholds, and display modes for its panels
  • Alerting configurations are templated and rendered via Helm values (an illustrative threshold fragment follows)
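
For orientation, a typical Grafana panel threshold fragment has the shape below; the unit and step values are invented examples rather than values taken from k8s-cluster.json:

```json
{
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      }
    }
  }
}
```

Helm templates such as _config.tpl can inject values of this kind from chart values at render time.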
