Troubleshooting and Maintenance
Table of Contents
- Introduction
- Project Structure
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Dependency Analysis
- Performance Considerations
- Troubleshooting Guide
- Maintenance Procedures
- Incident Response and Recovery
- Conclusion
- Appendices
Introduction
This document provides comprehensive troubleshooting and maintenance guidance for the BI Analysis Platform. It covers service connectivity, performance bottlenecks, and data synchronization issues across microservices, databases, and messaging systems. It also explains monitoring and observability approaches, log analysis techniques, and diagnostic procedures. Guidance is organized by operational scenario with actionable checklists and diagrams mapped to actual source files.
Project Structure
The platform is composed of multiple Go microservices under the bi-* packages, each bootstrapped via Kratos and configured via Nacos. Observability is centralized through a shared logger, while messaging and persistence rely on Kafka and GORM-based clients. The bi-analysis service demonstrates the standard startup flow and configuration loading pattern used across services.
Diagram sources
- [main.go]
- [application-dev.yaml]
- [application-prod.yaml]
- [logger.go]
- [config.go]
- [client.go]
- [config.go]
- [consumer.go]
- [config.go]
Core Components
- Bootstrap and configuration: The service loads environment-specific Nacos configuration and initializes logging and registry.
- Logging: Multi-output logger supporting stdout, file rotation, and optional Aliyun forwarding.
- Registry (Nacos): Centralized configuration and service discovery client with validation and defaults.
- Messaging (Kafka): Robust consumer with graceful shutdown, health checks, and stats.
- Persistence (GORM): Structured database configuration with connection pooling, TLS, and slow query logging.
Architecture Overview
The bi-analysis service follows a standard Kratos bootstrap pattern:
- Parse environment flag
- Load Nacos configuration source
- Scan into Bootstrap config
- Initialize logger and registry
- Wire application with servers and services
- Run until stop signal
Detailed Component Analysis
Service Bootstrapping and Configuration Loading
- Environment selection: The -env flag selects application-dev.yaml or application-prod.yaml.
- Nacos configuration source: Loads multiple DataIds per environment.
- Bootstrap scanning: YAML is scanned into a typed configuration structure.
- Logger initialization: Supports JSON/text output, file rotation, and Aliyun forwarding.
- Registry wiring: Creates a registry client for service discovery.
Logging and Observability
- Multi-output logger: stdout, file, or both; supports JSON and text formats.
- File rotation: Size-based (Lumberjack) or daily rotation with retention.
- Aliyun forwarding: Optional integration with structured error handling.
- Cleanup: Ensures resources are closed on shutdown.
Operational tips:
- Increase verbosity temporarily for diagnostics using the log level setting.
- Enable JSON format for centralized log parsing.
- Monitor disk usage for rotated files.
Nacos Registry and Configuration
- Client creation: Validates host/port and applies defaults.
- Naming and config clients: Lazy-initialized with thread-safe guards.
- Config retrieval: Supports single or multiple DataIds.
- Advanced options: Cache behavior, snapshot usage, and thread count.
Common issues:
- Invalid server address or port leads to immediate failure during client creation.
- Missing DataIds or wrong group/namespace causes empty config load.
- Authentication mismatch requires correct credentials in environment or config.
Kafka Consumer Diagnostics
- Health check: Connects to broker and lists brokers to validate connectivity.
- Modes: Consumer group mode (recommended) and single-partition mode.
- Graceful shutdown: Listens for OS signals and waits up to 30 seconds before force-close.
- Stats and lag: Exposes reader stats and lag for monitoring.
Operational tips:
- Use consumer group mode for horizontal scaling.
- Monitor lag to detect backpressure or stuck consumers.
- Enable partition change watching for topics created dynamically.
Database Connectivity and Performance
- Driver support: MySQL and StarRocks with optimized defaults.
- Connection pool tuning: Open/idle counts and lifetimes.
- Slow query logging: Threshold-based warnings.
- TLS: Optional custom or system trust with SNI support.
Operational tips:
- Adjust pool sizes based on workload concurrency.
- Enable slow query logging in staging to identify hotspots.
- For StarRocks, ensure client-side parameter interpolation is enabled in the DSN (with a go-sql-driver/mysql-style DSN this is `interpolateParams=true`), since server-side prepared statements are not reliably supported.
Dependency Analysis
The bi-analysis service depends on shared modules for logging, registry, messaging, and database. Coupling is low due to configuration-driven initialization and modular clients.
Performance Considerations
- Logging overhead: Prefer file rotation with appropriate max size and age to avoid frequent rotations.
- Kafka throughput: Tune batch size/timeouts, compression, and balancing; monitor lag and consumer group rebalances.
- Database pools: Match MaxOpenConns to CPU and network capacity; set ConnMaxLifetime to prevent stale connections.
- Nacos latency: Tune timeout_ms conservatively and verify the namespace/group so that lookups do not miss and trigger retries.
Troubleshooting Guide
Service Connectivity Problems
Symptoms:
- Application fails to start or exits immediately.
- No configuration loaded from Nacos.
Checklist:
- Verify -env flag matches existing application-*.yaml.
- Confirm Nacos server address/port and context path.
- Ensure DataIds exist in the selected namespace/group.
- Validate authentication credentials if enabled.
Diagnostics:
- Inspect logger output for early panic or error messages.
- Use Nacos client’s built-in validation and error propagation.
Performance Bottlenecks
Symptoms:
- High consumer lag, delayed message processing.
- Database timeouts or slow queries.
Checklist:
- Review Kafka consumer stats and lag.
- Inspect database slow query logs and pool utilization.
- Evaluate compression and batching settings.
Diagnostics:
- Use consumer Stats() and Lag() to quantify backlog.
- Enable GORM slow query logging and adjust thresholds.
Data Synchronization Errors
Symptoms:
- Messages processed but offsets not committed.
- Consumers stuck at specific partitions.
Checklist:
- Confirm commit intervals and isolation level.
- Validate consumer group configuration and rebalance timeouts.
- Check partition change watching for dynamic topics.
Diagnostics:
- Use Ping() to verify broker connectivity.
- Enable partition change watching and monitor logs.
Microservices Debugging Strategies
- Capture startup logs and configuration scan results.
- Temporarily increase log level for targeted debugging.
- Use graceful shutdown signals to observe cleanup behavior.
Database Performance Issues
- Tune connection pool parameters based on observed concurrency.
- Enable TLS appropriately; verify certificates and SNI.
- Monitor slow queries and optimize heavy queries.
AI Component Failures (Conceptual)
- Validate model endpoints and authentication.
- Monitor latency and error rates; enable retries with backoff.
- Use circuit breaker patterns to protect downstream services.
Maintenance Procedures
- Configuration updates: Publish new DataIds to Nacos; verify listeners and rolling restarts.
- Logging rotation: Adjust file size/age; monitor disk usage.
- Kafka maintenance: Rotate brokers, update ACLs, and validate topic configurations.
- Database maintenance: Apply schema migrations, tune pools, and rotate slow query logs.
Incident Response and Recovery
Escalation path:
- Tier 1: Validate connectivity (Nacos, Kafka, DB).
- Tier 2: Inspect logs and metrics; roll back recent configuration changes.
- Tier 3: Deep-dive into consumer lag, database locks, or AI endpoint failures.
Recovery strategies:
- Graceful shutdown with timeout to drain in-flight work.
- Rollback to previous Nacos DataId versions.
- Recreate consumers with corrected group/topic/partition settings.
Conclusion
This guide consolidates practical steps to troubleshoot and maintain the BI Analysis Platform. By leveraging the diagnostics built into the shared modules (Nacos configuration, Kafka consumers, and the GORM database client), you can quickly isolate issues and apply targeted fixes. Adopt the checklists and procedures here to streamline incident response and reduce downtime.
Appendices
Diagnostic Tools and Commands
- Nacos: Retrieve and publish configuration programmatically; listen for changes.
- Kafka: Health ping, stats, lag inspection, and graceful shutdown.
- Database: DSN verification, pool tuning, and slow query threshold adjustment.