Knowledge Graph and Semantic Search
Table of Contents
- Introduction
- Project Structure
- Core Components
- Architecture Overview
- Detailed Component Analysis
- Dependency Analysis
- Performance Considerations
- Troubleshooting Guide
- Conclusion
- Appendices
Introduction
This document explains the knowledge graph and semantic search capabilities implemented in the bi-chat module. It covers Neo4j integration for graph storage and traversal, Milvus-based vector search for entity and metric retrieval, and the hybrid knowledge retrieval pipeline that combines structured graph data with unstructured text. It also documents the graph schema for business metrics and entities, the ontology management system, and practical guidance for performance optimization and troubleshooting.
Project Structure
The knowledge graph and semantic search features are primarily implemented under the bi-chat Python package. Key areas include:
- Graph database client and schema definitions for Neo4j
- Ontology population and retrieval logic
- Vector database integration with Milvus
- Agent and tool integrations for knowledge queries
- Configuration for external systems (Neo4j, Milvus, PostgreSQL, LLM providers)
Section sources
- [knowledge_agent.py]
- [knowledge.py]
- [retriever.py]
- [vector_db.py]
- [neo4j_client.py]
- [graph_schema.py]
- [models.py]
- [config.py]
- [knowledge_search_tool.py]
Core Components
- Neo4j client: Provides connection management, query execution, write operations, transactions, health checks, and a lazily initialized global client.
- Graph schema: Defines node types and relationship types used across the knowledge graph.
- Ontology retriever: Orchestrates semantic search via Milvus and graph expansion via Neo4j to produce contextual answers.
- Milvus helper: Manages collections for indicators and entities, handles indexing and vector search.
- Data models: SQLAlchemy models for indicators and entities used alongside graph and vector stores.
- Configuration: Centralized settings for Neo4j, Milvus, PostgreSQL, Redis, LLM provider, and embedding model/dimension.
- Tools and agents: Public API tool for external knowledge and internal tools for entity/indicator retrieval.
Section sources
- [neo4j_client.py]
- [graph_schema.py]
- [retriever.py]
- [vector_db.py]
- [models.py]
- [config.py]
- [knowledge.py]
- [knowledge_agent.py]
- [knowledge_search_tool.py]
Architecture Overview
The system integrates three pillars:
- Structured graph (Neo4j): Stores business domains, entities, metrics, and their relationships.
- Semantic vectors (Milvus): Indexes indicators and entities for similarity search.
- Hybrid retrieval: Uses vector search to retrieve candidates, then enriches with graph traversal and schema context.
Detailed Component Analysis
Neo4j Integration
- Client lifecycle: Lazy initialization, context manager support, health checks, and transactional writes.
- Query patterns: Read queries return records as dictionaries; write operations use explicit sessions and transactions; graph expansion uses Cypher with variable-length relationship patterns.
- Connection management: Reads from centralized settings; supports environment overrides.
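The lifecycle described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `StubDriver`, `GraphClient`, and `get_client` are hypothetical names, the default URI is an assumption, and the stub stands in for the real Neo4j driver so the example runs without a database.

```python
import threading
from typing import Any, Dict, List, Optional


class StubDriver:
    """Hypothetical stand-in for the real Neo4j driver so the sketch runs offline."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self.closed = False

    def run(self, query: str, params: Dict[str, Any]) -> List[Dict[str, Any]]:
        # A real driver would send the query to the server; the stub echoes it.
        return [{"query": query, "params": params}]

    def close(self) -> None:
        self.closed = True


class GraphClient:
    """Creates its driver on first use and closes it when the context exits."""

    def __init__(self, uri: str) -> None:
        self._uri = uri
        self._driver: Optional[StubDriver] = None

    def _ensure_driver(self) -> StubDriver:
        if self._driver is None:  # lazy initialization
            self._driver = StubDriver(self._uri)
        return self._driver

    def execute_query(
        self, query: str, params: Optional[Dict[str, Any]] = None
    ) -> List[Dict[str, Any]]:
        # Read queries return records as plain dictionaries.
        return self._ensure_driver().run(query, params or {})

    def health_check(self) -> bool:
        try:
            return len(self.execute_query("RETURN 1 AS ok")) == 1
        except Exception:
            return False

    def close(self) -> None:
        if self._driver is not None:
            self._driver.close()
            self._driver = None

    def __enter__(self) -> "GraphClient":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


_client: Optional[GraphClient] = None
_client_lock = threading.Lock()


def get_client(uri: str = "bolt://localhost:7687") -> GraphClient:
    """Return the shared client, creating it on the first call only."""
    global _client
    with _client_lock:
        if _client is None:
            _client = GraphClient(uri)
        return _client
```

Calling `get_client()` repeatedly returns the same object, and no driver is created until the first query or health check runs.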
Graph Schema Design
- Node types: BusinessDomain, SubDomain, BusinessEntity, Metric, Tool, FixedSQLTool, PhysicalTable, Column.
- Relationship types: HAS_SUBDOMAIN, HAS_ENTITY, HAS_METRIC, HAS_TOOL, IMPLEMENTED_BY, RELATED_TO, HAS_COLUMN.
- These types guide graph population and traversal for business-aware expansion.
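One way to keep these type names consistent between population and traversal code is to hold them in enumerations. The sketch below is an assumption about how that could look, not the contents of graph_schema.py: the class names, the `merge_edge_cypher` helper, and the `name` property used for matching are all hypothetical.

```python
from enum import Enum


class NodeType(str, Enum):
    BUSINESS_DOMAIN = "BusinessDomain"
    SUB_DOMAIN = "SubDomain"
    BUSINESS_ENTITY = "BusinessEntity"
    METRIC = "Metric"
    TOOL = "Tool"
    FIXED_SQL_TOOL = "FixedSQLTool"
    PHYSICAL_TABLE = "PhysicalTable"
    COLUMN = "Column"


class RelType(str, Enum):
    HAS_SUBDOMAIN = "HAS_SUBDOMAIN"
    HAS_ENTITY = "HAS_ENTITY"
    HAS_METRIC = "HAS_METRIC"
    HAS_TOOL = "HAS_TOOL"
    IMPLEMENTED_BY = "IMPLEMENTED_BY"
    RELATED_TO = "RELATED_TO"
    HAS_COLUMN = "HAS_COLUMN"


def merge_edge_cypher(src: NodeType, rel: RelType, dst: NodeType) -> str:
    """Build a parameterized Cypher statement that links two existing nodes.

    Labels and relationship types cannot be passed as Cypher parameters,
    so they are interpolated from the enums; node names stay parameters.
    """
    return (
        f"MATCH (a:{src.value} {{name: $src_name}}), "
        f"(b:{dst.value} {{name: $dst_name}}) "
        f"MERGE (a)-[:{rel.value}]->(b)"
    )
```

Because only enum members are interpolated into the query text, arbitrary user input never reaches the label or relationship position.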
Ontology Population and Management
- SQL parsing: Extracts table definitions, comments, and column metadata from DDL.
- Graph synchronization: Creates or updates table entities and links them to business entities based on naming heuristics.
- Persistence: Uses Neo4j MERGE semantics to avoid duplicates and updates timestamps.
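The parsing and persistence steps above can be illustrated with a small sketch. It assumes a simple MySQL-style DDL with one column per line and inline COMMENT clauses; the regular expressions, the sample table, the function names, and the `comment` and `updated_at` property names are illustrative assumptions, and the real parser may handle far more syntax.

```python
import re
from typing import Any, Dict, List

SAMPLE_DDL = """
CREATE TABLE order_detail (
    order_id BIGINT COMMENT 'Order identifier',
    amount DECIMAL(10,2) COMMENT 'Order amount',
    created_at TIMESTAMP
) COMMENT 'Order line items';
"""

TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+(\w+)\s*\((.*)\)\s*(?:COMMENT\s+'([^']*)')?\s*;",
    re.IGNORECASE | re.DOTALL,
)
COLUMN_RE = re.compile(
    r"^\s*(\w+)\s+(\w+(?:\([^)]*\))?)(?:\s+COMMENT\s+'([^']*)')?\s*,?\s*$",
    re.IGNORECASE,
)

# MERGE matches an existing node by name or creates it, so re-running the
# sync does not duplicate tables; SET refreshes the comment and timestamp.
MERGE_TABLE_CYPHER = (
    "MERGE (t:PhysicalTable {name: $name}) "
    "SET t.comment = $comment, t.updated_at = datetime()"
)


def parse_table(ddl: str) -> Dict[str, Any]:
    """Extract the table name, table comment, and column metadata from one DDL statement."""
    match = TABLE_RE.search(ddl)
    if match is None:
        raise ValueError("no CREATE TABLE statement found")
    name, body, comment = match.group(1), match.group(2), match.group(3)
    columns: List[Dict[str, str]] = []
    for line in body.splitlines():
        col = COLUMN_RE.match(line)
        if col:
            columns.append(
                {"name": col.group(1), "type": col.group(2), "comment": col.group(3) or ""}
            )
    return {"name": name, "comment": comment or "", "columns": columns}


def table_merge_params(table: Dict[str, Any]) -> Dict[str, str]:
    """Parameters to send alongside MERGE_TABLE_CYPHER."""
    return {"name": table["name"], "comment": table["comment"]}
```

Passing `parse_table(SAMPLE_DDL)` through `table_merge_params` yields the parameter dictionary for the MERGE statement; columns would be merged the same way and attached with HAS_COLUMN.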
Semantic Search with Milvus
- Collections: Separate collections for indicators and entities, both using the Strong consistency level.
- Indexing: IVF_FLAT index with configurable nlist; metric type L2.
- Search: Loads collection, executes vector search with nprobe, and returns hits with entity identifiers.
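The index and search settings above map onto two small parameter dictionaries in the shape the Milvus Python SDK accepts. The sketch below builds them and adds a brute-force L2 ranking in pure Python to show what the search computes; the helper names, the default `nlist` of 128, and the sample vectors are assumptions for illustration only.

```python
from typing import Any, Dict, List, Sequence


def ivf_flat_index_params(nlist: int = 128) -> Dict[str, Any]:
    """Index parameters for an IVF_FLAT index with the L2 metric."""
    return {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": nlist}}


def ivf_search_params(nprobe: int, nlist: int = 128) -> Dict[str, Any]:
    """Search parameters; nprobe is the number of clusters scanned per query."""
    if not 1 <= nprobe <= nlist:
        raise ValueError("nprobe must be between 1 and nlist")
    return {"metric_type": "L2", "params": {"nprobe": nprobe}}


def l2_top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Exact nearest neighbours by squared L2 distance (smaller is closer).

    IVF_FLAT approximates this ranking by scanning only nprobe clusters.
    """
    scored = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), name)
        for name, vec in vectors.items()
    ]
    return [name for _, name in sorted(scored)[:k]]


# Toy two-dimensional "embeddings" keyed by indicator name.
INDICATORS = {
    "revenue": [0.9, 0.1],
    "churn": [0.0, 1.0],
    "gross_margin": [0.7, 0.2],
}
nearest = l2_top_k([1.0, 0.0], INDICATORS, k=2)
```

Setting `nprobe` equal to `nlist` makes the IVF search scan every cluster, which matches the exhaustive ranking at the cost of latency.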
Knowledge Retrieval Pipeline
- Embedding generation: Async OpenAI-compatible client used to embed queries.
- Vector recall: Milvus search returns candidate names for indicators and entities.
- Graph expansion: Cypher traversal expands seeds into business domain context, sibling entities, and related physical tables.
- Context assembly: Formats indicator definitions and entity schemas; aggregates domain context and relationships.
- Agent response: KnowledgeAgent composes structured answers using retrieved context.
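The stages above can be sketched end to end with in-memory stand-ins. Everything here is hypothetical: `embed`, `vector_recall`, `expand`, and `assemble_context` are stubs that replace the embedding API, the Milvus search, and the Cypher traversal, and the sample graph data is invented purely to make the flow runnable.

```python
import asyncio
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for the graph store.
GRAPH: Dict[str, Dict[str, Any]] = {
    "Order": {"domain": "Sales", "siblings": ["Customer"], "tables": ["order_detail"]},
    "Customer": {"domain": "Sales", "siblings": ["Order"], "tables": ["customer_info"]},
}


async def embed(text: str) -> List[float]:
    """Stub embedding; a real system awaits an OpenAI-compatible embeddings call."""
    await asyncio.sleep(0)  # yield control as a network call would
    return [float(len(text)), float(text.count(" "))]


def vector_recall(vector: List[float], top_k: int = 1) -> List[str]:
    """Stub for the vector search; returns fixed candidate entity names."""
    return ["Order", "Customer"][:top_k]


def expand(seeds: List[str]) -> List[Dict[str, Any]]:
    """Stub for graph expansion: look up domain, siblings, and tables per seed."""
    return [{"entity": seed, **GRAPH[seed]} for seed in seeds if seed in GRAPH]


def assemble_context(rows: List[Dict[str, Any]]) -> str:
    """Format the expanded rows into text that can be handed to the agent."""
    return "\n".join(
        f"{row['entity']} (domain: {row['domain']}; "
        f"siblings: {', '.join(row['siblings'])}; "
        f"tables: {', '.join(row['tables'])})"
        for row in rows
    )


async def retrieve(query: str, top_k: int = 1) -> str:
    vector = await embed(query)           # 1. embedding generation
    seeds = vector_recall(vector, top_k)  # 2. vector recall
    rows = expand(seeds)                  # 3. graph expansion
    return assemble_context(rows)         # 4. context assembly


context = asyncio.run(retrieve("monthly order revenue"))
```

Only the embedding step is awaited in this sketch; in a real deployment the vector and graph calls may also be asynchronous.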
External Knowledge Augmentation
- Public API tool: Wikipedia search with caching and fallback between Chinese and English.
- Use case: Enrich answers with authoritative external knowledge when domain-specific graph coverage is insufficient.
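A cache combined with a language fallback could look roughly like the sketch below. It makes several assumptions: the class name is hypothetical, Chinese is tried before English (the source does not state the order), the cache is a plain in-process dictionary, and `fake_fetch` replaces the real Wikipedia request so the example runs offline.

```python
from typing import Callable, Dict, List, Optional, Sequence


class CachedKnowledgeSearch:
    """Tries each language in order and remembers the first successful answer."""

    def __init__(
        self,
        fetch: Callable[[str, str], Optional[str]],
        languages: Sequence[str] = ("zh", "en"),  # assumed fallback order
    ) -> None:
        self._fetch = fetch
        self._languages = tuple(languages)
        self._cache: Dict[str, str] = {}

    def search(self, term: str) -> Optional[str]:
        key = term.strip().lower()
        if key in self._cache:  # cache hit: no network call
            return self._cache[key]
        for lang in self._languages:
            summary = self._fetch(term, lang)
            if summary:
                self._cache[key] = summary
                return summary
        return None  # misses are not cached, so a later retry can succeed


calls: List[str] = []


def fake_fetch(term: str, lang: str) -> Optional[str]:
    """Hypothetical fetcher: pretends only the English edition has the page."""
    calls.append(lang)
    return f"{term}: summary from the {lang} edition" if lang == "en" else None


tool = CachedKnowledgeSearch(fake_fetch)
first = tool.search("Gross margin")
second = tool.search("gross margin")  # served from the cache
```

After both lookups, `calls` contains one attempt per language from the first search only, which shows that the second search never reached the fetcher.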
Dependency Analysis
- Configuration-driven: All clients read from centralized settings, enabling environment-specific overrides.
- Client coupling: OntologyRetriever depends on MilvusHelper, Neo4jClient, and SQLAlchemy models.
- Cohesion: Each component encapsulates a single responsibility: Neo4j for graph storage, Milvus for vectors, and the retriever for orchestration.
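Configuration-driven clients are commonly built around one settings object that reads environment overrides. The sketch below shows that pattern under stated assumptions: the field names, the environment variable names, and the default values (including the embedding dimension of 1536) are illustrative and are not taken from config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Defaults are assumptions for local development.
    neo4j_uri: str = "bolt://localhost:7687"
    neo4j_database: str = "neo4j"
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    embedding_dim: int = 1536

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Build settings from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            neo4j_uri=env.get("NEO4J_URI", defaults.neo4j_uri),
            neo4j_database=env.get("NEO4J_DATABASE", defaults.neo4j_database),
            milvus_host=env.get("MILVUS_HOST", defaults.milvus_host),
            milvus_port=int(env.get("MILVUS_PORT", defaults.milvus_port)),
            embedding_dim=int(env.get("EMBEDDING_DIM", defaults.embedding_dim)),
        )


# Passing a mapping instead of os.environ keeps the example deterministic.
staging = Settings.from_env({"NEO4J_URI": "bolt://graph.staging:7687", "MILVUS_PORT": "19531"})
```

Each client can then receive the same `Settings` instance in its constructor, which keeps environment handling in one place.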
Section sources
- [config.py]
- [retriever.py]
- [neo4j_client.py]
- [vector_db.py]
- [models.py]
- [knowledge_agent.py]
- [knowledge.py]
- [knowledge_search_tool.py]
Performance Considerations
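- Milvus index tuning: Adjust nlist (set when the index is built) and nprobe (set per search) to balance recall and latency; a higher nprobe scans more clusters, which improves recall but slows search. Keep the metric type identical for indexing and search.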
- Embedding model and dimension: Ensure embedding dimensions match collection schema to avoid mismatches.
- Graph traversal limits: Cypher queries include LIMIT clauses to cap result sets; tune for domain size and performance.
- Connection pooling and reuse: Neo4j driver manages connections; avoid frequent reconnects by reusing the global client.
- Asynchronous embedding: Use async embedding calls to overlap I/O with computation.
- Caching: Leverage built-in caching for external knowledge retrieval to reduce repeated network calls.
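The benefit of asynchronous embedding comes from issuing several requests before waiting on any of them. The sketch below demonstrates this with `asyncio.gather`; `fake_embed` is a hypothetical stand-in that sleeps instead of calling an embedding API, and the `stats` dictionary exists only to show how many calls were in flight at once.

```python
import asyncio
from typing import Dict, List

stats: Dict[str, int] = {"in_flight": 0, "peak": 0}


async def fake_embed(text: str) -> List[float]:
    """Hypothetical embedding call; the sleep simulates network latency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return [float(len(text))]


async def embed_all(texts: List[str]) -> List[List[float]]:
    # gather starts every coroutine before awaiting results and
    # returns them in the same order as the inputs.
    return list(await asyncio.gather(*(fake_embed(t) for t in texts)))


vectors = asyncio.run(embed_all(["revenue", "churn rate", "order"]))
```

With three inputs, `stats["peak"]` reaches 3, meaning all requests were waiting concurrently rather than one after another. A real client would usually also cap concurrency to respect provider rate limits.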
Troubleshooting Guide
- Neo4j connectivity: Verify URI, credentials, and database name; use health checks to confirm connectivity.
- Milvus dimension mismatch: If the collection's vector dimension differs from the configured embedding dimension, inserts and searches fail; migrate the schema instead of dropping data.
- Empty results: For entity search, ensure vector embeddings exist and the Milvus collection is loaded; for graph expansion, confirm seed entities exist and relationships are populated.
- Configuration issues: Confirm environment variables for Neo4j, Milvus, and embedding model/dimensions are set correctly.
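A dimension mismatch is easier to diagnose when it is caught before the vector reaches the database. The guard below is a minimal sketch; the function name is hypothetical, and the collection dimension is assumed to be known from configuration or from the collection schema.

```python
from typing import Sequence


def check_embedding_dim(embedding: Sequence[float], collection_dim: int) -> None:
    """Raise a descriptive error when the vector length does not match the collection."""
    if len(embedding) != collection_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions "
            f"but the collection expects {collection_dim}"
        )


# A matching vector passes silently.
check_embedding_dim([0.0] * 4, collection_dim=4)
```

Running this check at startup with one sample embedding surfaces a misconfigured model or dimension setting before any search is attempted.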
Conclusion
The bi-chat knowledge graph and semantic search system combines a structured Neo4j graph with Milvus vector embeddings to deliver context-rich answers. Vector recall provides broad semantic relevance, and graph expansion adds domain-aware enrichment. Modular components and centralized configuration keep graph storage, vector search, and retrieval orchestration cleanly separated while supporting knowledge retrieval across business metrics and entities.
Appendices
Example Queries and Patterns
- Business domain-centric expansion: Seed with a business entity; expand to sibling entities within the same domain and related physical tables.
- Metric definition retrieval: Use indicator embeddings to find semantically similar metrics; enrich with definitions and formulas.
- Hybrid search: Combine vector candidates with graph-derived relationships and schema context for comprehensive answers.
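The domain-centric expansion pattern could be expressed in Cypher roughly as shown below. The node labels and relationship types come from the schema in this document, but the relationship directions, the `name` property, the path length of one to two hops, and the `build_expansion_query` helper are assumptions rather than the retriever's actual query.

```python
from typing import Any, Dict, List, Tuple

# Assumed shape: a domain reaches entities directly or through one subdomain,
# and entities point to the physical tables that implement them.
EXPANSION_CYPHER = """
MATCH (d:BusinessDomain)-[:HAS_SUBDOMAIN|HAS_ENTITY*1..2]->(seed:BusinessEntity)
WHERE seed.name IN $seeds
OPTIONAL MATCH (d)-[:HAS_SUBDOMAIN|HAS_ENTITY*1..2]->(sibling:BusinessEntity)
WHERE sibling <> seed
OPTIONAL MATCH (seed)-[:IMPLEMENTED_BY]->(t:PhysicalTable)
RETURN d.name AS domain,
       seed.name AS entity,
       collect(DISTINCT sibling.name) AS siblings,
       collect(DISTINCT t.name) AS tables
LIMIT $limit
""".strip()


def build_expansion_query(seeds: List[str], limit: int = 20) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and its parameters; seed names are never interpolated."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return EXPANSION_CYPHER, {"seeds": list(seeds), "limit": limit}


query, params = build_expansion_query(["Order"], limit=10)
```

The resulting `query` and `params` pair can be passed to whatever read method the graph client exposes; the LIMIT parameter caps the result set as described in the performance guidance.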