CocoIndex: Turn Your Docs into a Smart Knowledge Graph with AI & Neo4j

CocoIndex builds knowledge graphs from documents using LLMs: it extracts entities and relationships, then exports them to Neo4j for querying.

Building Knowledge Graphs from Documents with LLM using CocoIndex

Key Concepts:

CocoIndex: A tool for building knowledge graphs and keeping them up to date as the source documents change.
LLM (Large Language Model): Used to extract relationships between concepts in documents.
Knowledge Graph: A structured representation of knowledge, showing relationships between entities.

Types of Relationships Extracted:

Subject-Object Relationships: E.g., "CocoIndex supports Incremental Processing."
Entity Mentions: E.g., "core/basics.mdx" mentions CocoIndex and Incremental Processing.
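
To make the two relationship types concrete, here is a minimal plain-Python sketch of their data shapes. The Relationship fields follow the (subject, predicate, object) form described in this article; the Mention class and the mentions_for helper are hypothetical names used only for illustration, and de-duplicating entities is an assumption rather than something CocoIndex is stated to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One extracted statement in subject-predicate-object form."""
    subject: str
    predicate: str
    object: str


@dataclass(frozen=True)
class Mention:
    """Hypothetical record: a document refers to an entity."""
    filename: str
    entity: str


def mentions_for(filename: str, relationships: list[Relationship]) -> list[Mention]:
    """Derive one mention per distinct entity, in first-seen order."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    for rel in relationships:
        seen.setdefault(rel.subject)
        seen.setdefault(rel.object)
    return [Mention(filename, entity) for entity in seen]


rels = [Relationship("CocoIndex", "supports", "Incremental Processing")]
mentions = mentions_for("core/basics.mdx", rels)
```

With the example sentence from above, the single relationship yields two mentions for core/basics.mdx: one for CocoIndex and one for Incremental Processing.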

Prerequisites:

PostgreSQL (for CocoIndex's incremental processing)
Neo4j (graph database)
OpenAI API key (or Ollama for local LLM models)

Data Flow:

Add Documents as Source: CocoIndex documentation markdown files (.md, .mdx).
Add Data Collectors:
- document_node: Collects documents (e.g., core/basics.mdx).
- entity_relationship: Collects relationships (e.g., "CocoIndex supports Incremental Processing").
- entity_mention: Collects mentions of entities in a document (e.g., core/basics.mdx mentions CocoIndex).
Process Each Document and Extract Summary: Use cocoindex.functions.ExtractByLlm with gpt-4o to get document title and summary.
Extract Relationships: Use cocoindex.functions.ExtractByLlm to extract relationships (subject, predicate, object) from the document.
Collect Relationships: For each extracted relationship, collect the subject-object relationship itself and record that the document mentions the entities involved.
Build Knowledge Graph: Export nodes and relationships to Neo4j.
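
The data flow above can be sketched in plain Python without CocoIndex or an LLM. This is only a simulation under stated assumptions: fake_extract is a hypothetical stand-in for the two ExtractByLlm calls, the three lists stand in for the collectors, and stable_id is an illustrative way to produce the id field (the real flow may generate IDs differently). The row field names mirror the ones used in the export snippets later in this article.

```python
import uuid
from dataclasses import dataclass


@dataclass
class DocumentSummary:
    title: str
    summary: str


def stable_id(*parts: str) -> str:
    """Deterministic ID: the same input always produces the same row ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_OID, "|".join(parts)))


def fake_extract(content: str):
    """Hypothetical stand-in for the LLM calls; returns fixed output."""
    summary = DocumentSummary(title="Basics", summary="Introduces core concepts.")
    triples = [("CocoIndex", "supports", "Incremental Processing")]
    return summary, triples


def run_flow(documents: dict[str, str]):
    # Three lists play the role of the three collectors.
    document_node, entity_relationship, entity_mention = [], [], []
    for filename, content in documents.items():
        summary, triples = fake_extract(content)
        document_node.append(
            {"filename": filename, "title": summary.title, "summary": summary.summary})
        for subject, predicate, obj in triples:
            entity_relationship.append({
                "id": stable_id(filename, subject, predicate, obj),
                "subject": subject, "predicate": predicate, "object": obj})
            # Both ends of the relationship count as mentioned by the document.
            for entity in (subject, obj):
                entity_mention.append({
                    "id": stable_id(filename, subject, predicate, obj, entity),
                    "entity": entity, "filename": filename})
    return document_node, entity_relationship, entity_mention


docs, rels, mentions = run_flow({"core/basics.mdx": "(document text)"})
```

Each list then maps onto one export: docs becomes Document nodes, rels becomes RELATIONSHIP edges between Entity nodes, and mentions becomes MENTION edges from Document to Entity.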

Building the Knowledge Graph in Neo4j:

Nodes: Represent entities (e.g., Document, Entity). Each node needs a label and a primary key field.
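Relationships: Connect two nodes. Each relationship needs a type (e.g., RELATIONSHIP or MENTION), a source node, and a target node.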
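
The role of the label and primary key can be illustrated with a tiny in-memory graph. This is a sketch of the idea only, assuming nodes are matched by (label, primary key) so that the same entity seen twice ends up as one node; TinyGraph and its methods are hypothetical and are not how Neo4j or CocoIndex is implemented.

```python
class TinyGraph:
    """Toy graph: nodes are identified by (label, primary key value)."""

    def __init__(self):
        self.nodes = {}           # (label, key) -> properties
        self.relationships = []   # (rel_type, source_id, target_id)

    def merge_node(self, label, key, **props):
        """Create the node if it is new, otherwise update the existing one."""
        node_id = (label, key)
        self.nodes.setdefault(node_id, {}).update(props)
        return node_id

    def relate(self, rel_type, source, target):
        """Connect two nodes, skipping exact duplicates."""
        edge = (rel_type, source, target)
        if edge not in self.relationships:
            self.relationships.append(edge)


graph = TinyGraph()
doc = graph.merge_node("Document", "core/basics.mdx")
# "CocoIndex" appears twice but is merged into a single Entity node.
for name in ("CocoIndex", "Incremental Processing", "CocoIndex"):
    entity = graph.merge_node("Entity", name)
    graph.relate("MENTION", doc, entity)
```

After this runs, the graph holds three nodes (one Document, two Entity) and two MENTION relationships, which is why a stable primary key matters for every node label.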

Exporting to Neo4j:

Configure Neo4j Connection: Specify URI, user, and password.
Export Document Nodes: Export from the document_node collector, using filename as the primary key.
Export RELATIONSHIP and Entity Nodes: Declare Entity nodes and then export relationships from the entity_relationship collector.
Export entity_mention: Creates Document nodes and Entity nodes and connects them with MENTION relationships.

Code Snippets:

Adding a source:

flow_builder.add_source(cocoindex.sources.LocalFile(path="../../docs/docs/core", included_patterns=["*.md", "*.mdx"]))

Extracting summary:

doc["summary"] = doc["content"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
        output_type=DocumentSummary,
        instruction="Please summarize the content of the document."))

Extracting relationships:

doc["relationships"] = doc["content"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"),
        output_type=list[Relationship],
        instruction=(
            "Please extract relationships from CocoIndex documents. "
            "Focus on concepts and ignore examples and code. ")))

Exporting Document nodes to Neo4j:

document_node.export(
    "document_node",
    cocoindex.storages.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.storages.Nodes(label="Document")),
    primary_key_fields=["filename"])

Exporting relationships and entities to Neo4j:

# Map the collected "subject" and "object" fields to the "value" field of Entity nodes.
entity_relationship.export(
    "entity_relationship",
    cocoindex.storages.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.storages.Relationships(
            rel_type="RELATIONSHIP",
            source=cocoindex.storages.NodeFromFields(
                label="Entity",
                fields=[cocoindex.storages.TargetFieldMapping(source="subject", target="value")]),
            target=cocoindex.storages.NodeFromFields(
                label="Entity",
                fields=[cocoindex.storages.TargetFieldMapping(source="object", target="value")]))),
    primary_key_fields=["id"])

Exporting the entity_mention to Neo4j:

# Connect each Document (keyed by filename) to each Entity (keyed by value).
entity_mention.export(
    "entity_mention",
    cocoindex.storages.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.storages.Relationships(
            rel_type="MENTION",
            source=cocoindex.storages.NodeFromFields(
                label="Document",
                fields=[cocoindex.storages.TargetFieldMapping("filename")]),
            target=cocoindex.storages.NodeFromFields(
                label="Entity",
                fields=[cocoindex.storages.TargetFieldMapping(source="entity", target="value")]))),
    primary_key_fields=["id"])

Running the Index:

Install dependencies: pip install -e .

Set up and update the index:

python main.py cocoindex setup
python main.py cocoindex update

Querying the Knowledge Graph:

You can explore the knowledge graph in Neo4j Browser using Cypher queries.

MATCH p=()-->() RETURN p
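
Beyond the catch-all query, more targeted Cypher queries can be written against the labels and relationship types exported above. The sketch below keeps them as Python strings with a $value parameter. The constant names are hypothetical, and returning r.predicate assumes the collected predicate field is stored as a property on each RELATIONSHIP, which this article does not confirm.

```python
# Every path in the graph (the query shown above).
ALL_PATHS = "MATCH p=()-->() RETURN p"

# Documents that mention a given entity.
MENTIONING_DOCS = (
    "MATCH (d:Document)-[:MENTION]->(e:Entity {value: $value}) "
    "RETURN d.filename AS filename"
)

# Outgoing relationships of a given entity.
# Assumption: "predicate" is stored as a relationship property.
OUTGOING_RELATIONSHIPS = (
    "MATCH (s:Entity {value: $value})-[r:RELATIONSHIP]->(o:Entity) "
    "RETURN r.predicate AS predicate, o.value AS object"
)


def query_params(entity: str) -> dict:
    """Build the parameter map expected by the parameterized queries."""
    return {"value": entity}


params = query_params("CocoIndex")
```

In Neo4j Browser, the parameter can be set first with :param value => 'CocoIndex' and the query text pasted afterwards.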