
CocoIndex: Turn Your Docs into a Smart Knowledge Graph with AI & Neo4j
CocoIndex builds knowledge graphs from documents using LLMs, extracting relationships and entities, then exports them to Neo4j for querying.
Building Knowledge Graphs from Documents with LLM using CocoIndex
Key Concepts:
- CocoIndex: A tool for building and maintaining knowledge graphs with continuous source updates.
- LLM (Large Language Model): Used to extract relationships between concepts in documents.
- Knowledge Graph: A structured representation of knowledge, showing relationships between entities.
Types of Relationships Extracted:
- Subject-Object Relationships: E.g., "CocoIndex supports Incremental Processing."
- Entity Mentions: E.g., "core/basics.mdx" mentions CocoIndex and Incremental Processing.
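To make the two relationship types concrete, here is a minimal plain-Python sketch that stores them as tuples and answers simple lookups. The container names (triples, mentions) and both helper functions are hypothetical and are not part of CocoIndex; the data comes from the examples above.

```python
# Subject-object relationships as (subject, predicate, object) triples.
triples = [
    ("CocoIndex", "supports", "Incremental Processing"),
]

# Entity mentions as (document, entity) pairs.
mentions = [
    ("core/basics.mdx", "CocoIndex"),
    ("core/basics.mdx", "Incremental Processing"),
]


def outgoing(subject):
    """Return the (predicate, object) pairs recorded for a subject."""
    return [(p, o) for s, p, o in triples if s == subject]


def documents_mentioning(entity):
    """Return the sorted list of documents that mention an entity."""
    return sorted({doc for doc, ent in mentions if ent == entity})


print(outgoing("CocoIndex"))
print(documents_mentioning("Incremental Processing"))
```

A graph database such as Neo4j serves the same kind of lookup, but with indexing and a query language instead of list scans.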
Prerequisites:
- PostgreSQL (for CocoIndex's incremental processing)
- Neo4j (graph database)
- OpenAI API key (or Ollama for local LLM models)
Data Flow:
- Add Documents as Source: CocoIndex documentation markdown files (.md, .mdx).
- Add Data Collectors:
  - document_node: Collects documents (e.g., core/basics.mdx).
  - entity_relationship: Collects relationships (e.g., "CocoIndex supports Incremental Processing").
  - entity_mention: Collects mentions of entities in a document (e.g., core/basics.mdx mentions CocoIndex).
- Process Each Document and Extract Summary: Use cocoindex.functions.ExtractByLlm with gpt-4o to get the document title and summary.
- Extract Relationships: Use cocoindex.functions.ExtractByLlm to extract relationships (subject, predicate, object) from the document.
- Collect Relationships: Collect relationships between subjects and objects, and mentions of entities in the document.
- Build Knowledge Graph: Export nodes and relationships to Neo4j.
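The collection step above can be sketched in plain Python without CocoIndex. This is an illustration under assumptions, not the library's API: the function collect_rows, the row layouts, the use of uuid4 for the "id" field, and the choice to record both subject and object as mentioned entities are all hypothetical. The DocumentSummary and Relationship fields follow the (title, summary) and (subject, predicate, object) descriptions in the steps above.

```python
import uuid
from dataclasses import dataclass


@dataclass
class DocumentSummary:
    title: str
    summary: str


@dataclass
class Relationship:
    subject: str
    predicate: str
    object: str


def collect_rows(filename, summary, relationships):
    """Mimic the three collectors with plain lists of dicts (hypothetical layout)."""
    document_node = [
        {"filename": filename, "title": summary.title, "summary": summary.summary}
    ]
    entity_relationship = []
    entity_mention = []
    for rel in relationships:
        entity_relationship.append({
            "id": str(uuid.uuid4()),  # assumed: a generated unique key per row
            "subject": rel.subject,
            "predicate": rel.predicate,
            "object": rel.object,
        })
        # Assumed: both ends of a relationship count as entities the document mentions.
        for entity in (rel.subject, rel.object):
            entity_mention.append(
                {"id": str(uuid.uuid4()), "filename": filename, "entity": entity}
            )
    return document_node, entity_relationship, entity_mention


# Placeholder title and summary; the relationship is the example used above.
docs, rels, ments = collect_rows(
    "core/basics.mdx",
    DocumentSummary(title="example title", summary="example summary"),
    [Relationship("CocoIndex", "supports", "Incremental Processing")],
)
print(len(docs), len(rels), len(ments))
```

Each returned list corresponds to one collector, and each is later exported to Neo4j as either nodes or relationships.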
Building the Knowledge Graph in Neo4j:
- Nodes: Represent entities (e.g., Document, Entity). Each node needs a label and a primary key field.
- Relationships: Connect nodes.
Exporting to Neo4j:
- Configure Neo4j Connection: Specify URI, user, and password.
- Export Document Nodes: Export from the document_node collector, using filename as the primary key.
- Export RELATIONSHIP and Entity Nodes: Declare Entity nodes and then export relationships from the entity_relationship collector.
- Export entity_mention: Creates Document nodes and Entity nodes and connects them with MENTION relationships.
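One way to see why every node needs a label and a primary key is to look at the kind of upsert statement such an export could translate to. The sketch below builds parameterized Cypher MERGE strings in plain Python. It is an assumption-based illustration, not the Cypher that CocoIndex actually generates, and both function names are hypothetical.

```python
def merge_node_cypher(label, key_field):
    """Build a MERGE that upserts one node, matched by its primary key field."""
    return f"MERGE (n:{label} {{{key_field}: $key}}) SET n += $props"


def merge_relationship_cypher(rel_type, src_label, src_key, dst_label, dst_key):
    """Build a MERGE that upserts both end nodes and the relationship between them."""
    return (
        f"MERGE (a:{src_label} {{{src_key}: $src}}) "
        f"MERGE (b:{dst_label} {{{dst_key}: $dst}}) "
        f"MERGE (a)-[r:{rel_type} {{id: $id}}]->(b)"
    )


# Labels and field names are taken from the export steps above.
doc_query = merge_node_cypher("Document", "filename")
mention_query = merge_relationship_cypher(
    "MENTION", "Document", "filename", "Entity", "value"
)
print(doc_query)
print(mention_query)
```

Because MERGE matches on the label and key before creating anything, re-running the same export does not duplicate nodes. Labels and field names are interpolated directly into the string here, so they must be trusted constants; only the values are passed as query parameters.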
Code Snippets:
- Adding a source:
    flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="../../docs/docs/core",
            included_patterns=["*.md", "*.mdx"],
        )
    )
- Extracting summary:
    doc["summary"] = doc["content"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
            ),
            output_type=DocumentSummary,
            instruction="Please summarize the content of the document.",
        )
    )
- Extracting relationships:
    doc["relationships"] = doc["content"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4o"
            ),
            output_type=list[Relationship],
            instruction=(
                "Please extract relationships from CocoIndex documents. "
                "Focus on concepts and ignore examples and code. "
            ),
        )
    )
- Exporting Document nodes to Neo4j:
    document_node.export(
        "document_node",
        cocoindex.storages.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.storages.Nodes(label="Document"),
        ),
        primary_key_fields=["filename"],
    )
- Exporting relationships and entities to Neo4j:
    entity_relationship.export(
        "entity_relationship",
        cocoindex.storages.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.storages.Relationships(
                rel_type="RELATIONSHIP",
                # Source node: the Entity whose value is the relationship's subject.
                source=cocoindex.storages.NodeFromFields(
                    label="Entity",
                    fields=[cocoindex.storages.TargetFieldMapping(source="subject", target="value")],
                ),
                # Target node: the Entity whose value is the relationship's object.
                target=cocoindex.storages.NodeFromFields(
                    label="Entity",
                    fields=[cocoindex.storages.TargetFieldMapping(source="object", target="value")],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
- Exporting the entity_mention to Neo4j:
    entity_mention.export(
        "entity_mention",
        cocoindex.storages.Neo4j(
            connection=conn_spec,
            mapping=cocoindex.storages.Relationships(
                rel_type="MENTION",
                # Source node: the Document identified by its filename.
                source=cocoindex.storages.NodeFromFields(
                    label="Document",
                    fields=[cocoindex.storages.TargetFieldMapping("filename")],
                ),
                # Target node: the Entity whose value is the mentioned entity.
                target=cocoindex.storages.NodeFromFields(
                    label="Entity",
                    fields=[cocoindex.storages.TargetFieldMapping(source="entity", target="value")],
                ),
            ),
        ),
        primary_key_fields=["id"],
    )
Running the Index:
- Install dependencies:
    pip install -e .
- Set up and update the index:
    python main.py cocoindex setup
    python main.py cocoindex update
Querying the Knowledge Graph:
You can explore the knowledge graph in Neo4j Browser using Cypher queries.
MATCH p=()-->() RETURN p