tools · article · 7 min
OpenKB - Open Source Knowledge Base System
May 17, 2026
OpenKB — Open LLM Knowledge Base Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB 📑 What is OpenKB OpenKB (Open Knowledge Base) is an open-source system (in CLI) that compiles raw documents into a structured, interlinked wiki-style knowledge base using LLMs, powered by PageIndex for vectorless long document retrieval. The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query. Why not traditional RAG? Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed. Features Broad format support — PDF, Word, Markdown, PowerPoint, HTML, Excel, text, and more via markitdown Scale to long documents — Long and complex documents are handled via PageIndex tree indexing, enabling accurate, vectorless long-context retrieval Native multi-modality — Retrieves and understands figures, tables, and images, not just text Compiled Wiki — LLM manages and compiles your documents into summaries, concept pages, and cross-links, all kept in sync Query — Ask questions (one-off) against your wiki. The LLM navigates your compiled knowledge to answer Interactive Chat — Multi-turn conversations with persisted sessions you can resume across runs Lint — Health checks find contradictions, gaps, orphans, and stale content Watch mode — Drop files into raw/ , wiki updates automatically Obsidian compatible — Wiki is plain .md files with [[wikilinks]] . Open in Obsidian for graph view and browsing 🚀 Getting Started Install pip install openkb Other install options Latest from GitHub: pip install git+https://github.com/VectifyAI/OpenKB.git Install from source (editable, for development): git clone https://github.com/VectifyAI/OpenKB.git cd OpenKB pip install -e . Quick Start # 1. Create a directory for your knowledge base mkdir my-kb && cd my-kb # 2. Initialize the knowledge base openkb init # 3. Add documents openkb add paper.pdf openkb add ~ /papers/ # Add a whole directory # 4. Ask a question openkb query " What are the main findings? " # 5. Or chat interactively openkb chat Set up your LLM OpenKB comes with multi-LLM support (e.g., OpenAI, Claude, Gemini) via LiteLLM (pinned to a safe version ). Set your model during openkb init , or in .openkb/config.yaml , using provider/model LiteLLM format (like anthropic/claude-sonnet-4-6 ). OpenAI models can omit the prefix (like gpt-5.4 ). Create a .env file with your LLM API key: LLM_API_KEY=your_llm_api_key 🧩 How OpenKB Works Architecture raw/ You drop files here │ ├─ Short docs ──→ markitdown ──→ LLM reads full text │ │ ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees │ │ │ ▼ │ Wiki Compilation (using LLM) │ │ ▼ ▼ wiki/ ├── index.md Knowledge base overview ├── log.md Operations timeline ├── AGENTS.md Wiki schema (LLM instructions) ├── sources/ Full-text conversions ├── summaries/ Per-document summaries ├── concepts/ Cross-document synthesis ← the good stuff ├── explorations/ Saved query results └── reports/ Lint reports Short vs. Long Document Handling Short documents Long documents (PDF ≥ 20 pages) Convert markitdown → Markdown PageIndex → tree index + summaries Images Extracted inline (pymupdf) Extracted by PageIndex LLM reads Full text Document trees Result summary + concepts summary + concepts Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, enabling better retrieval from long documents. Knowledge Compilation When you add a document, the LLM: Generates a summary page Reads existing concept pages Creates or updates concepts with cross-document synthesis Updates the index and log A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation. ⚙️ Usage Commands Command Description openkb init Initialize a new knowledge base (interactive) openkb add <file_or_dir> Add documents and compile to wiki openkb remove <doc> Remove a document and clean up its wiki pages, images, registry, and PageIndex state (use --dry-run to preview, --keep-raw / --keep-empty-concepts to retain artifacts) openkb query "question" Ask a question over the knowledge base (use --save to save the answer to wiki/explorations/ ) openkb chat Start an interactive multi-turn chat (use --resume , --list , --delete to manage sessions) openkb watch Watch raw/ and auto-compile new files openkb lint Run structural + knowledge health checks openkb list List indexed documents and concepts openkb status Show knowledge base stats openkb feedback ["msg"] File feedback by opening a prefilled GitHub issue (use --type bug/feature/question to tag the issue) Interactive Chat openkb chat opens an interactive chat session over your wiki knowledge base. Unlike the one-shot openkb query , each turn carries the conversation history, so you can dig into a topic without re-typing context. openkb chat # start a new session openkb chat --resume # resume the most recent session openkb chat --resume 20260411 # resume by id (unique prefix works) openkb chat --list # list all sessions openkb chat --delete < id > # delete a session Inside a chat, type / to access slash commands (Tab to complete): /help — list available commands /status — show knowledge base status /list — list all documents /add <path> — add a document or directory without leaving the chat /save [name] — export the transcript to wiki/explorations/ /clear — start a fresh session (the current one stays on disk) /lint — run knowledge base lint /exit — exit (Ctrl-D also works) Configuration Settings are initialized by openkb init , and stored in .openkb/config.yaml : model : gpt-5.4 # LLM model (any LiteLLM-supported provider) language : en # Wiki output language pageindex_threshold : 20 # PDF pages threshold for PageIndex Model names use provider/model LiteLLM format (OpenAI models can omit the prefix): Provider Model example OpenAI gpt-5.4 Anthropic anthropic/claude-sonnet-4-6 Gemini gemini/gemini-3.1-pro-preview PageIndex Integration Long documents are challenging for LLMs due to context limits, context rot, and summarization loss. PageIndex solves this with vectorless, reasoning-based retrieval — building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval. PageIndex runs locally by default using the open-source version , with no external dependencies required. Optional: Cloud Support For large or complex PDFs, PageIndex Cloud can be used to access additional capabilities, including: OCR support for scanned PDFs (via hosted VLM models) Faster structure generation Scalable indexing for large documents Set PAGEINDEX_API_KEY in your .env to enable cloud features: PAGEINDEX_API_KEY=your_pageindex_api_key AGENTS.md The wiki/AGENTS.md file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized. At runtime, the LLM reads AGENTS.md from disk, so your edits take effect immediately. Using with Obsidian OpenKB's wiki is a directory of Markdown files with [[wikilinks]] . Obsidian renders it natively. Open wiki/ as an Obsidian vault Browse summaries, concepts, and explorations Use graph view to see knowledge connections Use Obsidian Web Clipper to add web articles to raw/ 🧭 Learn More Compared to Karpathy's Approach Karpathy's workflow OpenKB Short documents LLM reads directly markitdown → LLM reads Long documents Context limits, context rot PageIndex tree index Supported formats Web clipper → .md PDF, Word, PPT, Excel, HTML, text, CSV, .md Wiki compilation LLM agent LLM agent (same) Q&A Query over wiki Wiki + PageIndex retrieval The Stack PageIndex — Vectorless, reasoning-based document indexing and retrieval markitdown — Universal file-to-markdown conversion OpenAI Agents SDK — Agent framework (supports non-OpenAI models via LiteLLM) LiteLLM — Multi-provider LLM gateway Click — CLI framework watchdog — Filesystem monitoring Roadmap Extend long document handling to non-PDF formats Scale to large document collections with nested folder support Hierarchical concept (topic) indexing for massive knowledge bases Database-backed storage engine Web UI for browsing and managing wikis Contributing Contributions are welcome! Please submit a pull request, or open an issue for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach. License Apache 2.0. See LICENSE . Support Us If you find OpenKB useful, please give us a star 🌟 — and check out PageIndex too!