The article traces Alkira’s effort to build an internal enterprise assistant capable of answering complex questions across scattered corporate knowledge, from wikis and tickets to design docs and support channels. Early attempts focused on fine-tuning open-source models such as Llama3-8B, Qwen3-8B, and Mistral-7B on question-and-answer pairs generated from documentation. These models struggled with vague queries and only produced reliable answers when prompts were phrased very specifically, highlighting the limitations of fine-tuning as a substitute for queryable memory. Without direct grounding in source documents, responses were brittle and prone to hallucination, making this strategy unsuitable for a broad, dynamic knowledge base that also carried strict data sovereignty requirements.
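To make the fine-tuning stage concrete, here is a minimal sketch of what generated question-and-answer training data of this kind could look like. The chat-message layout, field names, and file paths are assumptions for illustration, not Alkira's actual format.

```python
# Hypothetical sketch of Q&A fine-tuning data generated from documentation.
# Field names and the chat-message layout are assumptions, not Alkira's schema.
import json

qa_pairs = [
    {
        "messages": [
            {"role": "user", "content": "How do I rotate the API key for a connector?"},
            {"role": "assistant", "content": "Open the connector settings page, ..."},
        ],
        "source_doc": "wiki/connectors/api-keys.md",  # provenance kept for auditing
    },
]

# One JSON object per line, the layout most open-source fine-tuning tools accept.
with open("train.jsonl", "w") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair) + "\n")
```

The weakness described above follows from this setup: once the pairs are baked into the weights, there is no way to query the underlying documents at answer time.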
Moving beyond fine-tuning, the team turned to retrieval-augmented generation (RAG), initially experimenting with existing open-source frameworks. Prototype v0.1 adopted LightRAG with Neo4j and Qdrant and saw encouraging results on a small corpus. Prototype v0.2 customized LightRAG with semantic chunking, contextual embeddings, and dynamic entity extraction, but when the corpus grew from ~100 to ~3,000 documents, retrieval quality collapsed due to context pollution: too many irrelevant or tangential documents diluted the context fed to the generator model. This experience led Alkira to conclude that generic frameworks lacked the granular control over ingestion and retrieval needed for complex enterprise environments. A subsequent proof of concept, v0.3, combined FalkorDB with graphrag-sdk, Qdrant, and a FastAPI backend using gemini-2.5-flash, along with dual dense and sparse vectors. It delivered outstanding quality at ~5,000 documents, validating the hybrid graph-plus-vector approach and suggesting that a single high-quality dense vector inside FalkorDB would be sufficient.
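As a rough illustration of the dual dense-plus-sparse retrieval the v0.3 proof of concept ran against Qdrant, the sketch below uses the Qdrant Python client. The collection name, vector size, placeholder query vectors, and the choice of reciprocal rank fusion are assumptions, not details confirmed by the article.

```python
# Minimal sketch of dual dense + sparse retrieval with Qdrant (qdrant-client >= 1.10).
# Collection name, vector size, and query vectors are placeholders.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# One named dense vector and one named sparse vector per document chunk.
client.create_collection(
    collection_name="alkira_docs",
    vectors_config={"dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

# Hybrid query: prefetch candidates from both vector spaces, then fuse the two
# ranked lists with reciprocal rank fusion before handing results to the generator.
dense_query = [0.0] * 1024  # placeholder embedding of the user question
sparse_query = models.SparseVector(indices=[17, 923], values=[0.8, 0.4])

hits = client.query_points(
    collection_name="alkira_docs",
    prefetch=[
        models.Prefetch(query=dense_query, using="dense", limit=50),
        models.Prefetch(query=sparse_query, using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)
```

The finding that a single high-quality dense vector inside FalkorDB would suffice is what lets the production design drop the separate sparse index and the standalone vector store altogether.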
These lessons informed the production-ready architecture, AKGPT v0.4, built around FalkorDB as a single in-memory engine that stores both the knowledge graph and the vector embeddings, minimizing latency and operational complexity. A manually triggered ingestion pipeline, orchestrated via a Redis queue, uses a language model to filter out non-technical noise and personally identifiable information, performs semantic chunking, extracts entities and relationships under a defined schema, generates 4096-dimensional Qwen3-Embedding-8B vectors, and creates hybrid links between conceptual nodes and their source text (a storage sketch appears below).

At query time, an enhancement step reframes the user's prompt with a rich system prompt before dispatching it to two parallel retrieval paths over FalkorDB: precise graph traversal based on extracted entities, and semantic vector search (a retrieval sketch also appears below). Results from each path are scored by a dedicated Qwen3-Reranker-8B model, which keeps the top 10 items per path, and the resulting top 20 unique chunks are synthesized into an answer by gemini-2.5-flash. The system now handles both vague conceptual questions and specific operational commands while keeping all data inside Alkira's infrastructure.

The article notes that this approach demands significant custom development and careful management of FalkorDB's RAM costs, multi-model latency, and schema evolution. The roadmap includes automated ingestion from tools like Jira, Confluence, and Slack, temporal and multi-hop graph retrieval, more agentic orchestration for live data and actions, corrective RAG loops, and caching for frequently asked questions.
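The first sketch illustrates the hybrid storage model: one chunk node with its embedding, one entity node, and the link between them in FalkorDB. The labels, property names, relationship type, and exact vector-index syntax are assumptions and may differ from AKGPT's schema and from the FalkorDB version in use.

```python
# Minimal sketch of the hybrid graph-plus-vector layout, using the FalkorDB
# Python client. Labels, property names, and the relationship type are
# illustrative assumptions, not AKGPT's actual schema.
from falkordb import FalkorDB

db = FalkorDB(host="localhost", port=6379)
graph = db.select_graph("akgpt")

# Vector index over chunk embeddings (4096 dims to match Qwen3-Embedding-8B).
# Exact index syntax can differ between FalkorDB versions.
graph.query(
    "CREATE VECTOR INDEX FOR (c:Chunk) ON (c.embedding) "
    "OPTIONS {dimension: 4096, similarityFunction: 'cosine'}"
)

# Store one semantic chunk, one extracted entity, and the link between them,
# so graph traversal and vector search land on the same source text.
graph.query(
    """
    MERGE (e:Entity {name: $entity})
    CREATE (c:Chunk {text: $text, source: $source, embedding: vecf32($embedding)})
    CREATE (e)-[:MENTIONED_IN]->(c)
    """,
    params={
        "entity": "BGP peering",
        "text": "To establish BGP peering with the cloud exchange point ...",
        "source": "design-docs/routing.md",
        "embedding": [0.0] * 4096,  # placeholder for the real Qwen3 embedding
    },
)
```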
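The second sketch outlines the two-path retrieval and rerank flow. The Cypher queries, the vector procedure call, and the rerank_top_k helper are stand-ins for Alkira's entity extraction, FalkorDB calls, and Qwen3-Reranker-8B; the two paths are shown sequentially here rather than in parallel.

```python
# Sketch of the two-path retrieval described above: graph traversal plus vector
# search over FalkorDB, each reranked to its top 10, then merged for synthesis.
from falkordb import FalkorDB

db = FalkorDB(host="localhost", port=6379)
graph = db.select_graph("akgpt")


def graph_path(entities: list[str]) -> list[str]:
    """Precise path: walk from extracted entities to the chunks that mention them."""
    res = graph.query(
        "MATCH (e:Entity)-[:MENTIONED_IN]->(c:Chunk) "
        "WHERE e.name IN $entities RETURN c.text LIMIT 50",
        params={"entities": entities},
    )
    return [row[0] for row in res.result_set]


def vector_path(query_embedding: list[float]) -> list[str]:
    """Semantic path: k-nearest-neighbour search over the chunk embeddings."""
    res = graph.query(
        "CALL db.idx.vector.queryNodes('Chunk', 'embedding', 50, vecf32($q)) "
        "YIELD node RETURN node.text",
        params={"q": query_embedding},
    )
    return [row[0] for row in res.result_set]


def retrieve_context(question, entities, query_embedding, rerank_top_k):
    # rerank_top_k(question, chunks, k) is a hypothetical wrapper around the
    # Qwen3-Reranker-8B model; each path keeps its 10 best-scoring chunks.
    graph_hits = rerank_top_k(question, graph_path(entities), k=10)
    vector_hits = rerank_top_k(question, vector_path(query_embedding), k=10)
    # Deduplicate while preserving order; up to 20 unique chunks go on to the
    # gemini-2.5-flash synthesis step.
    return list(dict.fromkeys(graph_hits + vector_hits))[:20]
```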
