# I Built an AI That Understands Any GitHub Repo Using LangChain and ChromaDB

Source: DEV Community
## Why I Built This

Every time I join a new codebase, the first few days are the same: open the repo, stare at folders, try to figure out which service does what, read half a file, get interrupted, lose context, start over.

GitHub's built-in search is keyword-only. ChatGPT has never seen your repo. Teammates are busy. Documentation is either missing or out of date.

I wanted a tool that could answer "how does checkout work?" from the actual code: not from training data, not from docs, but from the real source files. So I built one.

## How It Works

The system is built around a RAG (Retrieval-Augmented Generation) pipeline. The idea: instead of asking an LLM to answer from memory, you first retrieve the most relevant code chunks, then ask the LLM to answer using only those chunks.

Ingest flow:

1. Clone the GitHub repo locally
2. Walk every file and split into overlapping chunks (~500 tokens, 50-token overlap)
3. Convert each chunk to a vector embedding using all-MiniLM-L6-v2 (Sentence Transformers, lo