Knowledge Graph Across 46 Repos Using Static Analysis

When Your Codebase Outgrows What AI Can See

Modern software systems are rarely tidy. For teams that have been building in production for years, the reality is a sprawling ecosystem of services, frameworks, and interdependencies that no single engineer can hold in mind at once and no single AI prompt can capture. This is exactly the challenge that Ryan Tsuji, CTO at airCloset, set out to solve when he built what his team calls code-graph: a unified knowledge graph spanning 46 separate repositories, constructed entirely through static analysis.

The project, completed between January and March of this year, offers a candid look at what it actually takes to make a large, multi-repository production codebase legible — not just to humans, but to the AI tools that are increasingly expected to reason about it. This article dives into the motivation, the methodology, and the hard-won lessons from that three-month effort.

The Problem: A Production Codebase That Grew Up in Layers

Long-running production systems tend to accumulate complexity in a very specific way. Multiple teams touch the same codebase over time, each era leaving behind its framework of choice. At airCloset, that meant a technology stack that included jQuery, AngularJS, Express, NestJS, TypeORM, Redux, and Axios all coexisting — not as a clean migration story, but as living, active layers that still serve real users.

What makes this especially difficult to reason about is not the variety of frameworks in isolation, but the way dependencies cross repository boundaries. Consider three common patterns that emerge in systems like this:

API dependencies (n:1): The same API endpoint gets called from multiple repositories. Tracing who calls what requires looking across repos simultaneously, not one at a time.
Database dependencies (n:n): The same database table is read from and written to across multiple services. A change to a schema can have ripple effects that are invisible unless you map every point of contact.
Event dependencies: Knowing where an event is emitted tells you almost nothing about which services subscribe to it, or whether every emitted event has a subscriber at all. That coverage gap is practically untraceable without a view of the whole system.

These are not edge cases. They are the normal operating reality of a mature microservices architecture, and they represent exactly the kind of structural knowledge that gets lost as teams grow and codebases age.
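
To make the three patterns concrete, here is a minimal Python sketch. It is not taken from code-graph itself: the repository, endpoint, table, and event names are invented, and the flat "fact" tuples are an assumed representation chosen only to show why each question needs data from every repository at once.

```python
from collections import defaultdict

# Hypothetical boundary facts, one tuple per (repo, kind, target).
# All repo, endpoint, table, and event names are invented for illustration.
FACTS = [
    ("web-app", "calls_api", "GET /orders"),
    ("admin-ui", "calls_api", "GET /orders"),
    ("order-service", "reads_table", "orders"),
    ("order-service", "writes_table", "orders"),
    ("billing-service", "reads_table", "orders"),
    ("billing-service", "writes_table", "invoices"),
    ("order-service", "emits_event", "order.shipped"),
    ("order-service", "emits_event", "order.cancelled"),
    ("notify-service", "subscribes_event", "order.shipped"),
]


def group_by_target(facts, kinds):
    """Map each target to the sorted list of repos that touch it via `kinds`."""
    grouped = defaultdict(set)
    for repo, kind, target in facts:
        if kind in kinds:
            grouped[target].add(repo)
    return {target: sorted(repos) for target, repos in grouped.items()}


def unsubscribed_events(facts):
    """Return events that are emitted somewhere but subscribed to nowhere."""
    emitted = {t for _, k, t in facts if k == "emits_event"}
    subscribed = {t for _, k, t in facts if k == "subscribes_event"}
    return sorted(emitted - subscribed)


# n:1 API dependencies: which repos call each endpoint?
api_callers = group_by_target(FACTS, {"calls_api"})
# n:n database dependencies: which repos read or write each table?
table_users = group_by_target(FACTS, {"reads_table", "writes_table"})
# Event coverage: which emitted events have no subscriber anywhere?
coverage_gaps = unsubscribed_events(FACTS)

print(api_callers)
print(table_users)
print(coverage_gaps)
```

None of these three answers can be computed from a single repository's facts. Each one only exists once the facts from all repositories sit in the same collection.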

Why "Just Letting AI Read the Code" Falls Short

There is a tempting shortcut that many engineering teams reach for when facing this kind of complexity: feed the code to a large language model and ask it questions. This approach works reasonably well for understanding a single file, a single service, or even a small cluster of related modules. But it breaks down precisely at the boundaries where the most important architectural decisions live.

AI models, including the most capable ones available today, operate within context windows. When your codebase spans 46 repositories and the critical dependencies are the ones that jump between them, no single context window captures the full picture. An AI reading repository A has no reliable way to know that repository B writes to the same database table, or that repository C is the only subscriber to an event that A emits.

This is the core insight that drove the code-graph project: the connections that cross repository boundaries are not incidental details. They are often the most architecturally significant facts about a system, and they are exactly what gets lost when you analyze code repository by repository.

Static Analysis as the Foundation

To solve this, Tsuji turned to static analysis: the practice of extracting structural information from source code without executing it. Static analysis is not a new idea, but applying it systematically across 46 repositories with a heterogeneous mix of frameworks required significant customization and iteration.

The goal was to extract boundaries: the points at which one service reaches out to another, whether through an HTTP call, a database query, or an event. Identifying these boundaries in every repository and then linking them together yields a graph in which nodes represent components (files, modules, API routes, database tables, event channels) and edges represent the dependencies between them.
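
The extract-then-link idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the repositories, file paths, and source snippets are hypothetical, regular expressions stand in for the real syntax-tree parsing a production analyzer would need, and calls are matched to routes by exact method and path only.

```python
import re

# Hypothetical source snippets keyed by (repo, file). A real pipeline would
# parse files into syntax trees; regexes keep this sketch short.
SOURCES = {
    ("order-service", "src/routes.js"):
        "router.get('/orders', listOrders);\nrouter.post('/orders', createOrder);",
    ("web-app", "src/api.js"):
        "const res = await axios.get('/orders');",
    ("admin-ui", "src/orders.js"):
        "axios.get('/orders').then(render);\naxios.delete('/orders/legacy');",
}

# Assumed conventions: Express-style `router.<method>('<path>'` definitions
# and `axios.<method>('<path>'` calls with literal string paths.
ROUTE_DEF = re.compile(r"router\.(get|post|put|delete)\(\s*'([^']+)'")
HTTP_CALL = re.compile(r"axios\.(get|post|put|delete)\(\s*'([^']+)'")


def extract(sources):
    """Step 1: pull route definitions and HTTP calls out of every file."""
    defs, calls = [], []
    for (repo, file), text in sources.items():
        for m in ROUTE_DEF.finditer(text):
            defs.append((repo, file, m.group(1).upper(), m.group(2)))
        for m in HTTP_CALL.finditer(text):
            calls.append((repo, file, m.group(1).upper(), m.group(2)))
    return defs, calls


def link(defs, calls):
    """Step 2: join calls to routes across repos on (method, path)."""
    routes = {(method, url): repo for repo, _file, method, url in defs}
    edges, unresolved = [], []
    for repo, file, method, url in calls:
        owner = routes.get((method, url))
        if owner is None:
            unresolved.append((f"{repo}:{file}", f"{method} {url}"))
        else:
            edges.append((f"{repo}:{file}", "calls", f"{owner}:{method} {url}"))
    return sorted(edges), sorted(unresolved)


defs, calls = extract(SOURCES)
edges, unresolved = link(defs, calls)
for edge in edges:
    print(edge)
print("unresolved:", unresolved)
```

Keeping an explicit list of unresolved calls is a deliberate choice in this sketch. A call that matches no known route is either a gap in the extractor or a dependency on something outside the analyzed repositories, and both are worth surfacing rather than silently dropping.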

The result is code-graph: a single, queryable representation of the entire system's structure, built from the ground up through automated analysis rather than manual documentation.

What Three Months of Trial and Error Taught

Building code-graph was not a linear process. Three months of iteration uncovered both what static analysis can do well and where it reaches its limits. On the positive side, the approach proved capable of reliably mapping explicit structural relationships: which files import which modules, which API endpoints are defined where, which database models are referenced by which services.
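
As a small illustration of the "explicit structural relationships" case, the Python sketch below maps which files import which modules. The file paths and contents are hypothetical, and the two regular expressions are a simplification assumed for brevity: they handle single-line ES `import` statements and literal `require(...)` calls, and would miss dynamic or computed imports.

```python
import re

# Single-line ES module imports, with or without a `from` clause.
ES_IMPORT = re.compile(
    r"""^\s*import\s+(?:.+?\s+from\s+)?['"]([^'"]+)['"]""", re.MULTILINE
)
# CommonJS require calls with a literal string argument.
REQUIRE = re.compile(r"""require\(\s*['"]([^'"]+)['"]\s*\)""")

# Hypothetical files mixing newer and older module styles.
FILES = {
    "src/server.ts": (
        "import express from 'express';\n"
        "import { OrderModule } from './order/order.module';\n"
    ),
    "legacy/cart.js": (
        "var $ = require('jquery');\n"
        "var axios = require('axios');\n"
    ),
}


def import_edges(files):
    """Return sorted (file, imported_module) pairs found in the given files."""
    edges = set()
    for path, text in files.items():
        for pattern in (ES_IMPORT, REQUIRE):
            for module in pattern.findall(text):
                edges.add((path, module))
    return sorted(edges)


for file, module in import_edges(FILES):
    print(f"{file} -> {module}")
```

Relationships like these are written down literally in the source, which is why they are the reliable part. A module name assembled at runtime would not appear in this output at all.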

The harder problems were in the gaps — the implicit knowledge that lives in naming conventions, runtime behavior, and the accumulated tribal knowledge of a team. Static analysis can tell you that a function exists and where it is called. It cannot always tell you why it was written that way, or what business rule it encodes, or which product feature would break if it changed.

These limitations are not a failure of the approach. They are a clear signal that static analysis is a foundation, not a complete solution. Recognizing this led to the next phase of the project: building a layer called service-product-graph (SPG) on top of code-graph, designed to fill in exactly the gaps that static analysis leaves behind. That work is the subject of Part 2.

Why This Kind of Tooling Matters Now

As AI-assisted development becomes a standard part of engineering workflows, the quality of the context that AI tools can access becomes a critical infrastructure concern. A knowledge graph like code-graph is not just a visualization tool for human engineers. It is a structured, machine-readable representation of how a system actually works — the kind of artifact that makes it possible for AI tools to reason about a codebase at the level of the whole system, not just the level of individual files.
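
One way to picture "machine-readable" is a query that pulls the neighborhood of a single node out of the graph and serializes it as JSON for an AI tool to consume. The Python sketch below assumes an invented edge list, relation names, and output shape. The source does not describe how code-graph is actually queried or exposed, so treat this purely as an illustration of the idea.

```python
import json

# Hypothetical graph edges as (source, relation, destination) triples.
EDGES = [
    ("web-app", "calls", "GET /orders"),
    ("GET /orders", "served_by", "order-service"),
    ("order-service", "reads", "table:orders"),
    ("billing-service", "writes", "table:orders"),
    ("order-service", "emits", "event:order.shipped"),
]


def neighborhood(edges, start, max_hops):
    """Collect edges within `max_hops` of `start`, ignoring edge direction."""
    seen = {start}
    frontier = {start}
    selected = []
    for _ in range(max_hops):
        next_frontier = set()
        for edge in edges:
            src, _rel, dst = edge
            if (src in frontier or dst in frontier) and edge not in selected:
                selected.append(edge)
                for node in (src, dst):
                    if node not in seen:
                        seen.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return selected


def to_context(edges, start, max_hops=1):
    """Serialize the neighborhood of `start` as JSON suitable for a prompt."""
    subgraph = neighborhood(edges, start, max_hops)
    payload = {
        "focus": start,
        "edges": [{"from": s, "rel": r, "to": d} for s, r, d in subgraph],
    }
    return json.dumps(payload, indent=2)


# Everything that directly touches the orders table, regardless of repo.
print(to_context(EDGES, "table:orders"))
```

The hop limit is the useful knob here: it keeps the serialized context small enough to fit in a prompt while still including the cross-repository neighbors that a file-by-file reading would miss.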

For teams managing large, multi-repository codebases, investing in this kind of structural understanding is increasingly the difference between AI assistance that feels genuinely useful and AI assistance that confidently misses the point. The airCloset code-graph project is a detailed, practical example of what that investment involves and what it takes to get it right.