Introduction

Written After Revit 2027 Release

The Autodesk Revit 2027 AI Assistant was recently updated. Its new AI capabilities can break down user requirements more accurately, guide users in selecting the required data, and open up integration with tools like Claude and Cursor.

Because of this, I hope this project can offer some inspiration to engineers without a programming background: even without writing traditional code directly, you can try building your own Tools and actually applying them to Revit workflows.

Of course, this article also suffers a bit from “procrastination” — it dragged on until after Revit 2027 was released. During that time, some of the technical directions I originally intended to discuss have already been covered by official capabilities. :)

Framework Introduction

If you’ve used Claude, GPT-5, Gemini, or other mainstream large models for coding recently, you’ve probably noticed: the code generation capability of the models themselves is no longer the bottleneck. For most languages and frameworks, these models reach 90–95% accuracy at writing syntactically correct, well-structured code that handles edge cases. But once you ask them to write Revit code, errors start to appear.

It’s not because the models can’t write C#. It’s because they can’t reach the data inside your project, or the exact API signatures. They can’t actually open your .rvt file, can’t see which family types are loaded in your project, don’t know which element you currently have selected, and certainly don’t know what your company’s naming conventions look like.

When facing the situation of “I don’t know,” the model has a stubborn default behavior — it guesses, and it guesses with confidence.

Here’s an example I’ve observed countless times. You ask it to “create a 300x300 concrete column,” and the model will likely write code like this:

var symbol = new FilteredElementCollector(doc)
    .OfClass(typeof(FamilySymbol))
    .OfCategory(BuiltInCategory.OST_StructuralColumns)
    .Cast<FamilySymbol>()
    .FirstOrDefault(s => s.Name.Contains("300x300"));

This code looks perfectly reasonable. 300x300 is an extremely common concrete column dimension in Revit projects, appearing in vast amounts of training data. The model isn’t fabricating a completely outrageous value — it’s using the most common value from training data as a default assumption.

But your project might not have a family type called “300x300” at all. Your company might use “KZ-300×300”, or “RectColumn_300_300”, or “Rect-300x300mm”. The model doesn’t know, so it picks the one that “looks most like it” — and the result at runtime is null, or worse, a family that happens to have “300x300” in its name but isn’t the one you wanted.

This is the core failure mode of AI writing Revit code: it’s not that the code is wrong, it’s that the default values are wrong; not that the model can’t write, but that it chose to “guess” where data was missing. This kind of error is hidden, high-probability, and expensive to debug — the code compiles, runs, and sometimes even produces results that look correct.

revit-api-rag solves exactly this. It’s not trying to make AI better at writing code — AI is already good enough. It does something else:

As a bridge between AI and users, it delivers the 5–10% of real data inside Revit that the model can’t reach — structurally, reliably, and with low latency — into the generation process, so the AI no longer needs to “guess”.

Specifically, the framework does four things.

First, it acts as a bridge between AI and users. User requests are often vague (“create a column”), but AI output requires precision (which family, which level, which coordinates). The information gap in between either gets filled by repeated user clarification, or by the AI guessing from training data. Neither is good. The framework’s bridging role is to automatically inject real data from the project, so users don’t have to clarify repeatedly and the AI doesn’t have to guess.

Second, it provides more accurate context for final generation. Through RAG retrieval of real API signatures and SDK examples, plus queries against the live Revit document, what gets injected into the prompt is no longer “Revit as the model remembers it” but “Revit as it actually exists in your project.”

Third, different models do different jobs. Embedding uses OpenAI’s text-embedding-3-large, rerank uses Cohere rerank-v3.5, main generation uses Gemini, and intent classification with lightweight tasks uses smaller models. Each step uses the most suitable model — no one model is forced to solve everything. This combination is both cheaper and more accurate than “all GPT-4” or “all Claude”.

Fourth, the framework itself is a learning sample. If you’re also building a “domain knowledge + AI auto-programming” system, the implementation here — raw material pruning and cleaning, embedding pipeline, Agent invocation, Workflow orchestration, Skill encapsulation, Tool generation and reuse — keeps each piece as independent as possible, so they can be referenced or extracted separately.

Overall architecture: bridging AI and Revit project data


1. Letting “People with Logic but No Code” Participate in Programming

Once AI code accuracy is pushed beyond 95%, can people without programming skills but with deep domain expertise finally participate in automation work?

The Reality of the Design Industry

In my BIM / structural engineering industry, the vast majority of practitioners do “parametric” work every day — they just don’t write code:

  • The structural engineer’s logic when calculating reinforcement: looking up codes, applying formulas, getting results based on span, load, seismic intensity
  • The architect’s logic for facade design: modular grids, alignment, shadow analysis, height limits
  • The interior designer’s logic for cabinetry: ergonomic dimensions, hardware positions, panel layout
  • The MEP engineer’s logic for piping: pipe diameter calculations, clash avoidance, slope requirements

These people all have very clear “algorithmic thinking.” They do “input parameters → apply rules → get output” every day — which is essentially the programmer’s working mode.

But they get stuck on something completely outside their professional domain: the syntax of programming languages. To turn “beam reinforcement logic” into a Revit C# automation script requires learning OOP, the Revit API, LINQ, Transaction patterns — these are entirely separate disciplines from “understanding structure.”

For the past thirty years, there have been only two ways to solve this problem:

  1. Learn programming: A few engineers with interest and time make the transition, but the cost is extremely high, and 80% of their time goes into “syntax” rather than “domain”
  2. Find a developer: The company hires a customization team, the engineer verbally describes the requirements, the developer writes the code — back to the inefficient loop described in the introduction

A third path — “AI translates my domain language into code” — has always existed, but accuracy was too low. 60% is essentially unusable in professional scenarios, because the remaining 40% failures still need someone who understands code to fix, and the requester still has to find a developer.

What AI Coding Changes

When AI Coding stably reaches 95%+, a qualitative change happens:

The 5% that fails no longer needs “people who understand code” — it needs “people who understand the domain.”

Most of those 5% failures aren’t C# syntax errors — they’re cases where the domain intent is unclear. For example, a user says “beams shouldn’t be too dense” but never specifies what counts as too dense. This kind of problem doesn’t need a programmer to debug code; it just needs a BIM engineer to clarify the requirement a bit more, and the AI regenerates.

At this moment, “whether you can write code” is no longer the threshold for participating in programming work. The threshold becomes:

  • Can you articulate your domain logic clearly?
  • Can you tell whether something failed because the logic is wrong or the expression is wrong?
  • Can you verify whether the AI’s output complies with your domain norms?

These are exactly the strongest abilities of senior domain experts — they’ve been doing this for twenty years; logic, judgment, and verification have long been internalized.

This Framework Solves the “Cross-Industry Programming” Problem

I now prefer a more accurate description of this framework:

It’s not about AI replacing programmers — it’s about letting a structural engineer directly invoke code that only a programmer could write, without having to become a programmer first.

Every industry has its own “code” — structural engineers use design-code formulas, accountants use Excel functions, lawyers use legal citations, doctors use diagnostic flowcharts. These are all different forms of “programs”. But real code (C#, Python, JavaScript) is just one of them — it’s the code computers read.

Designers, engineers, doctors — they’ve already mastered their industry’s “code.” What they lack isn’t logical capability, it’s the ability to translate their industry’s code into computer code.

What AI does at 95%+ accuracy is essentially this translation. What needs to be done to push translation quality to that level — RAG, Dynamic Choices, Tool Solidification, Skills, resistance training, and so on — is what the rest of this article covers. But once translation is done well, those “with logic but without code” can really participate in automation work for the first time — not as requesters, but as authors.

The Vision

It means senior practitioners in every industry, without learning to program, can directly turn decades of accumulated domain knowledge, normative judgment, and professional intuition into executable tools — for themselves, for peers, and for those who come after them.

A structural engineer can package their “reinforcement judgment logic” into a tool, and a junior team member calling it once can replicate twenty years of experience.

An architect can package their “facade aesthetic rules” into a tool for other departments to use.

An interior designer can package their “cabinet design know-how” into a tool that automatically adapts to different layouts.

The framework has two layers specifically for this:

Tool Solidification lets the designer “do it once” successfully, and the AI solidifies it into a reusable template;

Skills lets the designer “write a rule,” and the AI follows it automatically when doing related tasks. The two paths complement each other — the former captures execution steps, the latter captures judgment principles — and through tool libraries and rule libraries they spread within teams and across the industry.

It amplifies a person’s domain expertise rather than replacing it.


2. Why AI Fails at Writing Revit Code

In the introduction I mentioned that AI “guesses” where data is missing. In this section I’ll break down this root cause. It manifests in three typical forms in actual code, which look different but are essentially the same thing.

Form 1: Using “Common Values” from Training Data as Defaults

This is an extension of the 300x300 column example from the introduction. It doesn’t only happen with family types — it happens in many other places.

The model often writes code like this:

// Defaulted to a "reasonable-looking" level
var level = collector.Cast<Level>()
    .FirstOrDefault(l => l.Name == "1F" || l.Name == "Level 1");

// Defaulted to a "usually like this" floor height
var height = 3000; // 3000mm is the most common floor height in training data

// Defaulted to a "standard" column spacing
var spacing = 6000; // Equally common

These values aren’t fabricated out of thin air. They are all values that frequently appear in real Revit projects — so the model learned from training data that “when creating columns, spacing is around 6000, floor height is around 3000, levels are usually called Level 1 or 1F.”

The problem is: users’ projects don’t necessarily conform to these “statistical norms.” A user’s company level naming might be “B1F / 1F / 2F”, might be “L01 / L02”, might even be the Chinese “首层 / 二层” (ground floor / second floor). The model’s default-value hit rate based on “most common in training data” is much lower than imagined.

And this kind of error is the hardest to debug — the code runs, produces results, just the wrong ones.

Form 2: .First() Pretending to Choose

When the model is forced to pick one from a collection but doesn’t know which, it writes:

var symbol = collector.OfClass(typeof(FamilySymbol))
    .OfCategory(BuiltInCategory.OST_StructuralColumns)
    .Cast<FamilySymbol>()
    .First();

What .First() does here is “I must give a value, so I take the first from the collection.” But what the user actually wants is a specific family type, not “the first in the collection”.

This code isn’t wrong syntactically — it’s wrong because it pretends to make a choice while actually concealing a fact the model didn’t know. Semantically it’s the same as the earlier “guess common values”: both are the model using a “reasonable-looking” move to evade the fact that it doesn’t know.

Form 3: Fabricating Reasonable-Looking API Signatures

// Model generates
Wall.Create(doc, line, levelId);

// Real signature requires 8 parameters
public static Wall Create(
    Document document, Curve curve,
    ElementId wallTypeId, ElementId levelId,
    double height, double offset,
    bool flip, bool structural);

This is the same category — the model remembers the name Wall.Create but can’t recall the exact parameter order, so it fills in based on “what this kind of API usually looks like”. In a library like Revit, with a massive API surface and signatures that change between versions, this kind of experience-based guess goes wrong quite often.

And One More: Ignoring the Execution Environment

There’s one more category of failure that’s slightly different from the above three — it’s not “guessed wrong,” it’s “didn’t think of it at all”:

  • The model writes using (Transaction t = new Transaction(...)), but the plugin already wraps everything in a transaction → nested error
  • Passes 6000 (meant as millimeters) straight to an API that expects feet, ending up creating a wall roughly 1800 meters tall
  • Uses FamilySymbol directly without Activate(), throwing an exception on call

The root cause of this category isn’t “missing data” — it’s “the model doesn’t know what unwritten constraints this execution environment has.” But the solution is similar: things that should be conventions of the system should not be left to the model to judge on the spot.
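These conventions are easy to enforce mechanically once they live in the system instead of the prompt. Here’s a minimal sketch of what the host side could own (not the project’s exact code; the helper names are illustrative):

using Autodesk.Revit.DB;

public static class HostConventions
{
    private const double MmPerFoot = 304.8;

    // The model writes and thinks in millimeters; the host converts to Revit's internal feet.
    public static double MmToFeet(double mm) => mm / MmPerFoot;

    // Placing an instance from an inactive symbol throws, so the host activates it up front.
    public static void EnsureActive(FamilySymbol symbol)
    {
        if (!symbol.IsActive)
        {
            symbol.Activate();
            symbol.Document.Regenerate(); // activation takes effect after regeneration
        }
    }
}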

The Common Root Cause

Putting these four manifestations together, the root cause is one sentence:

When the model is asked to make decisions it can’t make, it fills in with “common patterns” from training data.

What it fills in looks reasonable, because it really is statistically common. But “statistically common” and “actually exists in your project” are completely different things.

That’s why simply changing the prompt doesn’t help — you can tell the model “don’t guess,” but you can’t tell it “what actually exists in your project.” That has to be filled in at the system level.

Three forms of "guessing" + one form of "missing thought" — all rooted in missing data


3. Treat RAG as Trust Infrastructure, Not “Adding Some Material”

Many people understand RAG as “adding background knowledge to the model.” But in this kind of auto-programming system, RAG’s real role is something else:

It draws a trustworthy boundary for code generation.

The model isn’t free-styling — it’s combining within a set of validated, structured, cleaned API candidates. What it can write depends on what the retrieval stage put into the context.

This shift in perspective changes all subsequent design:

  • If RAG is “adding material,” then stuffing documents into the prompt is enough;
  • If RAG is “drawing boundaries,” then you must consider: how do you ensure what’s retrieved is real API? How do you ensure the version is correct? How do you ensure related objects (Level, FamilySymbol, BuiltInCategory) are also brought along?

This is why I didn’t use the simplest “chunk + embedding” approach, but built a four-layer pipeline.

RAG tightens generation boundaries: from "free play" to "combining within candidates"


4. Two Kinds of Data for the Model: API Knowledge + Project State

Now we have RAG’s “boundary-drawing” mindset. The rest is engineering: how to draw the right boundary.

Specifically, two things —

  1. Inject API knowledge into the model via the RAG pipeline (does this method exist, what’s its signature)
  2. Inject project state into the model via Dynamic Choices (what actually exists in your project)

Both are necessary. The former addresses “the model can’t accurately remember API details”; the latter addresses “the model can’t reach runtime data” — which is the root cause of the 300x300 problem in the introduction.

RAG Pipeline: What the Four Layers Do

The core idea of the entire pipeline is to push “the unreliable parts” to the offline stage, leaving runtime to do only lightweight, stable work:

Natural language → Query Rewrite → Dual Search → Rerank → Hydrate → Inject into generation

Each layer solves a specific problem:

  • Query Rewrite: The user says “create a wall,” but the vector store contains Wall.Create, WallType, Curve. First let the LLM rewrite user language into an expanded query containing API keywords, bridging the “language gap.”
  • Dual Search: API documentation (revit_api) and SDK examples (revit_sdk) are recalled in parallel from two independent indexes. The former handles “does this method exist” (constraint), the latter handles “how is this usually done” (demonstration). Their roles are different and shouldn’t be mixed.
  • Rerank: Vector similarity isn’t semantic relevance. Cohere’s rerank model filters “close enough” into “right.” Empirically this step improves final quality more than swapping in a larger embedding model.
  • Hydrate: The vector store only stores ids and embeddings. Full fields (method signatures, parameter lists, remarks, related examples) are fetched from SQLite — vector store as the recall entry point, SQLite as the complete storage, with clear separation of responsibilities.

The design goal isn’t “recall a lot,” it’s “recall accurately, and precisely reflect what the API actually looks like.”
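To make the Hydrate split concrete, here’s a minimal sketch. The real pipeline runs server-side; for consistency with the other snippets in this article it’s written in C#, and the table and column names are hypothetical. The vector store hands back record ids, and the full documentation rows come from SQLite.

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

public static class Hydrator
{
    // ids: recall results from the vector store.
    // Returns the full-text blocks that will be injected into the generation prompt.
    public static List<string> Hydrate(string dbPath, IEnumerable<long> ids)
    {
        var blocks = new List<string>();
        using var conn = new SqliteConnection($"Data Source={dbPath}");
        conn.Open();
        foreach (var id in ids)
        {
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "SELECT signature, parameters, remarks FROM api_records WHERE id = $id";
            cmd.Parameters.AddWithValue("$id", id);
            using var reader = cmd.ExecuteReader();
            if (reader.Read())
                blocks.Add($"{reader.GetString(0)}\n{reader.GetString(1)}\n{reader.GetString(2)}");
        }
        return blocks;
    }
}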

Four-layer retrieval pipeline

Dynamic Choices

But RAG can’t solve one class of problem — the model doesn’t know what’s in the .rvt project you currently have open.

Consider this request: “create two structural columns on the first floor.”

The code the model needs to generate uses at least two “project-level entities”: a specific FamilySymbol (structural column family type), and a specific Level (first floor). These two things don’t exist in API documentation, nor in SDK examples — they exist in the user’s currently open Revit document.

If the model doesn’t know these, it can only fabricate a name, use .First() to grab one out of thin air, or treat it as a required parameter for the user to fill in — none of these are ideal. This is the root cause of the 300x300 failure mode in the introduction.

Dynamic Choices’ approach is: before invoking the LLM to generate code, query the current document via the plugin first, and pass the “actually available candidates in the project” into the model as context:

User: "create two structural columns on the first floor"

Intent recognition: needs FamilySymbol(structural column) and Level(first floor)

Query the live Revit document:
- FamilySymbols: ["W12X65", "HSS6X6X3/8", "RectColumn-300x300"]
- Levels: ["B1", "1F", "2F", "Roof"]

Inject prompt: these are real candidates that exist

LLM generates code (referencing real ElementId or asking the user to choose)

Replace “guess” with “query.” The generated code no longer relies on .First(), but directly references real existing elements.
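A minimal sketch of what such a live-document query can look like on the plugin side (method names are illustrative; the real plugin returns richer metadata than plain names):

using System.Collections.Generic;
using System.Linq;
using Autodesk.Revit.DB;

public static class DynamicChoices
{
    // Collect the family types that actually exist in the open document,
    // so they can be injected into the prompt as the only legal options.
    public static IList<string> GetStructuralColumnTypes(Document doc) =>
        new FilteredElementCollector(doc)
            .OfClass(typeof(FamilySymbol))
            .OfCategory(BuiltInCategory.OST_StructuralColumns)
            .Cast<FamilySymbol>()
            .Select(s => $"{s.FamilyName} : {s.Name}") // real names, no guessing
            .ToList();

    // Same idea for levels, ordered bottom to top.
    public static IList<string> GetLevelNames(Document doc) =>
        new FilteredElementCollector(doc)
            .OfClass(typeof(Level))
            .Cast<Level>()
            .OrderBy(l => l.Elevation)
            .Select(l => l.Name)
            .ToList();
}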

The more important side effect: when the context says “the project has W12X65 / HSS6X6X3/8 / RectColumn-300x300,” the model won’t fabricate a non-existent family type name. Its output space is physically tightened by real data — this is more effective at resisting model default behavior than any prompt-level instruction (this mechanism gets its own chapter, Chapter 8).

Dynamic Choices: replace "guessing" with "querying the live model"


5. Tool Solidification

By now, the entire chain can fairly stably turn natural language into executable code. But there’s one last problem:

Generating from scratch every time is repetitive labor — can it be eliminated? And further: can the system “learn” stable patterns from its own execution history?

What the LLM Does During Solidification: Identifying “What’s a Parameter, What’s Structure”

Let the LLM read multiple successful execution records and answer one specific question:

In this code, which values are dynamic parameters (changing every time), and which are static structure (the same every time)?

Example: today the user says “create a 3-meter-tall wall from A to B,” tomorrow says “create a 2.5-meter-tall wall from C to D.” Using the API documentation as a reference and comparing the values that change across executions, the LLM concludes:

  • Dynamic parameters: start coordinate, end coordinate, height, wall type, level
  • Static structure: API call sequence, unit conversion (mm → feet), Transaction wrapping, error handling

Then the LLM extracts dynamic parameters as interface parameters, retains the static structure as a code template, and outputs a structured tool definition:

name: create_wall_by_two_points
description: Creates a wall between two specified points on a given level
params:
  start_point:
    type: XYZ
    required: true
  end_point:
    type: XYZ
    required: true
  level_name:
    type: string
    required: true
  wall_type_name:
    type: string
    required: true
  height_mm:
    type: number
    default: 3000
code_template: |
  var level = new FilteredElementCollector(document)
      .OfClass(typeof(Level))
      .Cast<Level>()
      .First(l => l.Name == "{level_name}");
  var wallType = new FilteredElementCollector(document)
      .OfClass(typeof(WallType))
      .Cast<WallType>()
      .First(w => w.Name == "{wall_type_name}");
  var line = Line.CreateBound(
      new XYZ({start_point.x}, {start_point.y}, {start_point.z}),
      new XYZ({end_point.x}, {end_point.y}, {end_point.z})
  );
  var height_ft = {height_mm} / 304.8;
  return Wall.Create(document, line, wallType.Id, level.Id,
      height_ft, 0, false, false);
stats:
  total_uses: 47
  success_rate: 0.98
  avg_exec_ms: 52

This judgment isn’t simply “find literals and replace with variables” — it’s based on comparing across multiple executions which values actually change. If a value is always 304.8 (the mm-to-feet conversion factor), the LLM won’t parameterize it; if a value repeatedly changes across different requests (like coordinates, height), it gets identified as a parameter.

As the solidification process keeps running, new executions continue to observe the dynamic parameter range of solidified tools — for example, a tool that initially only supported a single level might find that users start passing in a level list during actual execution, triggering an extension of the parameter definition. Tools aren’t written once and finalized — they evolve continuously with use.

Reuse: Within the System + Externally via MCP

After tool solidification, the next time a structurally similar request comes in:

  1. The system first matches against the tool library
  2. On hit, fill parameters and execute directly
  3. The whole process is zero LLM calls, millisecond-level returns

But more importantly — the structure of these YAML tools is directly aligned with the MCP protocol. name, description, params schema are nearly all the fields needed for an MCP tool definition. This means the tool library isn’t only used inside the RAG system:

revit-api-rag tool library (YAML)
↓ Automatically exposed as MCP endpoint
revit-mcp-net (MCP Server)
↓ Standard protocol
Claude Code · Cursor · Any MCP client

This pathway concretely realizes the RAG/MCP fusion discussed in Chapter 12 — programmers use MCP clients to explore new scenarios → execute and validate multiple times in the RAG system → solidify into tools → automatically expose back to all MCP clients. Once a tool is captured, everyone can use it.
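The field mapping itself is almost mechanical. Here’s a sketch under the assumption that an MCP tool listing wants name, description, and a JSON-Schema inputSchema; the parameter metadata is simplified to type + required:

using System.Collections.Generic;
using System.Text.Json;

public static class McpToolExport
{
    // Turn a solidified tool definition into the JSON shape an MCP tool listing expects.
    public static string ToMcpToolJson(
        string name, string description,
        Dictionary<string, (string type, bool required)> parameters)
    {
        var properties = new Dictionary<string, object>();
        var required = new List<string>();
        foreach (var (paramName, meta) in parameters)
        {
            properties[paramName] = new { type = meta.type };
            if (meta.required) required.Add(paramName);
        }

        var tool = new
        {
            name,
            description,
            inputSchema = new { type = "object", properties, required }
        };
        return JsonSerializer.Serialize(tool, new JsonSerializerOptions { WriteIndented = true });
    }
}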

The solidification layer isn’t just the RAG system’s “cache” — it’s also a tool supply source for the entire AI Agent ecosystem. The longer the system runs, the more tools it provides externally.

The growth of the tool library is, in a sense, the growth of the system. Each successful execution is no longer a one-time output but adds an asset to the entire ecosystem.

Tool Solidification loop: every success makes the system faster

What to Do When a Tool Is Unhealthy

Tools sometimes fail — parameter models change, the user’s Revit version is incompatible, an edge condition triggers a bug. So the tool library isn’t just write-only — it needs monitoring:

  • Consecutive failures exceed threshold → tool flagged as “unhealthy,” next time falls back to RAG regeneration
  • Long-term unused → demoted to avoid mismatch
  • Low hit rate → re-examine whether the parameter definition is too broad
  • Sudden change in dynamic parameter range → trigger parameter schema extension or tool splitting

This belongs to engineering details, but it determines whether the tool library can run long-term. A tool library without health monitoring will degrade over time into a pile of half-broken code.
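A sketch of what such a health check can look like (the thresholds here are placeholders, not the project’s actual values):

using System;

public sealed class ToolHealth
{
    public int ConsecutiveFailures;
    public int TotalUses;
    public int Hits;
    public DateTime LastUsedUtc;

    // Decide whether a solidified tool may still short-circuit the RAG pipeline.
    public bool IsHealthy()
    {
        if (ConsecutiveFailures >= 3) return false;              // fall back to RAG regeneration
        if ((DateTime.UtcNow - LastUsedUtc).TotalDays > 180)     // long unused: demote
            return false;
        if (TotalUses >= 20 && (double)Hits / TotalUses < 0.3)   // low hit rate: re-examine the schema
            return false;
        return true;
    }
}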


6. Skills: Letting Users Write “Domain Rules” Directly to the AI

By now, the system has handled “general API knowledge” (RAG), “project state” (Dynamic Choices), and “successful paths” (Tool Solidification). But there’s still one class of knowledge not covered —

Every team and every project has its own conventions, agreements, and best practices. These can’t fit into RAG (because they’re not API knowledge), nor can they be stored in the Tool library (because they’re not concrete execution steps). They are high-level rules of “what principles to follow when doing this kind of thing”.

Concrete examples:

  • An internal regulation at a structural engineering firm: “column spacing should not exceed 9 meters; if it does, mark for review”
  • A BIM team’s naming convention: “family types use the format KZ-300×300; no other format is allowed”
  • A project’s drawing convention: “elevation view dimensions must display down to the millimeter”

None of these are things you can find in Revit API documentation, nor things any SDK example demonstrates. They are professional “experience accumulation” — previously living in PDF specs, Word documents, and group chat screenshots, with no AI access.

Skills is a new layer added for this: let users write rules in natural language, and the AI follows them automatically when doing related tasks.

What Skills Are

Each Skill is a Markdown file located at .ai-rules/skills/<skill-name>/SKILL.md in the project. The structure is very simple:

# Column Spacing Convention

## Applicable Scenarios
When creating structural columns or modifying column layout

## Rules
1. The center distance between adjacent columns should not exceed 9000mm
2. Spans exceeding 9000mm need to be marked `requires_review = true`
3. Between 8000–9000mm, give a warning but don't block

## Not Applicable
- Decorative columns
- Heritage building renovation projects

Just that direct. No special syntax, no YAML config — just markdown. Anyone who can write a Word spec document can write this.

How Skills Are Used by the System

When the user says “create a column grid on the second floor,” the system:

  1. Performs intent recognition as usual → create_column_grid
  2. Adds a step: semantic matching in the Skills library to find applicable Skills
  3. Hits the “Column Spacing Convention” Skill
  4. Stuffs the entire rule into the generation prompt
  5. The model automatically follows during code generation — for example, the generated code actually checks column spacing and marks anything over 9 meters for review

The whole process is transparent to users. They write a rule once, and afterward all related tasks follow it automatically.
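For step 2, the real system matches Skills semantically; the simplified sketch below uses a naive keyword score in its place, just to show where the matched rule text ends up: appended verbatim to the generation prompt.

using System;
using System.IO;
using System.Linq;

public static class SkillMatcher
{
    // Naive stand-in for semantic matching: a Skill counts as applicable
    // if enough of the user's words appear in its SKILL.md.
    public static string BuildSkillContext(string skillsRoot, string userRequest)
    {
        var requestTerms = userRequest.ToLowerInvariant()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries);

        var matched = Directory
            .EnumerateFiles(skillsRoot, "SKILL.md", SearchOption.AllDirectories)
            .Select(File.ReadAllText)
            .Where(text => requestTerms.Count(t => text.ToLowerInvariant().Contains(t)) >= 2);

        // Every matched rule is injected into the code-generation prompt as-is.
        return string.Join("\n\n---\n\n", matched);
    }
}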

Why Skills Is the “Principle Layer” While Tool Is the “Execution Layer”

Back to the core argument in Chapter 1 about “beneficiaries” — the biggest obstacle to letting designers participate in programming is “how to convey domain knowledge to the AI.”

The system’s previous answer was Tool Solidification: let the designer do it correctly once, and the AI solidifies it into a template. This path works, but has a limit — it can only solidify “concrete execution steps,” not “high-level rules.”

Tool solves: “Do this in these steps”
Skills solves: “What principles to follow when doing this kind of thing”

A comparison:

  • Tool: “Execution steps for creating a column” — use FilteredElementCollector to get the family, activate, place, return ElementId
  • Skill: “Column spacing convention” — regardless of the specific steps, column spacing over 9 meters needs review

Tool is fine-grained “execution templates,” Skills is coarse-grained “judgment principles.” Combined, they form complete domain knowledge accumulation.

Tool and Skills: two complementary domain knowledge accumulation paths

Skills Is the Truly Lowest-Threshold Entry for Designers

Tool Solidification requires designers to do it once correctly (and let AI solidify); Skills only requires designers to write a rule.

The two thresholds differ greatly:

  • “Doing it once” means going through the entire interaction flow, validating the result, confirming no errors — half an hour minimum
  • “Writing a rule” only requires opening a markdown file and writing down the judgment already in your head — within 10 minutes

The latter’s threshold is almost as low as writing a Word spec document.

I’ve observed several designers who used this system: the first thing they used wasn’t Tool solidification — it was Skills. “Writing rules” is their most familiar way of working — they’ve been doing this their entire careers, just that previously these rules could only be written in PDF specs for people to read; now writing once lets the AI act according to the rules.

This is the most interesting thing about Skills — it doesn’t ask designers to change their work habits; it directly connects their most familiar action of “writing specs” to the AI.


7. The Agents in the System: Each with Its Own Job + Resisting the LLM’s “Helpful” Instinct

By now I’ve covered several knowledge layers of the system: data layer (RAG), fact layer (Dynamic Choices), execution reuse layer (Tool Solidification), principle layer (Skills). But I haven’t explained how these layers are wired together — that’s the Agent coordination part.

Why Not Use One Big LLM Call for Everything

An intuitive approach: compress the whole flow into one LLM call — stuff user input, API documentation, and Revit project data all in, and let it “end-to-end” output executable code.

I tried it. The result didn’t meet the requirements. The reason isn’t just that the context is too long (though that’s also a problem) — more fundamentally: different tasks have very different requirements on the model.

  • Intent classification needs to be fast, cheap, deterministic (same input → same output)
  • Data audit needs long context and the ability to compare against source files
  • Code generation needs to be stable, accurate, and semantics-aware
  • Parameterization extraction needs semantic understanding but not coding ability

Using one model for everything is neither economical nor accurate. Splitting into multiple Agents, each Agent picking the most suitable model and the most suitable prompt — that’s the fundamental reason this system can run.

Below I’ll cover several core Agents, for each describing the problem it solves, the model used, and the key prompt design.

1. Orchestrator (Master State Machine)

Not an LLM — it’s a state machine. It decides which Agent to invoke next based on the current conversation state: if intent isn’t recognized, call the Intent Agent; if parameters aren’t collected, call the Slot Agent; once all info is in, call the Code Generation Agent.

The Orchestrator itself doesn’t invoke any LLM — its existence is so other Agents can maintain single responsibility. This is an easily overlooked but very important engineering decision — coordination logic between LLM Agents should not be handed to another LLM, because state machine behavior should be deterministic and debuggable.
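A sketch of the idea (stage and agent names are made up for illustration): the orchestrator is plain, deterministic control flow, so every transition can be logged and replayed.

public static class Orchestrator
{
    public enum Stage { AwaitingIntent, CollectingSlots, Generating, Done }

    // Deterministic transition: which agent runs next depends only on what the
    // session already has; no LLM takes part in this decision.
    public static (Stage next, string agent) Step(Stage current, bool intentKnown, bool slotsFilled)
    {
        switch (current)
        {
            case Stage.AwaitingIntent:
                return intentKnown ? (Stage.CollectingSlots, "slot_engine")
                                   : (Stage.AwaitingIntent, "intent_agent");
            case Stage.CollectingSlots:
                return slotsFilled ? (Stage.Generating, "code_generator")
                                   : (Stage.CollectingSlots, "slot_engine");
            case Stage.Generating:
                return (Stage.Done, "tool_solidifier");
            default:
                return (Stage.Done, "none");
        }
    }
}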

2. Intent Classification Agent

This is the first LLM Agent and the first decision point. It answers one question:

What kind of interaction type does the user’s request belong to this time?

I divide all Revit operations into three categories:

  • DIRECT: Single-step direct operations (delete, query, modify property) — no family type selection needed
  • SELECT_FAMILY: Creation requiring family type selection (structural columns, furniture, equipment)
  • SELECT_BOTH: Creation requiring host + family type (windows on walls, doors on walls)

This classification has a strong constraint: any operation creating a physical element must NEVER be classified as DIRECT. If the model tries to say “creating a column is a direct operation,” that’s definitely wrong — creation must select a type first.

Use Gemini Flash for this, because it’s fast, cheap, and accurate enough. When the LLM is unavailable (rate limits, timeouts), there’s a regex keyword fallback — sufficient for simple classification.
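The fallback can be as plain as a few keyword patterns. A sketch (the keyword lists are illustrative, not the actual rule set); note that creation always routes to a SELECT_* intent, never DIRECT, matching the hard constraint above:

using System.Text.RegularExpressions;

public static class IntentFallback
{
    // Keyword-based classification used only when the LLM call is unavailable.
    public static string Classify(string request)
    {
        var text = request.ToLowerInvariant();
        bool creates = Regex.IsMatch(text, @"\b(create|place|add|insert)\b");
        bool needsHost = Regex.IsMatch(text, @"\b(door|window|opening)\b");

        if (creates && needsHost) return "SELECT_BOTH";
        if (creates) return "SELECT_FAMILY";
        return "DIRECT"; // delete / query / modify-property style requests
    }
}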

3. Slot Engine Agent (Slot Filling) — The Most Complex Agent in the System

This is the Agent with the longest prompt and most rules in the system. What it does:

Through dialog, collect all parameters needed for code generation — but never let the model guess any value itself.

A Counterintuitive Design Choice

Initially I did this with traditional Slot Filling thinking: maintain a YAML slot definition for each intent, specifying each parameter’s name, type, and validation rules. That had about 974 lines in my version, which I later deleted entirely.

The reason: every new operation required hand-writing a slot definition, and maintenance cost grows linearly with the number of supported operations. And for “custom operations” (e.g., a user says “create a ramp” — an operation I hadn’t pre-defined), you can’t write hardcoded slots at all.

The new approach: let the LLM read the real Revit API documentation (pulled from RAG), then autonomously decide what parameters to collect.

User input: "create 3 structural columns"

Search term extraction → ["NewFamilyInstance", "Column", "StructuralType"]

RAG fetches NewFamilyInstance's real method signature + parameter list + remarks

Pass these materials + user input + rule set to the LLM

LLM generates in one go:
- Slots already extracted from user input (quantity = 3)
- Questions still to ask (family type, level, coordinates)

Afterward, every time the user answers a question, no LLM call is made — the frontend just sequentially fills slots, and once all slots are filled, moves to the next stage. So the first round is one LLM call, and subsequent rounds are pure logic — low latency, low cost.

This Agent has many specific prompt rules — the most critical are saved for the next section “Resistance Training.”

4. Code Generator Agent

This is the Agent whose output quality is most critical to the whole system, so it uses the most expensive model (Claude).

Its core constraints will be detailed in Chapter 10 — output method body only, no Transaction, must use the document variable, unit conversion up front. These constraints together tighten the model’s output space significantly while still leaving full room for code logic.

A design worth mentioning separately is the <thinking> tag: have the model output a sub-task decomposition before writing code:

<thinking>
Sub-task 1: get the W10x49 family type via FilteredElementCollector
Sub-task 2: get the Level 1 level object
Sub-task 3: loop to create 3 FamilyInstances
Notes: mm→feet conversion, FamilySymbol.Activate()
</thinking>

This thinking chain isn’t just for debug — it forces the model to mentally walk through the entire flow before writing code, listing risk points. This “think before write” structure significantly improves final code quality.

5. Tool Solidification Agent

Chapter 5 already detailed what it does. Here I add some details about its internal logic.

The interesting thing is you can’t decide which values are parameters by looking at one execution. In a single execution, every literal “could be” a parameter — but only by comparing across multiple executions can you truly distinguish dynamic parameter (changes every time) from static structure (same every time).

This judgment can’t be done with hard rules — it must rely on the LLM’s semantic understanding. In the prompt I have it think through these steps:

  1. Read multiple execution histories for this kind of request (each thinking chain + actually generated code + user selections)
  2. Compare each literal value’s variation across different executions
  3. For each value, ask two questions: “Does it actually change between executions?” (frequency detection) + “If a different user, a different scenario, should it logically change?” (semantic judgment)
  4. Both yes → parameterize; only one yes → mark as “to observe,” accumulate more executions before deciding

The semantic judgment in step 3 is important — some values might happen to be the same across all existing execution history (e.g., 5 times all using the same level), but semantically they should be parameters. If such “statistically unchanged but semantically should change” values are missed, the tool will be unable to adapt to new scenarios because the parameter definition is too narrow.

This logic also lets tools evolve continuously: when a solidified tool is invoked in a “slightly different” way (e.g., the user suddenly passes a list of levels instead of a single level), the system re-triggers parameter schema evaluation and upgrades the tool to a more general version. Tools aren’t one-off products — they grow continuously with use.

6. Data Preparation Phase Agents (Offline)

Finally, briefly touch on the offline Agents — they don’t participate at runtime but determine the floor of the whole system:

  • Stage-1 Quality Audit (Gemini Flash): scans 27,000 API records, scoring each by 8 deduction rules. The key is it measures “parsing quality” — whether content present in HTML was parsed into JSON — not “is the content good.” This distinction is important: deleting a record that failed to parse loses API coverage, but a record that parsed correctly but has plain content is legitimate and should be kept.
  • Stage-2 Repair (Claude Sonnet): for low-score records, reads the HTML source files and repairs by the principle of “fix only, don’t fabricate” — only adding back fields present in HTML but missing from JSON, no creative additions.
  • Golden Code Generation (Claude): refines the most pedagogically valuable code snippets from 200+ projects in the SDK, filtering out noise like UI, logging, and boilerplate.

These three Agents together form “knowledge base quality control,” determining whether RAG can fetch accurate material. Their cost isn’t low, but they only run when data is updated — a typical “trade preprocessing for runtime quality” choice.

7. LLM Adapter: Making Multi-Model Composition Engineering-Feasible

The last module that’s not really an Agent but is critical: the LLM Adapter.

The whole system uses multiple models — Gemini Flash for lightweight tasks, Claude for high-quality generation, OpenAI embedding for vectorization, Cohere for rerank. If every Agent directly called the native SDK, the code would become a mess.

The Adapter unifies all model calls under one interface, with primary/fallback switching:

primary:
  model: google/gemini-3-flash-preview
fallback:
  model: openai/gpt-5.3-codex

403 (quota), 429 (rate limit), 5xx, and timeouts on the primary model all auto-switch to fallback. This kind of fault tolerance is mandatory in production — a single provider going down should not bring the whole system down.
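A sketch of the switching logic (ILlmClient and the error classification below are illustrative, not the adapter’s actual interface):

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public interface ILlmClient { Task<string> CompleteAsync(string prompt); }

public static class LlmAdapter
{
    // Try the primary model; on quota, rate-limit, server errors, or timeout, retry on the fallback.
    public static async Task<string> CompleteWithFallback(
        ILlmClient primary, ILlmClient fallback, string prompt)
    {
        try
        {
            return await primary.CompleteAsync(prompt);
        }
        catch (HttpRequestException ex) when (
            ex.StatusCode == HttpStatusCode.Forbidden ||        // 403: quota
            ex.StatusCode == HttpStatusCode.TooManyRequests ||  // 429: rate limit
            (int?)ex.StatusCode >= 500)                         // provider outage
        {
            return await fallback.CompleteAsync(prompt);
        }
        catch (TaskCanceledException)                           // timeout
        {
            return await fallback.CompleteAsync(prompt);
        }
    }
}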

Going through OpenRouter as a unified gateway also incidentally solves another thing: support for the China network environment. Gemini, Claude, OpenAI APIs have varying access stability domestically; routing all through OpenRouter via proxy makes deployment simpler.

Agent Coordination Picture

Stringing these Agents together, the overall data flow is roughly:

User input

Orchestrator state machine

Intent Agent (Gemini Flash)
↓ Classification result
Slot Engine Agent (Gemini Flash + RAG + Revit Query)
↓ Slots filled
Code Generator Agent (Claude + RAG)
↓ Generated code
Sandbox static scan (not LLM)
↓ Pass
MCP Bridge → Revit Plugin (Roslyn compile and execute)
↓ Success
Tool Solidification Agent (Claude)
↓ Parameterize
Tool library (YAML)

Every step is loosely coupled — any Agent can be independently replaced, tested, and tuned. This is the biggest engineering advantage of multi-Agent architecture: complexity is sliced into independently manageable pieces rather than piled into one prompt.

Agent coordination: each Agent uses the most suitable model


8. Resistance Training: How to Make the Model Refuse Its “Helpful” Instinct

This section gets its own treatment because it’s the core that makes the system usable — and the real solution to the “AI default values / fabrication” problem from the introduction.

LLMs have a behavior pattern you must face directly: when they don’t know a value, they tend to “actively help” — give a reasonable-looking default instead of admitting “I don’t know.” This behavior is a product of RLHF training. In general dialog scenarios, “give an answer” is rewarded more than “ask repeatedly,” so the model learned that guessing a reasonable value is more helpful than asking the user.

But in a scenario like Revit, which requires strict alignment with project data, “actively helping” is a disaster. The 300x300 column example in the introduction is typical: the model doesn’t know what family types the project has, so it “helpfully” gives the most common 300x300 from training data. The code runs, the result is wrong.

So — how do you train the model to violate this instinct?

The “training” here isn’t fine-tuning. This system doesn’t change a single model weight — it’s all prompt-level behavioral constraints. But the effect is equivalent to training: using prompts to flip the model from its default behavior of “actively guessing” into a state of “explicitly refusing to guess.”

Below are several techniques that actually work.

One: Tell the Model Its “Capability Boundary” Up Front

This is the most critical line in the Slot Engine prompt:

You are NOT connected to a live Revit session. You CANNOT look up types, levels, or positions.

On the surface, this is a factual statement; in reality, it’s a redefinition of the model’s role.

Without this line, the model defaults to assuming it’s “all-knowing” — it will say “use the first available column type” or “assume the level is 1F.” With this line added, the model goes from “all-knowing advisor” to “information collector” — its job isn’t to give answers, it’s to ask questions.

The effect of this single line is stronger than ten “don’t guess” reminders later in the prompt. Because it changes the model’s role cognition, not its specific behavior. When you define “who I am,” “what I should do” follows automatically; the reverse doesn’t work.

Two: Explicit FORBIDDEN List + “These Are Errors” Label

Just saying “don’t guess” doesn’t work — the model treats it as a soft suggestion. It must be written as a hard constraint:

FORBIDDEN behaviors (these are ERRORS):

  • Picking a default type/family
  • Inventing coordinates
  • Assuming a level
  • Picking StructuralType silently

The (these are ERRORS) label is critical — it upgrades “shouldn’t do” to “doing it is a bug.” RLHF-trained models are very sensitive to the word “error,” and the output space tightens noticeably.

This kind of “explicit ban” is far more effective than “implicit guidance.” A common mistake is writing “please try to use real data” — soft guidance like this is largely ignored by the model. Switch to “FORBIDDEN: fabricating default values (this is an ERROR),” and the rate of fabricated values drops visibly.

Three: Give the Model an “Honest” Way Out

Just banning isn’t enough — also provide a compliant alternative behavior:

For EVERY parameter, you MUST either:
a) Extract its EXACT value from the user’s input text, OR
b) Create a question for the user

This step turns “admit not knowing” into a positive behavior. The model no longer struggles with “if I don’t answer, am I being unhelpful?” — it now has a clear compliant exit: ask.

This principle generalizes: when forbidding a behavior, you must simultaneously provide a legitimate alternative. Otherwise the model feels a “dilemma,” and ultimately picks the behavior it was rewarded for during training.

Four: Let RAG Context Appear in the Prompt, Physically Crowding Out the Model’s “Memory”

This is a less-discussed but very important point.

Many of the model’s “hallucinations” come from memory of training data — it “remembers” Wall.Create has roughly some parameters, “remembers” levels are usually called Level 1. When the prompt context lacks more specific material, it uses these memories.

But when the prompt actually includes the real method signature pulled from RAG:

NewFamilyInstance(XYZ location, FamilySymbol symbol, Level level,
    StructuralType structuralType)

The model’s “memory” gets suppressed by this more specific, more recent context — it preferentially uses the material in front of it rather than the version in training memory. This is what RAG really does: it doesn’t just provide information, it pulls the model’s attention from training memory to the current context.

Five: Dynamic Choices

The previous four techniques are at the prompt level; this last one is at the architecture level.

No matter how well the prompt is written, if the model really “wants” to guess, it has room. But if you put the actually existing family type list in the project directly into the prompt:

## Available Family Types in current project
- W10x49
- W12x26
- W14x30

The model can no longer guess “300x300” — because the context clearly states only these three are available. At this point its output space is physically tightened to legitimate options.

That’s why Dynamic Choices isn’t just “adding some information” — it’s the last guardrail of resistance training. Prompt constraints are “soft” and can be circumvented by the model’s “creative play”; injecting real data is “hard” and leaves the model no room to play.

A Real Comparison

With the same request “create a 300x300 concrete column,” tested in three configurations:

  • Pure LLM (no constraints): goes straight to .FirstOrDefault(s => s.Name.Contains("300x300")) — fabrication
  • LLM + FORBIDDEN prompt: still bypasses the ban “creatively” about 30% of the time
  • LLM + FORBIDDEN prompt + Dynamic Choices: 100% of runs list the real candidates and ask the user to choose

The conclusion is direct: prompt constraints alone aren’t enough — they must be paired with real data injection. Together they form the complete resistance training solution.

Resistance training: 5 layers of defense, flipping the model from "actively guessing" to "explicitly asking"


9. Why Not Agent + grep

Throughout building this system, a question I revisited:

Now that Agent frameworks (Claude Code, Cursor, etc.) are so mature, why not let the Agent directly grep Revit API docs and scan SDK source code? Why build a RAG?

This is a tempting option. The benefits of Agent + grep are obvious:

  • No preprocessing required — documents stay where they are
  • No vector store to maintain — no re-embedding cost
  • High flexibility — can handle obscure, deep-dive questions
  • Modify the source = modify the file — takes effect immediately

I seriously considered this path for a while, and ultimately chose RAG as the main pipeline. Below are several key judgment points.

One: Agent + grep Is for “Exploration,” Not “High-Frequency Structured Queries”

Where does the Agent framework really shine? My personal take: it’s good at deep investigation in uncertain places. When the question is “where exactly does this bug come from” or “is this material talking about some obscure usage,” the Agent’s “search and reason” capability is very strong.

But the core task of Revit auto-programming isn’t exploration — it’s high-frequency structured API location. Users repeatedly request operations like “create wall,” “get parameter,” “modify type” — each time needing the same kind of information: method signatures, parameter lists, related enums. In this scenario, having the Agent re-grep every time is wasteful.

RAG’s approach is to push this “repetitive labor” to the offline stage: clean once, embed once, and afterward every query is a pure retrieval operation. In high-frequency scenarios, this trades “preprocessing cost” for “online response speed.”

Two: Errors Amplify Along the Pipeline

It’s easy for the Agent to find a “close enough” method in the docs — same name but wrong signature, right class but wrong namespace, or applying old-version API in new-version context. Each error individually isn’t large, but once it occurs in the retrieval stage, subsequent reasoning and generation will keep building on the wrong premise.

The rerank and hydrate steps in the RAG pipeline are precisely to filter out these “close enough”s. Cutting them and letting the LLM judge by itself would noticeably increase the error rate.

Three: Agent Mode Has No Accumulation

This is the most important point I only realized later.

The Agent runs through one task and starts from scratch the next. This is the inverse of the Tool Solidification I described earlier: Agent emphasizes “dynamic reasoning every time,” Tool emphasizes “solidifying validated reasoning results.”

In my scenario, “dynamic reasoning every time” means re-bearing the uncertainty of generation each time. Tool solidification makes the system faster and more stable the more it’s used — this compounding doesn’t exist in Agent mode.

My Conclusion

The two approaches aren’t substitutes — they’re a division of labor:

  • RAG + Tool: suited for high-frequency, structured, speed- and stability-sensitive main pipelines
  • Agent + grep: suited for low-frequency, open-ended, deep-investigation edge cases

My main pipeline uses RAG, but I keep a fallback — if the user raises an obscure question that RAG can’t recall and the tool library doesn’t cover, fall back to Agent mode to search raw materials. This “stable main pipeline + flexible edge cases” combo is more suitable than betting solely on either.


10. The Execution Side: Putting Constraints Where They Belong

No matter how good the knowledge layer and retrieval layer are, the code ultimately has to run inside Revit. There are several inconspicuous but critical engineering decisions in this segment too.

1. Server Bridges via Plugin, Not Directly Controlling Revit

My server doesn’t run inside the Revit process. It communicates with the Revit plugin via a local TCP / JSON-RPC protocol:

[Server Python]  ←→  [Revit Plugin (.NET)]  ←→  [Revit Document]

The benefit of this separation is a clear execution boundary. The server handles all “generation” work, the plugin handles all “execution” work. The two sides decouple via protocol — debugging, version upgrades, error isolation are all more controllable.

2. Generator Outputs Method Body Only

This is one of the most effective constraints for anti-hallucination.

I didn’t have the model output a full IExternalCommand implementation — I only have it output the method body:

// The model only generates the contents inside the braces
public static object Execute(Document document, object[] parameters)
{
    // ← LLM-written code goes here
}

Why design it this way?

  • Method signature is fixed: parameters and return value are determined; the model doesn’t decide these
  • Transaction declarations are not allowed: the plugin uniformly wraps transactions; the model doesn’t worry about it
  • Importing unauthorized namespaces is not allowed: the plugin pre-usings allowed namespaces
  • Must return object: keeps return value serialization consistent

Compressing the model’s “freedom” into this small frame tightens the output space significantly. This is far more effective than “repeated reminders in the prompt.”
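The frame also makes the pre-compilation static scan cheap. A sketch of the kind of checks it can run (the concrete rule list here is a guess at typical checks, not the project’s exact set):

using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SandboxScan
{
    // Patterns that violate the execution-environment conventions described above.
    private static readonly (string pattern, string reason)[] Rules =
    {
        (@"new\s+Transaction\s*\(",  "transactions are wrapped by the plugin"),
        (@"\bTransactionGroup\b",    "transaction handling is not the model's job"),
        (@"\busing\s+System\.IO\b",  "file system access is not allowed"),
        (@"\bProcess\.Start\b",      "spawning processes is not allowed"),
    };

    public static IReadOnlyList<string> Check(string methodBody)
    {
        var violations = new List<string>();
        foreach (var (pattern, reason) in Rules)
            if (Regex.IsMatch(methodBody, pattern))
                violations.Add(reason);
        return violations; // empty list: hand the code to the Roslyn compiler
    }
}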

3. Roslyn Dynamic Compilation + ExternalEvent Execution

After the plugin receives the code, it:

  1. Wraps the method body into a fixed execution class
  2. Uses Roslyn to dynamically compile to an in-memory assembly
  3. Invokes via Revit’s ExternalEvent mechanism inside a transaction
  4. Serializes the result back

Every step in this chain is observable. On error, you can pinpoint whether it was a compilation failure, transaction failure, or API call failure. This granularity is important for debugging and tool solidification — only when you can pinpoint the failure cause can you decide whether to add this execution to the tool library.
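For the compilation step, here’s a minimal Roslyn sketch: wrap the generated method body into the fixed execution class, compile to an in-memory assembly, and load it. The wrapper template and reference handling are simplified relative to the real plugin, and invocation still happens later on the ExternalEvent handler inside a transaction.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class GeneratedCodeCompiler
{
    public static Assembly Compile(string methodBody, IEnumerable<string> referencePaths)
    {
        // Fixed wrapper: the model only ever supplies the method body.
        string source = $@"
using Autodesk.Revit.DB;
public static class GeneratedCommand
{{
    public static object Execute(Document document, object[] parameters)
    {{
        {methodBody}
    }}
}}";

        var compilation = CSharpCompilation.Create(
            assemblyName: "GeneratedCommand_" + Guid.NewGuid().ToString("N"),
            syntaxTrees: new[] { CSharpSyntaxTree.ParseText(source) },
            references: referencePaths.Select(p => MetadataReference.CreateFromFile(p)),
            options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

        using var ms = new MemoryStream();
        var result = compilation.Emit(ms);
        if (!result.Success)
            throw new InvalidOperationException(string.Join("\n",
                result.Diagnostics.Where(d => d.Severity == DiagnosticSeverity.Error)));

        return Assembly.Load(ms.ToArray()); // invoked later via ExternalEvent
    }
}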


11. Stringing the Whole Pipeline Together

User input

Intent recognition (lightweight: just task type + can hit a tool)

Branch:
├─ Hit tool library → fill parameters → execute directly (millisecond, zero LLM)

└─ Miss → full pipeline

Query Rewrite (natural language → API keywords)

Dual Search (API docs + SDK examples in parallel)

Rerank + Hydrate (re-rank + fetch full content)

Dynamic Choices (query the live Revit document)

LLM generates method body (constrained by retrieval results)

Plugin Roslyn compile + transaction execution

Success?
├─ Yes → optional: parameterize as a tool, add to tool library
└─ No → report error + adjust and retry

The real value of this pipeline isn’t “how many models it uses” — it’s that every layer solves a specific, explainable engineering problem:

  • Data layer: knowledge sources are inaccurate and noisy
  • RAG layer: API alignment, generation boundary
  • Dynamic Choices: project-level entity dependencies
  • Tool layer: reuse and accumulation of execution steps
  • Skills layer: accumulation of team/project principles
  • Generation constraints layer: execution environment boundary
  • Plugin layer: transactions, compilation, units

No layer exists “just to use some technology.” Each exists because the corresponding problem really does happen.

The whole pipeline: every layer corresponds to a specific failure mode


12. Fusion of RAG Mode and MCP Mode

By this point, the entire RAG-based design has been covered. But there’s still a worthwhile question to expand on — what’s the relationship between this approach and the increasingly popular MCP (Model Context Protocol) mode?

While building this RAG project, I’ve also been maintaining another MCP project (revit-mcp-net) — a standalone MCP server that lets AI Agents like Claude Code and Cursor connect directly into Revit operations.

I want to make this clear: these two modes aren’t opposing — they’re each suited to different scenarios. Below I’ll lay out their characteristics, then talk about how to fuse them.

What Each Mode Excels At

RAG mode (this project): structures domain knowledge, pre-indexes it, and retrieves it for injection at runtime. The whole flow is deterministic, debuggable, and low-latency. It excels at:

  • High-frequency structured tasks (creating walls, placing columns, modifying types)
  • Workflows sensitive to response speed
  • Scenarios requiring stable, consistent output
  • Designer-friendly — structured UI guidance, zero-code threshold

MCP mode (Claude Code, Cursor, etc. plugged into MCP servers): lets the AI Agent directly access external tools/services via a standard protocol. The Agent itself decides what tools to call, when, and how to chain them. It excels at:

  • Open-ended exploration (“look at whether the column layout in this project has any spec issues”)
  • Cross-system collaboration (Revit + Excel + Email together)
  • Handling complex scenarios not seen before
  • Programmer-friendly — flexible, powerful, with on-the-fly Agent reasoning

The Fused Workflow

A concrete workflow:

[Programmer] uses Claude Code + revit-mcp-net for open-ended exploration
↓ debugs out a piece of code handling "complex curtain wall parameterization"
↓ the code involves many API calls, rule judgments, edge handling

[Accumulation] saves this successful code as a Tool in the revit-api-rag tool library
↓ writes the embodied judgment principles as a Skill markdown

[Designer] uses revit-api-rag, inputs "create a curtain wall system"
↓ tool library hits directly, Skills auto-applied
↓ result in seconds, no need to understand code at all

This flow captures the best of both modes:

  • MCP mode does “exploration”: programmers have full flexibility to crack complex, never-seen-before problems
  • RAG mode does “distribution”: exploration results get accumulated through Tool/Skills, designers can directly reuse

In other words —

MCP lets one programmer solve one class of complex problem; RAG lets one programmer’s results be reused by a hundred designers.

Fusion Lets Designers and Developers Both Participate

Back to Chapter 1’s core argument — “letting people with logic but no code participate in programming.”

With RAG mode alone, designers can automate within the range of existing tools. But when designers encounter complex scenarios the tool library hasn’t covered, they get stuck.

With MCP mode alone, programmers can solve complex scenarios. But the solutions tend to be “one-off” — next time another designer encounters a similar problem, the programmer has to redo it.

Connect the two:

  • Programmers use MCP mode for frontier exploration → solve the hardest, newest problems
  • Accumulation mechanisms (Tool + Skills) turn successful paths into reusable assets
  • Designers use RAG mode to invoke these assets → no need to reinvent the wheel

Each kind of person does what they do best. Programmers exercise exploration, designers exercise domain judgment, accumulation mechanisms do the translation.

I think this is what “human-machine collaboration” should look like in the AI era — not one tool ruling the world, but an ecosystem: exploration layer, accumulation layer, reuse layer — each with the people most suited.

RAG and MCP aren't substitutes — they're fusion: each kind of person gets their place

The Engineering Connection Point

How do RAG and MCP technically fuse? The most direct way is to use the MCP server as a “fallback channel above the sandbox”:

User request

RAG main pipeline → tool hit? → reuse directly (millisecond)
↓ Miss
RAG generation → Sandbox check → Revit execution
↓ Still fails / user marks "needs more complex handling"
Switch to MCP mode → Agent on-the-fly exploration → debug out a new solution

New solution flows back to RAG tool library or Skills

In this pipeline, RAG is the “daily path,” MCP is the “exploration path,” and the two feed each other through Tool/Skills accumulation. The system gets more complete with use — the broader the programmer’s exploration boundary, the more tools the designer has.

revit-api-rag hasn’t fully implemented this MCP fallback yet — that’s a worthwhile direction next. But the architecture has left the entry point: the post-sandbox fallback hook can directly connect to revit-mcp-net or any other MCP server.

This “dual-mode + accumulation bridge” architecture, I think, applies not only to Revit. Any “professional software + AI automation” scenario — as long as both code-savvy people and non-coding domain experts coexist — this structure could apply.


13. A Case Study

If you’re learning how to build a similar system, the following pieces can be extracted and viewed independently:

  • Raw material pruning and embedding: pipeline/api_parser/ has the complete .chm parsing, HTML noise cleaning, and structured object extraction flow. The core of this part isn’t “what embedding model is used” — it’s how to make dirty data clean. Most RAG projects fail at this step, not at the vector store.
  • Multi-model composition: embedding uses OpenAI text-embedding-3-large, rerank uses Cohere rerank-v3.5, main generation uses Gemini, intent classification uses smaller faster models. Each step uses the most suitable model, with OpenRouter as a unified gateway. This composition is more economical and accurate than betting on one model alone.
  • Agent invocation and Workflow orchestration: mcp_bridge/ shows how to split generation, validation, execution into independently debuggable steps; tool_store shows how to accumulate Workflow’s successful paths into Tools.
  • Skill encapsulation: the skills/ directory divides “operation patterns,” “workflow blueprints,” and “spec constraints” into three classes of knowledge — each with different encapsulation formats and invocation methods. This division can be borrowed directly into other domains.
  • Runtime bridging: revit_plugin/ demonstrates how to use TCP/JSON-RPC to decouple the external generator from the host software, how to use Roslyn for dynamic compilation, and how to use ExternalEvent to safely execute inside transactions. This pattern can be applied to any desktop software with plugin APIs.

Every piece is kept as independent as possible. You can copy just the data cleaning section, or just reference the Tool solidification mechanism — no need to copy the whole thing.

Final Thoughts

When I started this project, what I originally wanted to solve was code generation correctness: make the AI guess less, fabricate less, and produce fewer silently wrong results when writing Revit code.

But the deeper I went, the more I realized the essence isn’t a “model problem” — it’s a “data flow problem”:

The model is already strong enough. What it lacks isn’t intelligence — it’s the facts inside the project that only exist at runtime and can’t be seen from training data: which family types, which levels, which views, what the user currently has selected. Delivering these facts to the model in a structured, reliable, low-latency way — that’s everything this framework does.

This approach worked on Revit. The same logic — in CAD auto-modeling, EDA design assistance, professional simulation software, medical image analysis — any domain with “complex API + real runtime environment + high failure cost” — will hit similar problems and need similar solutions.

I hope it’s useful to you. Whether you want to use this framework directly, build something similar yourself — or just extract one independent design idea from it.