I built an MCP tool for ESMM validation.
Errors in integration specs always got found. Just never in time. So I built an MCP tool to catch them earlier.
The usual pattern: an analyst writes the ESMM and passes it along, and the error turns up either in test analysis or in the developer’s code. Either way you go back to the spec and redo it. It happened regularly, and no tool checked the xpaths beforehand. So I wrote one.
What ESMM is and where the errors come from
ESMM is a mapping table in Excel. An analyst records in it how data moves between systems: for each field, its xpath in the input message, its name in the target, and the conditions that apply. The technical specs, unit tests and implementation code are derived from it.
Example xpath for a REST service:
createInternalPO//POST/request/BODY/accountNumber/numberPart1
And for TIF/WMB:
//getListRequest/identity/userLoginName
The format is different for each technology. REST has a different structure than TIF, arrays are written differently, slash rules differ. All of it has to match exactly against the XSD or YAML file structure. A typo, wrong capitalisation, an extra leading slash: none of these errors is visible in Excel until someone else starts processing the file.
The rules are clear and mechanical. They can be checked automatically.
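To illustrate what "mechanical" means here, a minimal sketch of a format check. The two patterns are inferred only from the two example paths above and are assumptions, not the real rule set, which also covers arrays and comments. The function name and error codes are hypothetical.

```javascript
// Hypothetical format rules, inferred only from the two example paths above.
// REST: <operation>//<HTTP method>/<request|response>/<segment>/<segment>...
const REST_PATTERN =
  /^[A-Za-z_]\w*\/\/(GET|POST|PUT|PATCH|DELETE)\/(request|response)(\/[A-Za-z_]\w*)+$/;
// TIF/WMB: //<root element>/<segment>...
const TIF_PATTERN = /^\/\/[A-Za-z_]\w*(\/[A-Za-z_]\w*)+$/;

// Returns { ok, code }. Any technology other than "REST" is checked as TIF.
function checkPathFormat(rawPath, technology) {
  const path = rawPath.trim();
  if (path === "") return { ok: false, code: "EMPTY_PATH" };
  const pattern = technology === "REST" ? REST_PATTERN : TIF_PATTERN;
  if (pattern.test(path)) return { ok: true, code: null };
  // Give a more specific hint for one of the mistakes named in the text.
  if (technology === "REST" && path.startsWith("/")) {
    return { ok: false, code: "EXTRA_LEADING_SLASH" };
  }
  return { ok: false, code: "BAD_FORMAT" };
}
```

A format check like this only catches malformed paths. Whether the path actually exists in the XSD or YAML still needs a lookup against those files.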
What MCP is and what you can build with it
Before the implementation, a quick word on MCP as a concept, because it gets talked about a lot but rarely explained concretely.
MCP (Model Context Protocol) is a protocol that defines how an AI client (VS Code Copilot, Claude Desktop, …) communicates with an external server. The server runs locally or somewhere on the network, the client connects, and from that point the AI has access to whatever the server exposes.
A server can offer four types of things:
Tools are functions the AI can call. In my case there are six: workspace discovery, temp folder creation, XLSX to markdown conversion, orchestration analysis, IMS file lookup, and validation input preparation. The AI gets a list of available tools with descriptions and decides itself when and how to call them.
Resources are static data or knowledge the AI can read. Mine are validation rules: a full spec of what a valid xpath looks like for REST and for TIF, what counts as a comment, what gets checked and what does not. The AI loads them as context before it starts validating.
Prompts are predefined templates that describe how to start working. Instead of the user writing an instruction from scratch, they pick a prompt and the server prepares a structured input. My workflow prompt looks like this:
You are an AI agent for complete ESMM validation. Run steps 1–8 sequentially.
Do not skip, do not parallelize.
1. discover_workspace_structure – find the service, save SERVICE_ROOT_PATH.
2. create_temp_validation – create a folder for artifacts.
3. convert_to_md – convert XLSX, head.md must be created.
4. analyze_orchestration_info – read orchestration, find services for IMS lookup.
5. find_ims_service – find files for each service.
6. create_ai_validation_sampling – build sampling_input.md.
7. auto_ai_validation – run sampling, save result.
8. Final report – summarise results or error.
The user writes a service name, triggers the prompt and the server takes them through the whole workflow automatically.
Sampling is the fourth type and the most interesting one. Strictly speaking it is a capability the client declares rather than something the server offers: the server can ask the client through the protocol to call an AI model on its behalf. More on that below.
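To make the prompt type concrete, here is a sketch of how a server might turn the user's service name into the result of a prompts/get request. The { description, messages } shape follows the MCP specification. The function name, the wording of the text and the idea of passing the service name as the only argument are assumptions for illustration.

```javascript
// Tool names taken from the workflow prompt above, in execution order.
const WORKFLOW_STEPS = [
  "discover_workspace_structure",
  "create_temp_validation",
  "convert_to_md",
  "analyze_orchestration_info",
  "find_ims_service",
  "create_ai_validation_sampling",
  "auto_ai_validation",
];

// Hypothetical helper: builds a prompts/get result for one service.
function buildWorkflowPrompt(serviceName) {
  if (typeof serviceName !== "string" || serviceName.trim() === "") {
    throw new Error("serviceName is required");
  }
  const name = serviceName.trim();
  const steps = WORKFLOW_STEPS.map((tool, i) => `${i + 1}. ${tool}`).join("\n");
  return {
    description: `Complete ESMM validation for ${name}`,
    messages: [
      {
        role: "user",
        content: {
          type: "text",
          text:
            `Validate the ESMM of service "${name}".\n` +
            "Run the steps sequentially. Do not skip, do not parallelize.\n" +
            steps,
        },
      },
    ],
  };
}
```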
In index.js the initialisation looks like this:
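// Import path assumed from the official MCP TypeScript/JavaScript SDK.
import { Server } from "@modelcontextprotocol/sdk/server/index.js";

this.server = new Server(
  { name: "esmm-validation-server", version: "1.0.0" },
  {
    capabilities: {
      tools: {},
      resources: {},
      prompts: {},
      sampling: { createMessage: true },
    },
  }
);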
Four lines, four capability types. The server communicates over stdio: the client starts it as a subprocess and everything goes through standard input and output.
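For a feel of what travels over stdio: MCP messages are JSON-RPC 2.0 objects, one per line. The sketch below shows a tools/call request and a small encoder and decoder. In practice the SDK's stdio transport handles all of this, and the serviceName argument is a hypothetical parameter, not the tool's confirmed input.

```javascript
// Serialises one JSON-RPC 2.0 message as a single newline-terminated line.
function encodeMessage(message) {
  return JSON.stringify(message) + "\n";
}

// Splits a chunk of stream data into complete messages and returns the
// unfinished tail so it can be prepended to the next chunk.
function decodeMessages(buffer) {
  const parts = buffer.split("\n");
  const rest = parts.pop();
  const messages = parts
    .filter((part) => part.trim() !== "")
    .map((part) => JSON.parse(part));
  return { messages, rest };
}

// What a client might send to call the first tool of the workflow.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "discover_workspace_structure",
    arguments: { serviceName: "createInternalPO" }, // hypothetical argument
  },
};
```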
How it works step by step
The first six steps are a preparation pipeline. The server walks the repository structure and finds the service folder. It creates a temp_validation folder for intermediate results. It converts XLSX files to markdown, each sheet as a separate table. From the header sheet it reads which services are orchestrated and what technology they use. For each service it finds the corresponding files.
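The conversion step can be sketched roughly as follows, assuming each sheet has already been read into an array of rows. Reading the XLSX itself needs a spreadsheet library and is left out. The function name is hypothetical, and treating the first row as the header is an assumption.

```javascript
// Converts one sheet, given as an array of rows (arrays of cell values),
// into a markdown table. The first row is treated as the header.
function sheetToMarkdown(rows) {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((row) => row.length));
  // Escape pipes and flatten line breaks so a cell cannot break the table.
  const cell = (value) =>
    String(value ?? "").replace(/\|/g, "\\|").replace(/\r?\n/g, " ").trim();
  // Pad short rows with empty cells so every line has the same column count.
  const line = (row) =>
    "| " + Array.from({ length: width }, (_, i) => cell(row[i])).join(" | ") + " |";
  const [header, ...body] = rows;
  const separator = "| " + Array(width).fill("---").join(" | ") + " |";
  return [line(header), separator, ...body.map(line)].join("\n");
}
```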
File lookup was an interesting problem because repository structures are not always consistent. I implemented multi-stage searching: first an exact match by folder name, then pattern matching by system prefix, then a recursive search of the whole folder. Only then does it say nothing was found.
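The three stages might look roughly like this. To keep the sketch self-contained it works on a plain list of repo-relative folder paths rather than on the file system, and the matching rules (prefix at the start, service name at the end, case-insensitive deep search) are assumptions, not the tool's actual logic.

```javascript
// Multi-stage folder lookup over repo-relative folder paths that use
// forward slashes. All names and matching rules here are hypothetical.
function findServiceFolder(folders, serviceName, systemPrefix) {
  const baseName = (p) => p.split("/").pop();
  const topLevel = folders.filter((p) => !p.includes("/"));
  const wanted = serviceName.toLowerCase();

  // Stage 1: exact folder-name match at the top level.
  const exact = topLevel.find((p) => p === serviceName);
  if (exact) return { stage: "exact", path: exact };

  // Stage 2: pattern match by system prefix, e.g. "<PREFIX>_<service>".
  const byPrefix = topLevel.find((p) => {
    const name = p.toLowerCase();
    return name.startsWith(systemPrefix.toLowerCase()) && name.endsWith(wanted);
  });
  if (byPrefix) return { stage: "prefix", path: byPrefix };

  // Stage 3: recursive search at any depth, case-insensitive.
  const deep = folders.find((p) => baseName(p).toLowerCase() === wanted);
  if (deep) return { stage: "recursive", path: deep };

  // Only now report that nothing was found.
  return { stage: "none", path: null };
}
```

Returning the stage alongside the path makes it visible when a match came from a fallback, which is a hint that the repository naming is inconsistent.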
After finding all files it builds sampling_input.md: one large markdown document with all data combined. ESMM tables, XSD structures, YAML definitions.
MCP Sampling in practice
The seventh step is the key one. Instead of calling an AI API directly I used Sampling: the server asks the client to call the model on its behalf.
Why do it this way? The client (Copilot) handles authentication, rate limiting and model selection itself, so the server does not need to deal with any of that. Unless the request asks for more through the includeContext parameter and the client agrees, the model sees only the messages the server sends, so everything needed for validation has to be in sampling_input.md.
Large ESMM files had to be split into blocks of 40,000 characters, otherwise they would not fit within the token limit:
const MAX_BLOCK = 40000; // characters per block, not tokens

// Cut the combined input into fixed-size character blocks.
const blocks = [];
for (let i = 0; i < samplingContent.length; i += MAX_BLOCK) {
  blocks.push(samplingContent.substring(i, i + MAX_BLOCK));
}

// One user message per block; later blocks are marked as continuations.
const messages = blocks.map((block, idx) => ({
  role: "user",
  content: {
    type: "text",
    text: idx === 0 ? block : `# CONTINUATION\n${block}`,
  },
}));

// Ask the client to run the model on the server's behalf.
const samplingResponse = await server.createMessage({
  messages,
  systemPrompt,
  maxTokens: 16000,
  modelPreferences: { intelligencePriority: 0.9 },
});
Each block is a separate message. The second and later ones start with a # CONTINUATION header so the AI knows it is getting sequential data. The system prompt defines the exact validation rules.
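One caveat of splitting at fixed character offsets is that a block boundary can land in the middle of a table row. A possible variant, not the one the tool uses, breaks only at line boundaries. The function name is hypothetical.

```javascript
// Splits text into blocks of at most maxChars characters, breaking only at
// line boundaries so that no markdown table row is cut in half. A single
// line longer than maxChars is still hard-split as a last resort.
// Blank lines that fall exactly on a block boundary are dropped.
function splitByLines(text, maxChars) {
  const blocks = [];
  let current = "";
  for (const line of text.split("\n")) {
    const candidate = current === "" ? line : current + "\n" + line;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    // The line does not fit: close the current block and start a new one.
    if (current !== "") blocks.push(current);
    let rest = line;
    while (rest.length > maxChars) {
      blocks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current !== "") blocks.push(current);
  return blocks;
}
```

The result can be fed into the same messages mapping as before, with the # CONTINUATION header on every block after the first.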
The output is a markdown table:
| index | file | sheet | row | column | raw_path | error_code | expected |
If the AI returns something unusable or an empty response, the tool discards the result and returns an empty skeleton. Bad results do not propagate further.
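A sketch of how that guard could work: accept the response only if it contains a table with exactly the expected header, otherwise fall back to the empty skeleton. The function names and the returned object shape are assumptions, and escaped pipes inside cells are not handled.

```javascript
const COLUMNS = [
  "index", "file", "sheet", "row", "column", "raw_path", "error_code", "expected",
];
const toRow = (cells) => `| ${cells.join(" | ")} |`;
const EMPTY_SKELETON = [toRow(COLUMNS), toRow(COLUMNS.map(() => "---"))].join("\n");

// Splits one markdown table line into trimmed cells.
function splitCells(line) {
  return line.replace(/^\|/, "").replace(/\|$/, "").split("|").map((c) => c.trim());
}

// Hypothetical guard: returns the parsed rows, or the empty skeleton when
// the model's response is empty or does not contain the expected table.
function sanitizeResult(responseText) {
  const tableLines = String(responseText ?? "")
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.startsWith("|"));
  const headerOk =
    tableLines.length >= 2 &&
    splitCells(tableLines[0]).join("|") === COLUMNS.join("|");
  if (!headerOk) return { usable: false, rows: [], table: EMPTY_SKELETON };
  // Skip header and separator; keep only rows with the full column count.
  const rows = tableLines
    .slice(2)
    .map(splitCells)
    .filter((r) => r.length === COLUMNS.length);
  return { usable: true, rows, table: [EMPTY_SKELETON, ...rows.map(toRow)].join("\n") };
}
```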
What does not work and what I am fixing
It would not be fair to only write about what works.
The biggest practical problem is the context window. With a larger number of files or more complex XSD structures the context gets exhausted before the AI finishes validating. Sampling breaks off or returns an incomplete result. I handle it by splitting into smaller blocks, but with really large ESMM files it still hits the limit.
Second issue: having the user trigger eight sequential steps manually is inconvenient in practice. I built a simpler three-step workflow that compresses the whole process, and the user triggers it once. Same result, much less friction.
Third thing, a positive discovery this time: the model matters a lot. claude-sonnet-4-6 with Thinking mode handles things where other models get stuck. Specifically: when the tool cannot find a file due to a naming inconsistency, the model identifies where the problem is, fills in the missing input and continues into the sampling phase without the user having to step in. That is a difference that saves real time.
Where it is now
Colleagues in the integration department are starting to use it. They run validation before the ESMM goes to test analysis and errors that used to appear at the tester or developer stage are caught earlier now.
The project is internal and still evolving. But the principle transfers: take a spreadsheet spec, compare it against the actual files, return what does not match. MCP gives that a clean structure: tools for actions, resources for knowledge, prompts for workflow, sampling for AI. Four building blocks and from them you can put together something that behaves like a proper participant in the development environment, not an isolated script.