Building an AI Agent for Test Failure Investigation with Function Calling
When tests fail in complex codebases, developers often spend significant time investigating the root cause. This post explores building an AI-powered agent that automatically investigates test failures by analyzing source code, database state, git history, and test reports through function calling.
Consider a typical test failure scenario:
- An integration test fails with an assertion error
- The developer needs to:
- Read the test output to understand the failure
- Locate the code being tested
- Review recent changes that might have caused the issue
- Analyze the code
- Query the database, if needed, to verify state
This investigation process is repetitive and time-consuming. Can we automate it with AI?
Solution Overview
We’ll build a command-line AI agent that:
- Ingests a codebase into a vector database (RAG - Retrieval Augmented Generation)
- Uses function calling to interact with files, git, and databases
- Investigates test failures by analyzing Surefire reports and source code
- Generates detailed reports explaining the root cause
The agent will support both cloud-based (Google Gemini) and local (Ollama) LLM providers.
Architecture
┌─────────────────┐
│ CLI Command │ (Picocli)
└────────┬────────┘
│
▼
┌─────────────────┐ ┌──────────────┐
│ AI Agent │◄────►│ LLM Provider │
│ (LangChain4j) │ │ Gemini/Ollama│
└────────┬────────┘ └──────────────┘
│
├──► FileTools (read source code)
├──► GitTools (check history)
├──► DatabaseTools (query test DB)
│
▼
┌─────────────────┐
│ PGVector │ ◄─── Knowledge Base (RAG)
│ (Embeddings) │
└─────────────────┘
Step-by-Step Implementation
Step 1: Project Setup
Create a Quarkus Maven project with the following dependencies:
<dependencies>
<!-- CLI Framework -->
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-picocli</artifactId>
</dependency>
<!-- LangChain4j - AI Providers -->
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-ai-gemini</artifactId>
</dependency>
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-ollama</artifactId>
</dependency>
<!-- Vector Store for RAG -->
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-pgvector</artifactId>
</dependency>
<!-- Embedding Model -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embeddings-bge-small-en-q</artifactId>
</dependency>
<!-- Database -->
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-hibernate-orm-panache</artifactId>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-jdbc-postgresql</artifactId>
</dependency>
</dependencies>
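The dependency entries above omit versions, so they need to be managed by a BOM. A typical dependencyManagement block would look roughly like this (the version properties are placeholders, not values taken from the project):
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>io.quarkus.platform</groupId>
            <artifactId>quarkus-bom</artifactId>
            <version>${quarkus.platform.version}</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
        <dependency>
            <groupId>io.quarkiverse.langchain4j</groupId>
            <artifactId>quarkus-langchain4j-bom</artifactId>
            <version>${quarkus-langchain4j.version}</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>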
Step 2: Docker Services
Create docker-compose.yml for required services:
version: '3.8'
services:
# PostgreSQL with pgvector extension for storing embeddings
pgvector:
image: pgvector/pgvector:pg17
environment:
- POSTGRES_HOST_AUTH_METHOD=trust
- POSTGRES_USER=doctor
- POSTGRES_PASSWORD=doctor
- POSTGRES_DB=doctor
ports:
- "127.0.0.1:5433:5432"
volumes:
- ./pgvector-init.sh:/docker-entrypoint-initdb.d/pgvector-init.sh:z
# Ollama for local LLM inference (optional)
ollama:
image: ollama/ollama:latest
ports:
- "127.0.0.1:11434:11434"
volumes:
- ollama-data:/root/.ollama
volumes:
ollama-data:
Create pgvector-init.sh:
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
CREATE EXTENSION IF NOT EXISTS vector;
EOSQL
Step 3: Basic Configuration
Create src/main/resources/application.properties:
# Embedding Model Configuration
quarkus.langchain4j.embedding-model.provider=dev.langchain4j.model.embedding.onnx.bgesmallenq.BgeSmallEnQuantizedEmbeddingModel
# Vector Store Configuration
quarkus.langchain4j.pgvector.dimension=384
# Datasource for pgvector
quarkus.datasource.db-kind=postgresql
quarkus.datasource.username=doctor
quarkus.datasource.password=doctor
quarkus.datasource.jdbc.url=jdbc:postgresql://localhost:5433/doctor
# Hibernate Configuration
quarkus.hibernate-orm.database.default-schema=public
quarkus.hibernate-orm.log.sql=false
quarkus.hibernate-orm.schema-management.strategy=update
# Package as uber-jar
quarkus.package.jar.type=uber-jar
Step 4: AI Provider Configurations
Create src/main/resources/application-gemini.properties:
# Google AI Gemini Configuration
quarkus.langchain4j.chat-model.provider=ai-gemini
quarkus.langchain4j.ai.gemini.api-key=${GEMINI_API_KEY:}
quarkus.langchain4j.ai.gemini.timeout=60s
quarkus.langchain4j.ai.gemini.log-requests=false
quarkus.langchain4j.ai.gemini.log-responses=false
quarkus.langchain4j.ai.gemini.chat-model.model-name=gemini-1.5-flash
Create src/main/resources/application-ollama.properties:
# Ollama Local Configuration
quarkus.langchain4j.chat-model.provider=ollama
quarkus.langchain4j.ollama.base-url=http://localhost:11434
# Ollama is much slower than Gemini or other remote models, hence the generous timeout.
quarkus.langchain4j.ollama.timeout=300s
quarkus.langchain4j.ollama.chat-model.model-id=mistral
quarkus.langchain4j.ollama.chat-model.temperature=0.1
Step 5: Define the AI Agent Interface
This is where LangChain4j’s magic happens. Define a simple interface:
@ApplicationScoped
@RegisterAiService(tools = {FileTools.class, GitTools.class, DatabaseTools.class})
@SystemMessage(fromResource = ".prompt")
public interface DoctorAgent {
String chat(@UserMessage String userMessage);
}
The @RegisterAiService annotation tells Quarkus to:
- Generate an implementation that connects to the configured LLM
- Enable the specified tools for function calling
- Use the system prompt from the .prompt file
Step 6: Create the System Prompt
Create src/main/resources/.prompt:
You are a test failure debugger. Your job: find what broke and show the exact code.
## Your Tools
**File Tools**:
- readFile(filePath) - read entire file content (use absolute paths)
- searchInFile(filePath, text, contextLines) - find text in file with context
**Git Tools**:
- getRecentCommits(repoPath, baseBranch, limit) - see recent commits
- getFilesChangedInCommit(repoPath, commitHash) - files changed in a commit
- getFileDiff(repoPath, commitHash, filePath) - what changed in a file
**Database Tools**:
- executeQuery(sql) - run SQL query on test database
## Process
1. Analyze the test failure - understand what the test expects vs what failed
2. Identify the class being TESTED (not the test class itself)
3. Read the ENTIRE production code file - use readFile()
4. Look for: commented code, extra characters, typos, logic errors
5. Only if code looks correct - check git history
6. Show the actual problematic code you found
## Response Format
**Problem:** [what's broken - be specific]
**Code found in [filename]:**
[show the actual code with the bug]
**Fix:** [what to change]
## Critical Rules
- Always use readFile() first - read the ENTIRE file
- Look for: commented code, extra characters, typos, wrong calculations
- MUST show actual code from readFile/searchInFile results
- Only check git if the code looks correct but test still fails
This prompt guides the LLM on how to use the tools effectively.
Step 7: Implement Function Calling Tools
The @Tool annotation marks methods that the LLM can invoke. Each tool receives parameters from the LLM and returns a string result.
FileTools.java
@ApplicationScoped
public class FileTools {
@Tool("Read the entire content of a file. Use absolute paths.")
public String readFile(String filePath) {
// Read file, handle errors, truncate if too large
// ...
}
@Tool("Search for text in a file and return matching lines with context.")
public String searchInFile(String filePath, String searchText, int contextLines) {
// Search file, return matching lines with surrounding context
// ...
}
}
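To make the sketch concrete, here is one way readFile could be filled in (a minimal sketch: the 100,000-character cap and the error-message wording are assumptions, not values from the project):
import dev.langchain4j.agent.tool.Tool;
import jakarta.enterprise.context.ApplicationScoped;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
@ApplicationScoped
public class FileTools {
    // Cap tool output so a huge file does not overflow the model's context window (assumption)
    private static final int MAX_CHARS = 100_000;
    @Tool("Read the entire content of a file. Use absolute paths.")
    public String readFile(String filePath) {
        Path path = Path.of(filePath);
        if (!Files.isRegularFile(path)) {
            return "ERROR: file not found: " + filePath;
        }
        try {
            String content = Files.readString(path);
            return content.length() > MAX_CHARS
                    ? content.substring(0, MAX_CHARS) + "\n... [truncated]"
                    : content;
        } catch (IOException e) {
            return "ERROR: could not read " + filePath + ": " + e.getMessage();
        }
    }
}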
GitTools.java
@ApplicationScoped
public class GitTools {
@Tool("Get recent commits in the current branch compared to a base branch.")
public String getRecentCommits(String repoPath, String baseBranch, int limit) {
// Execute: git log --oneline -n <limit> <baseBranch>..HEAD
// ...
}
@Tool("Get the list of files changed in a specific commit.")
public String getFilesChangedInCommit(String repoPath, String commitHash) {
// Execute: git show --name-only --pretty=format: <commitHash>
// ...
}
@Tool("Get the diff of a specific file in a commit.")
public String getFileDiff(String repoPath, String commitHash, String filePath) {
// Execute: git show <commitHash> -- <filePath>
// ...
}
}
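The git tools can simply shell out to the git CLI. A minimal sketch of getRecentCommits plus the shared plumbing is shown below; the other two tool methods would delegate to the same helper, and the 30-second timeout and error-message wording are assumptions:
import dev.langchain4j.agent.tool.Tool;
import jakarta.enterprise.context.ApplicationScoped;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
@ApplicationScoped
public class GitTools {
    @Tool("Get recent commits in the current branch compared to a base branch.")
    public String getRecentCommits(String repoPath, String baseBranch, int limit) {
        return runGit(repoPath, "log", "--oneline", "-n", String.valueOf(limit), baseBranch + "..HEAD");
    }
    // Runs a git command inside the repository and returns its combined stdout/stderr
    private String runGit(String repoPath, String... args) {
        List<String> command = new ArrayList<>();
        command.add("git");
        command.addAll(List.of(args));
        try {
            Process process = new ProcessBuilder(command)
                    .directory(new File(repoPath))
                    .redirectErrorStream(true)
                    .start();
            String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            if (!process.waitFor(30, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return "ERROR: git command timed out";
            }
            return output.isBlank() ? "(no output)" : output;
        } catch (Exception e) {
            return "ERROR: failed to run git: " + e.getMessage();
        }
    }
}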
DatabaseTools.java
@ApplicationScoped
public class DatabaseTools {
@Inject
@DataSource("test-db")
AgroalDataSource dataSource;
@Tool("Execute a SQL SELECT query on the test database. Only SELECT queries allowed.")
public String executeQuery(String sql) {
// Validate SELECT only, execute query with timeout and row limit
// Format results as table
// ...
}
}
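A sketch of executeQuery with the SELECT-only guard, a query timeout, and a row cap (the 10-second timeout, 100-row limit, and result formatting are assumptions). It lives inside the DatabaseTools class above and needs the java.sql imports (Connection, Statement, ResultSet, ResultSetMetaData, SQLException):
@Tool("Execute a SQL SELECT query on the test database. Only SELECT queries allowed.")
public String executeQuery(String sql) {
    String trimmed = sql.trim();
    if (!trimmed.toLowerCase().startsWith("select")) {
        return "ERROR: only SELECT queries are allowed";
    }
    try (Connection connection = dataSource.getConnection();
         Statement statement = connection.createStatement()) {
        statement.setQueryTimeout(10); // seconds (assumption)
        statement.setMaxRows(100);     // row cap (assumption)
        StringBuilder result = new StringBuilder();
        int rows = 0;
        try (ResultSet rs = statement.executeQuery(trimmed)) {
            ResultSetMetaData meta = rs.getMetaData();
            int columns = meta.getColumnCount();
            // Header row with column labels
            for (int i = 1; i <= columns; i++) {
                result.append(meta.getColumnLabel(i)).append(i < columns ? " | " : "\n");
            }
            // Data rows
            while (rs.next()) {
                rows++;
                for (int i = 1; i <= columns; i++) {
                    result.append(rs.getString(i)).append(i < columns ? " | " : "\n");
                }
            }
        }
        if (rows == 0) {
            result.append("(no rows)");
        }
        return result.toString();
    } catch (SQLException e) {
        return "ERROR: query failed: " + e.getMessage();
    }
}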
Key points:
- Tools are simple methods that return strings
- The LLM decides when and how to call them based on the system prompt
- Error handling is crucial - return descriptive error messages
- Add safety limits (timeouts, max rows, file size limits)
Step 8: Implement RAG (Retrieval Augmented Generation)
The agent needs context about the codebase. We’ll use RAG to ingest and retrieve relevant code:
IngestionService.java
@ApplicationScoped
public class IngestionService {
@Inject EmbeddingStore<TextSegment> embeddingStore;
@Inject EmbeddingModel embeddingModel;
@ConfigProperty(name = "ingest.allowed-extensions")
List<String> allowedExtensions;
@ConfigProperty(name = "ingest.skip-directories")
List<String> skipDirectories;
@ConfigProperty(name = "ingest.max-file-size")
long maxFileSize;
@ConfigProperty(name = "ingest.chunk-size")
int chunkSize;
@ConfigProperty(name = "ingest.chunk-overlap")
int chunkOverlap;
@Transactional
public void ingestDirectory(Path sourcePath, boolean clearExisting) {
if (clearExisting) {
IngestionMetadata.clearIngestionMetadata(sourcePath.toString());
}
long startTime = System.currentTimeMillis();
List<Document> documents = loadFilteredDocuments(sourcePath);
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.embeddingStore(embeddingStore)
.embeddingModel(embeddingModel)
.documentSplitter(DocumentSplitters.recursive(
chunkSize, chunkOverlap, new HuggingFaceTokenizer()))
.build();
ingestor.ingest(documents);
long duration = System.currentTimeMillis() - startTime;
// Save metadata
IngestionMetadata metadata = new IngestionMetadata();
metadata.sourcePath = sourcePath.toString();
metadata.ingestedAt = Instant.now();
metadata.documentCount = documents.size();
metadata.durationMs = duration;
metadata.persist();
Log.infof("Ingested %d documents in %d ms", documents.size(), duration);
}
private List<Document> loadFilteredDocuments(Path sourcePath) {
List<Document> documents = new ArrayList<>();
try (Stream<Path> paths = Files.walk(sourcePath)) {
paths.filter(Files::isRegularFile)
.filter(this::shouldIncludeFile)
.forEach(path -> {
try {
String content = Files.readString(path);
if (!content.isBlank()) {
documents.add(Document.from(content,
Metadata.from("source", path.toString())));
}
} catch (Exception e) {
Log.warnf("Skipping file %s: %s", path, e.getMessage());
}
});
} catch (IOException e) {
throw new RuntimeException("Failed to load documents", e);
}
return documents;
}
private boolean shouldIncludeFile(Path path) {
String fileName = path.getFileName().toString();
String pathStr = path.toString();
// Check extension
boolean hasAllowedExtension = allowedExtensions.stream()
.anyMatch(fileName::endsWith);
if (!hasAllowedExtension) return false;
// Check skip directories
boolean inSkipDirectory = skipDirectories.stream()
.anyMatch(pathStr::contains);
if (inSkipDirectory) return false;
// Check file size
try {
long size = Files.size(path);
return size <= maxFileSize && size > 0;
} catch (IOException e) {
return false;
}
}
}
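The ingest.* properties injected above need corresponding entries in application.properties. Example values (assumptions; tune them for your codebase):
# Ingestion filters and chunking (example values)
ingest.allowed-extensions=.java,.kt,.xml,.properties,.md
ingest.skip-directories=target,node_modules,.git
ingest.max-file-size=1048576
ingest.chunk-size=500
ingest.chunk-overlap=50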
RagConfiguration.java
@ApplicationScoped
public class RagConfiguration {
@ConfigProperty(name = "rag.max-results", defaultValue = "20")
int maxResults;
@ConfigProperty(name = "rag.min-score", defaultValue = "0.5")
double minScore;
@Produces
@ApplicationScoped
public RetrievalAugmentor createRetrievalAugmentor(
EmbeddingModel embeddingModel,
EmbeddingStore<TextSegment> embeddingStore) {
Log.debug("RAG Configuration: Creating RetrievalAugmentor with PGVector");
ContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
.embeddingStore(embeddingStore)
.embeddingModel(embeddingModel)
.maxResults(maxResults)
.minScore(minScore)
.build();
return DefaultRetrievalAugmentor.builder()
.contentRetriever(contentRetriever)
.build();
}
}
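The IngestionMetadata entity referenced throughout is not shown; with Hibernate ORM Panache it could be as simple as the sketch below (the fields mirror the usage above, the static helpers are one possible implementation, and listAllIngestions() in InvestigateCommand can then delegate to IngestionMetadata.listAll()):
import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Entity;
import java.time.Instant;
@Entity
public class IngestionMetadata extends PanacheEntity {
    public String sourcePath;
    public Instant ingestedAt;
    public long documentCount;
    public long durationMs;
    public static boolean isIngested(String sourcePath) {
        // Panache shorthand for: select count(*) ... where sourcePath = ?1
        return count("sourcePath", sourcePath) > 0;
    }
    public static void clearIngestionMetadata(String sourcePath) {
        // Remove previous ingestion records for this repository
        delete("sourcePath", sourcePath);
    }
}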
Step 9: Create CLI Commands
Main Command
@TopCommand
@CommandLine.Command(
mixinStandardHelpOptions = true,
subcommands = {InitCommand.class, InvestigateCommand.class, IngestionStatusCommand.class})
public class DoctorCommand {}
InitCommand.java
@CommandLine.Command(
name = "init",
description = "Initialize the knowledge base by ingesting the codebase")
public class InitCommand implements Runnable {
@CommandLine.Option(
names = {"--repository"},
required = true,
description = "Path to the repository to ingest")
private Path repositoryPath;
@CommandLine.Option(
names = {"--force"},
description = "Force re-ingestion even if already ingested")
private boolean force;
@Inject IngestionService ingestionService;
@Override
public void run() {
Log.info("Initializing knowledge base");
if (!Files.exists(repositoryPath)) {
Log.errorf("Path does not exist: %s", repositoryPath);
return;
}
if (!force && IngestionMetadata.isIngested(repositoryPath.toString())) {
Log.info("Repository already ingested. Use --force to re-ingest.");
return;
}
ingestionService.ingestDirectory(repositoryPath, force);
Log.info("Initialization complete");
}
}
InvestigateCommand.java
@CommandLine.Command(
name = "investigate",
description = "Investigate a test failure using the AI agent")
public class InvestigateCommand implements Runnable {
@CommandLine.Parameters(
index = "0",
description = "Name of the test to investigate or module path")
private String testNameOrModule;
@Inject DoctorAgent agent;
@Inject IngestionService ingestionService;
@Override
public void run() {
String repoBasePath = getRepositoryBasePath();
if (repoBasePath == null) {
Log.error("No ingestion found. Run 'init' command first.");
return;
}
Log.infof("Investigating test: %s", testNameOrModule);
// Find and read Surefire report
String errorMessage = findAndReadSurefireReport(repoBasePath, testNameOrModule);
// Build context for the agent
StringBuilder context = new StringBuilder();
context.append("Repository: ").append(repoBasePath).append("\n\n");
context.append("## Test Failure\n\n");
context.append("```\n");
context.append(errorMessage != null ? errorMessage : "No surefire report found");
context.append("```\n\n");
context.append("Investigate this failure and find the root cause.\n");
// Call the AI agent
Log.info("Agent analysis:");
Log.info("--------------------------------------------------------------------------------");
String response = agent.chat(context.toString());
Log.info(response);
Log.info("--------------------------------------------------------------------------------");
}
private String getRepositoryBasePath() {
List<IngestionMetadata> ingestions = ingestionService.listAllIngestions();
return ingestions.isEmpty() ? null : ingestions.get(0).sourcePath;
}
private String findAndReadSurefireReport(String repoPath, String testName) {
// Logic to find and parse Surefire XML reports
// Returns the error message and stack trace
// Implementation omitted for brevity
return "Test failure details...";
}
}
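For completeness, findAndReadSurefireReport could be sketched as below, assuming the standard Maven layout where reports land in target/surefire-reports as TEST-*.xml files. The string-based failure extraction is deliberately crude; a real implementation would parse the XML properly (imports needed: java.io.File, java.io.IOException, java.nio.file.Files, java.nio.file.Path, java.util.Optional, java.util.stream.Stream):
private String findAndReadSurefireReport(String repoPath, String testName) {
    try (Stream<Path> paths = Files.walk(Path.of(repoPath))) {
        // Standard Surefire layout: <module>/target/surefire-reports/TEST-<class>.xml
        Optional<Path> report = paths
                .filter(p -> p.toString().contains("target" + File.separator + "surefire-reports"))
                .filter(p -> p.getFileName().toString().startsWith("TEST-"))
                .filter(p -> p.getFileName().toString().contains(testName))
                .findFirst();
        if (report.isEmpty()) {
            return null;
        }
        String xml = Files.readString(report.get());
        // Grab the first <failure> or <error> element, which holds the message and stack trace
        int start = xml.indexOf("<failure");
        if (start < 0) {
            start = xml.indexOf("<error");
        }
        if (start < 0) {
            return "Report found but no failure recorded: " + report.get();
        }
        int end = xml.indexOf("</", start);
        return xml.substring(start, end > start ? end : xml.length());
    } catch (IOException e) {
        return null;
    }
}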
Step 10: Build and Run
Build the project:
mvn clean package
Start services:
# For Gemini (cloud)
docker-compose up -d pgvector
# For Ollama (local)
docker-compose up -d
docker-compose exec ollama ollama pull mistral
Initialize the knowledge base:
# With Gemini (requires API key)
export GEMINI_API_KEY="your-key-here"
java -jar target/*-runner.jar init --repository=/path/to/your/project
# With Ollama (local)
java -Dquarkus.profile=ollama -jar target/*-runner.jar init --repository=/path/to/your/project
Investigate a test failure:
# With Gemini
java -jar target/*-runner.jar investigate YourTestName
# With Ollama
java -Dquarkus.profile=ollama -jar target/*-runner.jar investigate YourTestName
Example Output
Investigating test: ContractsComponentTest
Agent analysis:
--------------------------------------------------------------------------------
Problem: The subscription_number field returns "7181111" instead of expected "71811".
Extra "11" characters appended to the subscription number.
Code found in ContractService.java:
public StatusResponse createPartnerContract(PartnerEntitlementsRequest request) {
var subscription = new SubscriptionEntity();
subscription.setSubscriptionNumber(request.getSubscriptionNumber() + "11"); // BUG
subscriptionRepository.persist(subscription);
return StatusResponse.success();
}
Fix: Remove the "+ \"11\"" from line 245. Should be:
subscription.setSubscriptionNumber(request.getSubscriptionNumber());
--------------------------------------------------------------------------------
Cloud vs Local LLMs: A Comparison
After implementing both Google Gemini (cloud) and Ollama (local) support, here are the key differences observed:
Google Gemini (gemini-1.5-flash)
Pros:
- Excellent quality: Consistently identifies bugs correctly, including subtle issues like commented code or extra characters
- Fast: Response time ~2-5 seconds
- Native function calling: Executes tools correctly without confusion
- Free tier: 1,500 requests/day, 1M tokens/minute
- Easy setup: Just requires an API key
Cons:
- Internet required: Cannot work offline
- Privacy: Code is sent to Google’s servers
- Rate limits: Limited to free tier quotas
Ollama (Local Models)
We tested three models:
llama3.2:1b (1.3GB)
- Quality: ❌ Poor - doesn’t understand instructions, provides generic responses
- Speed: ✅ Fast (~10-20s)
- Function calling: ❌ Fails - doesn’t execute tools properly
- Verdict: Not suitable for this use case
llama3.1:8b (4.7GB)
- Quality: ⚠️ Mixed - sometimes works, often returns JSON instead of executing tools
- Speed: ⚠️ Slow (~30-60s)
- Function calling: ⚠️ Limited - understands tools but doesn’t always execute them
- Verdict: Marginal for simple cases
mistral:7b (4GB)
- Quality: ⚠️ Mixed - provides generic analysis, rarely uses tools effectively
- Speed: ⚠️ Medium (~15-30s)
- Function calling: ⚠️ Limited - native support but inconsistent execution
- Verdict: Better than llama but still inferior to Gemini
Recommendation
For production use with complex function calling:
- Primary choice: Google Gemini - superior quality, speed, and reliability
- For experimentation: Ollama with Mistral - best local option, but expect lower quality
- For privacy-critical scenarios: Consider larger Ollama models (70B+) if you have sufficient hardware (40GB+ RAM)
The gap between cloud and local models for function calling is significant. Cloud models like Gemini have been specifically trained and optimized for tool use, while smaller local models struggle with the complexity of deciding when and how to call functions.
Key Learnings
- Function Calling is Complex: Smaller LLMs (7B-8B parameters) struggle with complex function calling scenarios. They often:
- Return JSON descriptions instead of executing functions
- Provide generic responses without using available tools
- Get confused with multiple tool options
- Prompt Engineering Matters: The system prompt is crucial for guiding the LLM:
- Be explicit about the process (step 1, 2, 3…)
- Provide examples of tool usage
- Define clear output format
- Emphasize critical rules
- RAG Enhances Context: Combining function calling with RAG provides:
- Semantic search across the codebase
- Relevant code snippets without explicit file paths
- Better understanding of project structure
- Cloud vs Local Trade-offs:
- Cloud: Better quality, faster, but privacy concerns
- Local: Full privacy, unlimited usage, but lower quality and slower
Future Direction: Model Context Protocol (MCP)
While our implementation uses direct function calling through LangChain4j, it’s worth discussing how this architecture could evolve with the Model Context Protocol (MCP), an emerging open standard for AI-tool integration.
What is MCP?
MCP is a protocol developed by Anthropic that standardizes how AI applications connect to external systems. Think of it as a universal adapter for AI agents to interact with tools, similar to how USB-C standardized device connections.
In our current implementation, we created custom @Tool methods for file operations, git commands, and database queries. With MCP, these would become standardized servers that any AI application can use.
Conceptual Shift
Current Approach: Direct Function Calling
Our agent directly implements tools as Java methods. Each tool is tightly coupled to our application:
AI Agent → FileTools.java → File System
→ GitTools.java → Git Repository
→ DatabaseTools.java → PostgreSQL
Pros:
- Simple and straightforward
- Fast (no network overhead)
- Easy to debug
- Full control over implementation
Cons:
- Tools are specific to this project
- Cannot share tools with other AI applications
- Must implement everything from scratch
- Limited to Java ecosystem
MCP Approach: Standardized Tool Servers
With MCP, tools become independent servers that communicate via a standard protocol:
AI Agent → MCP Client → MCP Server: Filesystem
→ MCP Server: Git
→ MCP Server: Database
Pros:
- Reusability: Same MCP servers work with Claude Desktop, VS Code, custom agents
- Ecosystem: Growing library of community-maintained servers
- Standardization: No need to reinvent common operations
- Language agnostic: Servers can be written in Python, TypeScript, Go, etc.
- Security: Built-in sandboxing and access control
Cons:
- Complexity: Requires running separate server processes
- Performance: Network/IPC overhead for each tool call
- Maturity: Still evolving, limited Java support
- Debugging: Cross-process communication is harder to trace
How Our Agent Would Fit
Instead of implementing FileTools, GitTools, and DatabaseTools ourselves, we could leverage existing MCP servers:
- Filesystem Server: Handles readFile(), searchInFile(), listDirectory()
- Git Server: Provides getCommits(), getDiff(), getBlame()
- Database Server: Executes queries, returns schemas
Our custom code would focus solely on domain-specific logic:
- Parsing Surefire reports
- Identifying test locations
- Formatting investigation reports
- Orchestrating the investigation workflow
Real-World Analogy
Think of it like microservices for AI tools:
- Current approach: Monolithic application where all functionality is built-in
- MCP approach: Microservices where each tool is an independent service
Just as microservices enabled teams to share and reuse backend services, MCP enables sharing AI tools across different applications.
Why This Matters
Imagine you build three AI agents:
- Test failure investigator (this project)
- Code review assistant
- Documentation generator
Without MCP: Implement file reading, git operations, and database queries three times.
With MCP: Write once, reuse everywhere. All three agents connect to the same MCP servers.
Current State & Future
Today (2025):
- MCP is gaining adoption in Python/TypeScript ecosystems
- Limited Java/LangChain4j support
- Best for new projects planning multi-agent systems
Our Choice: We chose direct function calling because:
- Single-purpose agent (test investigation only)
- Performance matters (fast feedback on test failures)
- Simpler to understand and maintain
- Full control over tool behavior
Future Consideration: If you’re building multiple AI agents or want to integrate with existing MCP tools, migrating to MCP would make sense. The architecture is already modular enough to support this transition.
Trade-off Summary
| Consideration | Direct Function Calling | MCP Integration |
|---|---|---|
| Best for | Single-purpose agents | Multi-agent systems |
| Setup | Simple | Complex |
| Performance | Fast | Slower |
| Reusability | Low | High |
| Ecosystem | Limited | Growing |
| Learning curve | Easy | Moderate |
Looking Ahead
MCP represents an interesting evolution in AI tooling. As the ecosystem matures and Java support improves, we might see:
- Native MCP integration in LangChain4j
- Enterprise MCP servers for internal tools
- Standardized tool marketplaces
- Better security and auditing
For now, direct function calling serves our needs well. But MCP is worth watching as a potential future direction, especially if you’re planning to build multiple AI agents that could share common tooling.
Note: A future blog post could explore implementing this same agent using MCP, comparing the practical differences in code, performance, and developer experience.
Conclusion
Building an AI agent for test failure investigation demonstrates the power of combining:
- LangChain4j for LLM integration
- Function calling for dynamic tool use
- RAG for codebase context
- Quarkus for clean, efficient Java applications
While cloud models currently provide superior results for complex function calling, local models are improving rapidly. The architecture presented here supports both, allowing you to choose based on your priorities: quality and speed (Gemini) or privacy and cost (Ollama).
The Model Context Protocol (MCP) represents the future of AI-tool integration, offering standardization, reusability, and a growing ecosystem. While our current implementation uses direct function calling for simplicity and performance, migrating to MCP would enable broader tool reuse and integration with the expanding MCP ecosystem.
The complete source code and detailed setup instructions are available in the project repository.
