AI in Software Testing

Robert Fey

Dec 07, 2025 / 4 min read

Introduction

“AI does not change the laws of testing. It accelerates whatever your architecture already does.”

AI-generated tests are rapidly entering embedded automotive development — from classic ECU logic to safety-critical state machines and model-based control logic. 

The promise is appealing: 

  • Lower test development cost
  • Faster feedback cycles
  • Broader functional and structural coverage
  • Reduced dependency on manual test design

But despite impressive generation speed, AI does not fix the core challenges of software testing. It amplifies the strengths and weaknesses of the underlying test architecture. 

This article explains why, and provides a rigorous conceptual foundation for organizations preparing to adopt AI-driven testing safely and effectively.

1. Every Test Has Two Logical Components: Stimulation and Intent

Across all tools, domains, and notations, every software test consists of exactly two elements:

Stimulation Layer

“How we provoke behavior.” 

This includes all inputs and execution conditions applied to the system under test: 

  • API calls 
  • Signal trajectories 
  • Timing sequences 
  • Mode switches 
  • Environment conditions 
  • State initialization 

The Stimulation Layer is implementation-coupled and highly volatile. It must change whenever: 

  • The code is refactored
  • Timing behavior shifts
  • Integration behavior changes
  • Interfaces evolve
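
To make the layer concrete, here is a minimal C sketch of a pure stimulation artifact. The SUT (a first-order filter behind a hypothetical sut_init/sut_step interface) and the ramp trajectory are invented for illustration; the point is that this layer contains only inputs, timing, and initialization, never expectations.

#include <stdio.h>

/* Hypothetical SUT stand-in: a first-order low-pass filter. In a real
 * project this would be the generated ECU code behind a test harness. */
static double flt_state = 0.0;
static void   sut_init(void)         { flt_state = 0.0; }
static double sut_step(double input) { flt_state = 0.9 * flt_state + 0.1 * input; return flt_state; }

int main(void) {
    sut_init();                                      /* state initialization */
    /* Signal trajectory: ramp 0 -> 5 over 50 samples, then hold.
     * Only stimulation lives here; judging the recorded trace is the
     * Intent Layer's job. */
    for (int k = 0; k < 100; k++) {
        double input  = (k < 50) ? 0.1 * k : 5.0;
        double output = sut_step(input);
        printf("%d;%.3f;%.3f\n", k, input, output);  /* record the trace */
    }
    return 0;
}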

Intent Layer (Expected Behavior / Invariants)

“How we judge behavior.” 

Intent includes: 

  • Functional invariants
  • State-machine correctness
  • Timing and hysteresis rules
  • Safety constraints
  • Output validity
  • Logical correctness over time

Intent is requirement-coupled and low-volatility. Its lifecycle is tied to: 

  • Functional truth 
  • Domain rules 
  • Safety requirements 
  • Product variants 

Intent is not step-based expected values. It is the truth model that determines whether behavior is correct.
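
What an intent artifact can look like, as a hedged sketch: the requirement ID, the 2.0/4.0 hysteresis thresholds, and the checker name are invented for illustration. The invariant judges any recorded trace over time and knows nothing about how the trace was stimulated.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical invariant for a requirement "REQ-HYST-01": the flag must
 * switch ON once the signal exceeds 4.0 and stay ON until the signal
 * falls below 2.0 (hysteresis). The checker consumes a recorded trace;
 * it encodes the truth model, not step-based expected values. */
static bool check_hysteresis(const double *signal, const bool *flag, int n) {
    bool expected = false;                    /* truth model of the flag */
    for (int k = 0; k < n; k++) {
        if (signal[k] > 4.0)      expected = true;
        else if (signal[k] < 2.0) expected = false;
        /* between the thresholds the previous value must be held */
        if (flag[k] != expected)  return false;     /* intent violated */
    }
    return true;
}

int main(void) {
    const double sig[]  = { 1.0, 3.0, 4.5, 3.0, 1.5, 3.0 };
    const bool   flag[] = { false, false, true, true, false, false };
    printf("REQ-HYST-01 %s\n",
           check_hysteresis(sig, flag, 6) ? "holds" : "violated");
    return 0;
}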

When both layers are stored inside a single artifact — the classical test case — their incompatible lifecycles are forced to evolve together. This is the structural root of drift.

2. False Positives vs. False Negatives — and Why AI Makes the First Category Critical

AI dramatically increases the number of generated tests. It also multiplies the opportunities for evaluation errors.

False Positives (dangerous)

A test says behavior is correct, even though it is wrong. Causes include:

  • Missing, ambiguous, or incomplete requirements
  • Weak expected values
  • Tolerance windows that accidentally hide defects
  • Incorrect or overly generalized domain assumptions
  • AI “smoothing away” edge cases
  • Expected values derived from code structure instead of functional intent
  • Implicit assumptions not encoded into the Intent Layer

False Positives hide defects. They create: 

  • Misleading confidence
  • Meaningless coverage metrics
  • Defects only discovered in HiL, vehicle tests, or customer fleets
  • The most expensive debugging scenarios

A False Positive is a silent failure of the testing process itself.
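
A compact illustration of the tolerance-window failure mode, with invented numbers: assume the requirement demands 3.0 ± 0.5, but the test's tolerance was widened to ± 2.0 to keep it stable.

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical requirement: output = 3.0 +/- 0.5.
     * The test's tolerance was widened to +/- 2.0 to "stabilize" it. */
    double output    = 4.8;                          /* defective SUT output */
    bool   test_pass = fabs(output - 3.0) <= 2.0;    /* widened tolerance    */
    bool   req_holds = fabs(output - 3.0) <= 0.5;    /* functional truth     */
    /* Prints: test verdict: PASS, requirement: VIOLATED -- a False Positive. */
    printf("test verdict: %s, requirement: %s\n",
           test_pass ? "PASS" : "FAIL",
           req_holds ? "met"  : "VIOLATED");
    return 0;
}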

False Negatives (annoying, but repairable)

A test says behavior is wrong, even though the system is correct. Causes include: 

  • Overly strict thresholds 
  • Incomplete environment setup 
  • Incorrect timing windows 
  • Overly narrow or misaligned invariants 

False Negatives trigger unnecessary debugging, but they do not hide defects. They cost time — not safety.

3. Where Intent Drift Comes From: Lifecycle Mismatch

“Tests do not drift because humans make mistakes. They drift because their architecture binds incompatible lifecycles.” 

The three components involved have fundamentally different rates of change:

Component                         Lifecycle   Driver
Stimulation                       High        Code changes, refactoring, integration behavior
Intent                            Low         Requirements, safety rules, functional invariants
Logic (SUT Execution Behavior)    Medium      Implementation evolution

When Stimulation and Intent live inside one artifact: 

  1. Every code change → forces updates to stimulation
  2. Every stimulation update → touches expected values
  3. Every touched expected value → risks weakening intent
  4. Accumulated over time → tests align with code, not requirements

This is Intent Drift.

Formal Definition

Intent Drift is the progressive misalignment between test expectations and functional requirements, caused by architectural coupling of fast-changing stimulation with slow-changing intent.

4. Why Classical Test Architectures Inevitably Drift - Especially Under AI

Most embedded and unit-test notations use a step-based test case structure:

TestCase {
  Stimulus Step
  Expected Result
  Stimulus Step
  Expected Result
  ...
}

This structurally binds Stimulation and Intent. 

Consequences

  • A small timing change → dozens of expected values must be updated 
  • Hysteresis adjustments → tolerances widened 
  • Refactoring → cascaded expectation edits 
  • Unclear transitions → generalized assertions (“ANY”, wide ranges) 
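
The cascade is easy to see in code. Below is a sketch of a step-coupled test against the hypothetical filter SUT from Section 1: the expected values are frozen copies of implementation behavior, so any change to the filter's coefficient or timing invalidates every one of them, even though no requirement changed.

#include <math.h>
#include <stdio.h>

/* Step-coupled test case (hypothetical values, reusing the first-order
 * filter SUT from the earlier sketch): stimulus and expectation live in
 * the same record. Change the filter coefficient from 0.9 to 0.85 and
 * every 'expected' value below must be re-derived by hand. */
typedef struct { double stimulus; double expected; } Step;

int main(void) {
    const Step tc[] = { {0.0, 0.000}, {1.0, 0.100}, {2.0, 0.290},
                        {3.0, 0.561}, {4.0, 0.905}, {5.0, 1.314} };
    double state = 0.0;
    for (int k = 0; k < 6; k++) {
        state = 0.9 * state + 0.1 * tc[k].stimulus;        /* inlined SUT */
        printf("step %d: %s\n", k,
               fabs(state - tc[k].expected) <= 0.001 ? "ok" : "FAIL");
    }
    return 0;
}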

With AI-driven generation, the pattern gets worse:

  • AI rewrites expectations to match observed behavior
  • AI broadens thresholds to “stabilize” test performance
  • AI generalizes away edge cases (“patterns” learned from data)
  • AI aligns stimuli and expectations for internal consistency, not truth

Because AI optimizes for consistency, not semantics, drift becomes amplified. Explainable AI helps understand why the model made a choice — but it cannot determine whether the choice matches functional truth. Explainability increases transparency — but it cannot compensate for an architecture that couples fast-changing stimulation with slow-changing intent.

5. The 3-Layer Architecture: The Only Scalable Foundation for AI-Based Testing

To eliminate drift, the architecture must separate responsibilities into three layers:

Layer 1 — Stimulation
  • Implementation-coupled
  • High change rate
  • Coverage-oriented
  • AI-friendly

Examples:

  • TASMO-generated sequences
  • Search-based inputs
  • Random exploration
  • Boundary scanning
  • Context sequencing

Layer 2 — Intent
  • Requirement-coupled
  • Low change rate
  • Stable invariants
  • Explainable, audit-ready

Examples:

  • Functional invariants
  • State-machine correctness rules
  • Timing and hysteresis constraints
  • Safety conditions

Intent must be scoped to one requirement or invariant at a time, enabling explainability and correctness.

Layer 3 — Logic (System Under Test Execution)
  • Pure execution behavior
  • No embedded expected values
  • No test logic

Key Principle: Changes in one layer must not force updates in another. That is what makes drift impossible.
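
A minimal end-to-end sketch of the three layers in C. Everything here is hypothetical (the filter SUT, the square-wave stimulation, the REQ-BOUND-01 range invariant), but the structure is the point: each layer is a separate unit, and the harness only wires them together.

#include <stdbool.h>
#include <stdio.h>

/* --- Layer 3: Logic -- pure SUT execution, no expected values --------- */
static double flt_state = 0.0;
static double sut_step(double input) {
    flt_state = 0.9 * flt_state + 0.1 * input;
    return flt_state;
}

/* --- Layer 2: Intent -- one invariant per requirement (REQ-BOUND-01) -- */
/* Hypothetical requirement: the output stays inside the valid range [0, 5]. */
static bool inv_output_valid(double output) {
    return output >= 0.0 && output <= 5.0;
}

/* --- Layer 1: Stimulation -- freely generated, coverage-oriented ------ */
static double stimulus(int k) {
    return (k % 20 < 10) ? 5.0 : 0.0;      /* square wave, e.g. AI-generated */
}

int main(void) {
    /* Harness: swap in any stimulation without touching the invariant;
     * tighten the invariant without touching the stimulation. */
    for (int k = 0; k < 100; k++) {
        double out = sut_step(stimulus(k));
        if (!inv_output_valid(out))
            printf("REQ-BOUND-01 violated at step %d\n", k);
    }
    printf("trace checked\n");
    return 0;
}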

6. Why the 3-Layer Architecture Makes AI Testing Safe

Once stimulation, intent, and logic are decoupled, AI-generated stimulation becomes safe. The worst cases that can occur are:

  • Some tests don’t execute
  • Some explore irrelevant space

But defects cannot be hidden. AI-generated intent becomes traceable because each intent definition corresponds to a single requirement or invariant. Explainable AI can show why the invariant fired or did not fire — because the invariant is independent of the stimuli.

False Positives Drop Dramatically

Weak expected values cannot hide behind step-based coupling.

False Negatives Become Cheap

Fixing one invariant fixes all stimulation scenarios at once.

Costs Scale Linearly

Stimulation complexity does not multiply expected-value complexity.
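
As a back-of-the-envelope sketch (hypothetical SUT and invariant as before): one invariant is evaluated against arbitrarily many generated stimulation sequences, so adding a thousand sequences adds execution time, but zero new expected values.

#include <stdio.h>
#include <stdlib.h>

/* One invariant, many generated stimulation sequences: coverage grows,
 * expected-value effort stays constant. SUT and invariant are the same
 * hypothetical ones as in the previous sketch. */
static int inv_output_valid(double out) { return out >= 0.0 && out <= 5.0; }

int main(void) {
    int violations = 0;
    for (int seq = 0; seq < 1000; seq++) {          /* random/AI exploration */
        double state = 0.0;
        for (int k = 0; k < 200; k++) {
            double in = 5.0 * rand() / (double)RAND_MAX;   /* input in [0, 5] */
            state = 0.9 * state + 0.1 * in;                /* inlined SUT     */
            if (!inv_output_valid(state)) violations++;
        }
    }
    printf("1000 sequences checked, %d violations\n", violations);
    return 0;
}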

7. Best Practices for Safe, Scalable AI Test Generation

To make AI a quality amplifier instead of a drift amplifier:

  1. Generate Stimulation and Intent Separately - Never in the same prompt. Never in the same artifact.
  2. Allow AI to Explore Stimulation Broadly - Coverage, variability, stress, sequence exploration.
  3. Constrain AI-generated Intent - To one requirement or invariant per definition.
  4. Use Invariant-based Notations - Structured, explainable, reusable.
  5. Apply Architecture as the Primary Safety Barrier

Reviews and explainability help — but cannot prevent drift in a coupled system. Only architectural separation can.
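
One way to operationalize practices 3 and 4, sketched with invented requirement IDs and checker names: a registry that binds each invariant to exactly one requirement, so every verdict is traceable in review.

#include <stdbool.h>
#include <stdio.h>

/* Sketch of an invariant registry with invented names: each intent
 * definition is scoped to exactly one requirement, which keeps every
 * verdict traceable and audit-ready. */
typedef struct { double in, out; } Sample;
typedef bool (*Invariant)(const Sample *, int);

static bool inv_output_valid(const Sample *t, int n) {
    for (int k = 0; k < n; k++)                    /* REQ-BOUND-01 */
        if (t[k].out < 0.0 || t[k].out > 5.0) return false;
    return true;
}

static bool inv_monotone_rise(const Sample *t, int n) {
    for (int k = 1; k < n; k++)                    /* REQ-RISE-01 */
        if (t[k].in >= 5.0 && t[k].out < t[k - 1].out) return false;
    return true;
}

typedef struct { const char *req_id; Invariant check; } IntentDef;

int main(void) {
    const Sample    trace[]   = { {5.0, 0.5}, {5.0, 0.95}, {5.0, 1.355} };
    const IntentDef intents[] = {
        { "REQ-BOUND-01", inv_output_valid  },
        { "REQ-RISE-01",  inv_monotone_rise },
    };
    for (int i = 0; i < 2; i++)
        printf("%s: %s\n", intents[i].req_id,
               intents[i].check(trace, 3) ? "holds" : "violated");
    return 0;
}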

8. The Strategic Conclusion

“AI does not protect a broken testing architecture. It accelerates the consequences.”

If your test system binds Stimulation, Intent, and Logic into a single artifact, then AI will accelerate drift, multiply hidden false positives, blur tolerances, and increase late-stage debugging cost.

But if you adopt the 3-Layer Architecture:

  • AI becomes a force multiplier for coverage
  • False positives become rare
  • Drift becomes structurally impossible
  • Explainability becomes meaningful
  • Quality scales predictably

This is the fundamental fork in the road for modern software testing.


Looking for deterministic test execution alongside your AI workflows?
Explore how TPT’s robust architecture keeps test logic separated to ensure reliable, reproducible results across your full verification stack.
