Characterization test
from Wikipedia

In computer programming, a characterization test (also known as Golden Master Testing[1]) is a means to describe (characterize) the actual behavior of an existing piece of software, and therefore protect existing behavior of legacy code against unintended changes via automated testing. This term was coined by Michael Feathers.[2]

Overview


The goal of characterization tests is to help developers verify that the modifications made to a reference version of a software system did not modify its behavior in unwanted or undesirable ways. They enable, and provide a safety net for, extending and refactoring code that does not have adequate unit tests.

In James Bach's and Michael Bolton's classification of test oracles,[3] this kind of testing corresponds to the historical oracle. In contrast to the usual approach of assertions-based software testing, the outcome of the test is not determined by individual values or properties (that are checked with assertions), but by comparing a complex result of the tested software process as a whole with the result of the same process in a previous version of the software. In a sense, characterization testing inverts traditional testing: traditional tests check that individual properties have certain values (whitelisting them), whereas characterization testing checks that no properties have changed (blacklisting changes).

When creating a characterization test, one must observe what outputs occur for a given set of inputs. Given an observation that the legacy code gives a certain output based on given inputs, then a test can be written that asserts that the output of the legacy code matches the observed result for the given inputs. For example, if one observes that f(3.14) == 42, then this could be created as a characterization test. Then, after modifications to the system, the test can determine if the modifications caused changes in the results when given the same inputs.
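A minimal sketch of such a test in Python, assuming a hypothetical legacy function f whose observed output for the input 3.14 happens to be 42:

```python
import unittest

# Hypothetical stand-in for the legacy routine behind the f(3.14) == 42
# observation above; the real system's logic would live here instead.
def f(x):
    return round(x * 13.37)

class CharacterizationTest(unittest.TestCase):
    def test_f_with_observed_input(self):
        # The expected value is whatever the legacy code was observed to
        # return for this input, not what a specification says it should be.
        self.assertEqual(f(3.14), 42)

if __name__ == "__main__":
    unittest.main()
```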

Unfortunately, as with any testing, it is generally not possible to create a characterization test for every possible input and output. As such, many people opt for either statement or branch coverage. However, even this can be difficult. Test writers must use their judgment to decide how much testing is appropriate. It is often sufficient to write characterization tests that only cover the specific inputs and outputs that are known to occur, paying special attention to edge cases.

Unlike regression tests, to which they are very similar, characterization tests do not verify the correct behavior of the code, which can be impossible to determine. Instead they verify the behavior that was observed when they were written. Often no specification or test suite is available, leaving only characterization tests as an option, since the conservative path is to assume that the old behavior is the required behavior. Characterization tests are, essentially, change detectors. It is up to the person analyzing the results to determine if the detected change was expected and/or desirable, or unexpected and/or undesirable.

One of the interesting aspects of characterization tests is that, since they are based on existing code, it's possible to generate some characterization tests automatically. An automated characterization test tool will exercise existing code with a wide range of relevant and/or random input values, record the output values (or state changes) and generate a set of characterization tests. When the generated tests are executed against a new version of the code, they will produce one or more failures/warnings if that version of the code has been modified in a way that changes a previously established behavior.
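A rough sketch of this idea in Python follows; the helper generate_characterization_tests and the legacy_discount function are hypothetical, and a real tool would persist the generated tests rather than print them:

```python
import random

def generate_characterization_tests(func, sample_inputs, test_name="test_generated"):
    """Run `func` on the given inputs, record its current outputs, and emit
    source code for a test that pins those outputs down (hypothetical helper)."""
    observations = [(args, func(*args)) for args in sample_inputs]
    lines = [f"def {test_name}():"]
    for args, result in observations:
        lines.append(f"    assert {func.__name__}{args!r} == {result!r}")
    return "\n".join(lines)

# Legacy function to exercise with a reproducible set of random inputs.
def legacy_discount(price, qty):
    return round(price * qty * (0.9 if qty > 10 else 1.0), 2)

random.seed(0)
inputs = [(round(random.uniform(1, 100), 2), random.randint(1, 20)) for _ in range(3)]
print(generate_characterization_tests(legacy_discount, inputs))
```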

When testing on the GUI level, characterization testing can be combined with intelligent monkey testing to create complex test cases that capture use cases and special cases thereof.

Advantages


Golden Master testing has the following advantages over the traditional assertions-based software testing:

  • It is relatively easy to implement for complex legacy systems.
  • As such, it enables refactoring of code that lacks adequate unit tests.
  • It is generally a sensible approach for complex results such as PDFs, XML, or images, where checking every relevant attribute with assertions would be impractical due to the sheer number of attributes and would result in unreadable, unmaintainable test code.

Disadvantages


Golden Master testing has the following disadvantages over traditional assertions-based software testing:

  • It depends on repeatability. Volatile and non-deterministic values need to be masked or removed, both from the Golden Master and from the result of the process (a minimal masking sketch follows this list). If too many elements need to be removed, or removing them is too complex, Golden Master testing can become impractical.
  • It depends not only on the software itself being repeatable but also on the stability of the environment and input values.
  • Golden Master testing does not infer correctness of the results. It merely helps detect unwanted effects of software changes.
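The masking mentioned in the first point can be as simple as replacing volatile fragments with stable placeholders before comparison; a minimal Python sketch, assuming hypothetical timestamp and session-ID formats:

```python
import re

# Assumed volatile patterns; a real system would add its own (GUIDs, ports, ...).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
SESSION_ID = re.compile(r"session=[0-9a-f]+")

def scrub(text):
    """Replace non-deterministic fragments with stable placeholders so that
    two runs of the same process produce comparable output."""
    text = TIMESTAMP.sub("<timestamp>", text)
    return SESSION_ID.sub("session=<id>", text)

# Two runs that differ only in volatile values compare equal after scrubbing.
run_1 = "report generated 2024-05-01T09:12:44 session=ab12cd"
run_2 = "report generated 2024-05-02T16:03:10 session=ff00aa"
assert scrub(run_1) == scrub(run_2)
```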

References

from Grokipedia
A characterization test is a technique that documents and verifies the actual current behavior of existing code, rather than specifying its intended or desired behavior, enabling developers to refactor legacy systems safely without introducing unintended changes. The concept was introduced by Michael Feathers in his influential 2004 book Working Effectively with Legacy Code, where it serves as a foundational strategy for handling untested or poorly documented codebases that lack traditional unit tests. Unlike specification tests, which define expected outcomes based on requirements, characterization tests act as a "safety net" by capturing empirical outputs from the code under various inputs, thus providing a baseline for future modifications. Key aspects of characterization testing include its application to legacy code—defined by Feathers as any code without tests—where it helps identify dependencies, reveal hidden behaviors, and support incremental improvements like extraction of functionality or bug fixes. Developers typically write these tests in a test harness using a standard unit testing framework, asserting against observed results, and may employ coverage analyzers to ensure comprehensive path coverage. Benefits encompass reduced risk during refactoring, enhanced understanding of complex systems, and facilitation of retrofits, though they require careful validation to avoid perpetuating flaws in the original code. In practice, heuristics guide their use: testing targeted areas of change, verifying extractions case-by-case, and confirming overall functionality post-modification.

Introduction

Definition

A characterization test is a technique designed to document and capture the actual current behavior of existing code, focusing on observed outputs for given inputs rather than verifying against expected or ideal specifications. This approach establishes a baseline by writing assertions that match the code's current execution results, effectively "approving" them as the accepted standard for that behavior. Coined by Michael C. Feathers in his seminal 2004 book Working Effectively with Legacy Code, the technique is particularly applied to legacy systems, where the original intent may be unclear, providing a safety net for subsequent modifications. Known by synonyms such as Golden Master Testing and Approval Testing, characterization tests emphasize empirical observation over prescriptive validation. The key principle involves running the code under test conditions, recording the outputs, and incorporating them into the test suite as the "golden" reference; any deviation in future runs signals a potential regression or change that requires review. In contrast to traditional unit tests, which often use a white-box method to inspect and confirm internal logic aligns with design specifications, characterization tests operate on a black-box basis, treating the code as an opaque component and prioritizing externally observable behavior over implementation details.
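A minimal sketch of that record-and-compare cycle in Python, assuming a hypothetical render_invoice process and a local golden-master file:

```python
from pathlib import Path

def verify_against_golden_master(actual: str, golden_path: str = "output.approved.txt"):
    """First run: record the observed output as the golden master.
    Later runs: fail if the output no longer matches the recorded reference."""
    golden = Path(golden_path)
    if not golden.exists():
        golden.write_text(actual)  # approve the currently observed behavior
        return
    if actual != golden.read_text():
        raise AssertionError(f"output differs from golden master in {golden_path}")

# Hypothetical legacy process treated as a black box: only its output is compared.
def render_invoice(items):
    return "\n".join(f"{name}: {price:.2f}" for name, price in items)

verify_against_golden_master(render_invoice([("widget", 9.5), ("gadget", 20)]))
```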

Purpose

Characterization tests serve as a critical mechanism to safeguard the behavior of undocumented or legacy software from unintended alterations during refactoring, updates, or maintenance activities. By capturing and verifying the current outputs of the system under various inputs, these tests establish a reliable baseline that ensures modifications do not disrupt established functionality. This approach is particularly valuable in environments where original requirements are lost or unclear, allowing developers to proceed with changes confidently while preserving the software's operational integrity.

In the context of test-driven development applied to untested codebases, characterization tests enable an incremental testing strategy by first documenting the existing state of the code. This initial characterization acts as a foundation, permitting subsequent alterations—such as adding new features or optimizing performance—without introducing regressions, as any deviation from the captured behavior triggers immediate feedback. As Michael Feathers describes, the core purpose is "to document your system's actual behavior, not check for the behavior you wish your system had," thereby facilitating a shift from untested legacy components to a more robust, testable structure.

On a broader scale, the objective of characterization tests is to mitigate risks associated with evolving complex software systems by creating a reproducible and verifiable record of current behaviors. This baseline not only supports ongoing development but also aids in auditing and compliance efforts, where consistency with prior states is essential. In scenarios involving "black-box" code—where internal logic is opaque and requirements are unknown—these tests ensure that production functionality remains intact, even as the system undergoes necessary evolution.

Background

Origins in Software Testing

Characterization tests emerged in the early 2000s amid the rise of agile methodologies and test-driven development (TDD), which emphasized iterative development and the need for reliable feedback on code changes in existing systems. These tests addressed a gap in traditional TDD practices, where writing tests before code was challenging for untested legacy systems, by instead capturing and documenting current behavior to enable safe refactoring.

A pivotal milestone came in 2004 with Michael Feathers' book Working Effectively with Legacy Code, which formalized characterization testing as a technique for adding automated tests to untested codebases by asserting against observed outputs rather than preconceived expectations. Feathers described these tests as tools to "characterize" actual system behavior, revealing bugs or inconsistencies during the process while providing a baseline for future modifications. This approach quickly gained traction in agile communities for its practicality in real-world scenarios involving brownfield projects.

Characterization testing subsequently evolved through integration with approval testing frameworks, which streamlined the capture and comparison of complex outputs like data structures or UI renders against approved "golden" files. Created by Llewelyn Falco, these frameworks, such as ApprovalTests, extended the technique by automating the approval workflow, making it more accessible for diverse languages and reducing manual effort in verifying behavioral snapshots. This period saw broader adoption as part of continuous integration pipelines, emphasizing regression prevention in evolving codebases.

By 2025, advancements in empirical characterization testing introduced data-driven validation methods, focusing on gathering observable evidence from legacy code to build robust test suites post-development. Mark Seemann's blog series highlighted techniques for empirical test-after practices, such as iteratively refining tests based on runtime evidence to enhance reliability without upfront specifications. These developments underscore the technique's maturation toward evidence-based software maintenance.

Characterization tests draw roots from regression testing, which verifies that code changes do not break existing functionality, but adapt the concept for behavioral documentation over strict specification enforcement. Unlike traditional regression tests that assume predefined correct behaviors, characterization tests prioritize capturing as-is outputs to establish a verifiable status quo, facilitating safer evolution of untested systems.

Relation to Legacy Code

Legacy code refers to untested and often poorly documented software systems whose behaviors are not well understood, thereby posing significant risks during modifications as changes may inadvertently alter expected outputs or introduce defects. Characterization tests mitigate these risks by systematically capturing and asserting the current outputs of legacy code, effectively "approving" its existing behavior as a reference point for future changes. This enables safer incremental refactoring, such as through the Strangler Application pattern, where developers can gradually replace legacy components with new implementations without necessitating a full system rewrite, thereby reducing the scope and cost of maintenance efforts.

These tests complement other legacy code strategies, including the use of seams—specific insertion points in the code that allow observation or alteration of behavior without source modifications—to create testable boundaries and manage dependencies. By serving as a foundational safety net, characterization tests counteract the inherent risk that modifications to untested code frequently introduce bugs, facilitating evolutionary development where systems are iteratively improved rather than rewritten wholesale.

In the context of 2025 enterprise software landscapes, where AI-assisted codebases are proliferating and exacerbating legacy maintenance challenges, characterization tests gain heightened relevance by providing essential behavioral baselines to ensure stability amid rapid technological integrations.

Methodology

Steps for Implementation

Implementing characterization tests involves a systematic process for capturing and preserving the current, often undocumented, behavior of existing code, particularly in legacy systems without prior automated tests. This technique, introduced by Michael Feathers, enables developers to establish a baseline for refactoring while minimizing the risk of unintended changes. The process begins with identifying areas of code whose behavior must be preserved to support safe modifications.

The first step is to select a target code segment or function exhibiting unknown or unpredictable behavior. Developers observe typical inputs to the code, either through manual execution or by adding temporary logging to capture the actual outputs produced under current conditions. This empirical observation ensures the test reflects real-world usage without assuming correctness of the behavior.

Next, construct the test code to invoke the target function with the identified inputs and include assertions that verify the outputs match the previously captured results. For complex or non-deterministic outputs, such as formatted reports or data structures, employ techniques like string matching or approval-style comparisons to handle variability precisely. This step creates an automated check against the observed baseline.

Once written, execute the tests to confirm they pass, thereby validating that they accurately represent the existing behavior. During subsequent refactoring or modifications, rerun the tests continuously; their passage indicates that the core functionality remains intact, allowing developers to proceed confidently.

Finally, maintain the tests by updating assertions only when a deliberate behavioral change is intended, such as fixing a bug or enhancing features. Any test failure in this phase serves as a clear signal of a potential regression, prompting investigation before proceeding. This disciplined approach treats the tests as a protective mechanism for legacy behavior.

As a best practice, initiate characterization testing at the high-level integration layer, such as end-to-end scenarios, before progressing to finer-grained unit tests; this broader scope establishes overall system stability with fewer initial assumptions about internal dependencies.
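One common way to capture the baseline, sketched below with a hypothetical legacy_tax routine, is to write a deliberately wrong assertion, let the failure message reveal the actual value, and then paste that observed value into the test:

```python
# Step 1: assert a value you know is wrong, just to see what the code does.
#     assert legacy_tax(order_total=99.0) == 0        # deliberately wrong
# Step 2: run it; the failure message reveals the actual value, e.g.
#     AssertionError: assert 7.92 == 0
# Step 3: paste the observed value back in, turning the observation into a test.

def legacy_tax(order_total):
    # Stand-in for an untested legacy routine whose rules are undocumented.
    return round(order_total * 0.08, 2)

def test_tax_for_small_order():
    assert legacy_tax(order_total=99.0) == 7.92
```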

Tools and Frameworks

ApprovalTest libraries provide a foundational approach to characterization testing by automating the comparison of actual outputs against approved "golden master" files, often using file-based diff tools for visualization. These libraries, such as ApprovalTests for C++, Java, and .NET, enable developers to capture complex outputs like strings, collections, or even images and verify them against baselines without manual assertions for each element. The process involves generating a received file from the code under test and comparing it to an approved file; if they differ, integrated reporters launch diff tools like Beyond Compare or VS Code for review. This file-based mechanism is particularly effective for legacy code where outputs are unpredictable or voluminous, as it supports scrubbing sensitive data and handling non-deterministic elements through configurable strategies.

Snapshot testing frameworks extend similar principles to dynamic environments, capturing and serializing outputs such as API responses or UI renders for regression verification. In JavaScript, Jest's built-in snapshot testing allows tests to match component outputs against stored snapshots, updating them manually upon intentional changes. For Swift development, the SnapshotTesting library supports a wide range of strategies, including image diffs for views and text diffs for models, making it suitable for iOS app development where visual fidelity is key. These frameworks emphasize ease of adoption by integrating natively with existing test runners, though they require careful management of snapshot files to avoid bloat in version control.

Characterization testing often embeds within established unit test runners via extensions, enhancing compatibility without overhauling workflows. For Java, ApprovalTests integrates seamlessly with JUnit 3, 4, and 5 through simple annotations like @UseApprovalTesting, allowing golden master assertions alongside traditional tests. In Python, the pytest-approval plugin extends pytest by providing approval fixtures and diff tool hooks, such as integration with PyCharm's built-in comparator. Similarly, for .NET, ApprovalTests.Net works with the common test frameworks via attributes that automate file comparisons, supporting parallel execution and custom reporters. These integrations ensure characterization tests run in continuous integration pipelines with minimal configuration, leveraging the runners' discovery and reporting features.

As of 2025, emerging IDE plugins are streamlining characterization testing by automating baseline generation from runtime behaviors. The ApprovalTests Support plugin for IntelliJ IDEA adds context menu actions for resolving failed approvals directly in the editor, such as viewing diffs or updating baselines, reducing manual intervention for Java and Kotlin projects. Additionally, tools like UnitTestBot leverage code analysis to suggest and generate characterization-style tests from inferred behaviors, including runtime traces for empirical baselines in unsupported legacy modules. These plugins prioritize developer ergonomics, with features like auto-tracing execution paths to create initial snapshots without explicit input specification.

When selecting tools, developers must weigh file-based versus inline approvals based on output scale and maintainability. File-based approvals, common in ApprovalTests and Jest, excel for large or binary data by storing snapshots externally, facilitating visual diffs but risking repository clutter if not versioned properly. Inline approvals, supported in libraries like SnapshotTesting, embed expected values directly in code for simpler reviews and easier refactoring, though they become unwieldy for expansive outputs like full API responses. Compatibility with diff tools and file formats remains crucial for cross-platform teams.
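As a hedged illustration of wiring file-based approvals into a test runner, the sketch below uses a hand-rolled pytest fixture (not the API of pytest-approval or ApprovalTests) that writes a <test name>.approved.txt file on the first run and compares against it afterwards:

```python
from pathlib import Path
import pytest

@pytest.fixture
def approve(request):
    """Minimal file-based approval helper: compares a value's repr against an
    .approved.txt file stored next to the test module."""
    def _approve(value):
        approved = Path(str(request.fspath)).parent / f"{request.node.name}.approved.txt"
        received = repr(value)
        if not approved.exists():
            approved.write_text(received)
            pytest.fail(f"no approved file yet; wrote {approved.name} for review")
        assert received == approved.read_text()
    return _approve

def build_summary():
    # Hypothetical code under test producing a structured result.
    return {"orders": 3, "total": 42.5}

def test_summary_is_unchanged(approve):
    approve(build_summary())
```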

Benefits and Limitations

Advantages

Characterization tests enable the rapid testing of untested legacy code by capturing its current behavior without requiring detailed upfront specifications or deep understanding of internal logic, often allowing developers to establish a safety net more quickly than traditional methods. This approach, as described by Michael Feathers, focuses on observing outputs for given inputs, making it particularly suitable for black-box systems such as APIs or user interfaces where internal implementation details are opaque or complex.

A key advantage is the provision of regression protection, serving as a safety harness during refactoring and modifications by detecting unintended behavioral changes early in the development process. By documenting the "as-is" state of the code, these tests promote greater confidence in making changes, enabling evolutionary improvements and incremental refactoring without the risks associated with large-scale rewrites.

Furthermore, characterization tests enhance cost-effectiveness in maintaining legacy systems by providing a clear, reproducible record of existing functionality. This supports broader adoption in environments with opaque or evolving components, fostering reliable development workflows without extensive initial documentation.

Disadvantages

Characterization tests, by design, capture and baseline the existing behavior of code without asserting its correctness, which can lead to the perpetuation of bugs if the initial outputs include defects that are not manually reviewed and addressed. This approach documents actual system behavior rather than verifying intended functionality, potentially embedding flaws into the test suite unless developers actively intervene to update or refine the baselines.

A significant maintenance overhead arises from the need to manually approve and update baselines whenever intentional changes are made to the code, particularly in fast-paced development environments where frequent modifications can turn these tests into a bottleneck. While characterization tests offer quick setup compared to traditional unit tests, this advantage is offset by the ongoing effort required to manage evolving outputs, especially for complex or large-scale systems.

These tests exhibit brittleness when applied to code involving non-deterministic elements, such as randomness, external dependencies, or time-sensitive operations, as they demand exact output matches that may vary across runs without appropriate mocking or isolation techniques. In such cases, failures are common, and workarounds like asymmetric matchers or seeded random number generators are often necessary, adding complexity to test maintenance.

Furthermore, characterization tests provide incomplete coverage by focusing solely on the behaviors observed during their creation, often overlooking edge cases, rare conditions, or unexercised code paths that were not part of the initial testing scope. This limitation means they serve as a starting point for understanding legacy systems but cannot replace comprehensive testing strategies to ensure robustness across all scenarios.

Applications and Examples

Use Cases

Characterization tests play a crucial role in legacy system modernization, where monolithic applications must be refactored without interrupting ongoing operations. These tests capture existing outputs to verify that refactoring preserves core functionalities during upgrades to cloud-native or modular architectures.

For API endpoint testing, characterization tests are effective in documenting response schemas for integrations with third-party services, especially when original specifications are outdated or unavailable. By generating assertions on observed outputs, developers can detect deviations during updates, ensuring seamless integration in distributed systems without relying on incomplete documentation.

In UI/UX validation for web applications, characterization tests often take the form of snapshot testing to record and compare rendered components, guaranteeing visual consistency across updates, devices, and browsers. This method is particularly beneficial for frontend-heavy applications, where subtle layout shifts could degrade the user experience; studies show snapshot testing reduces visual bugs by automating baseline comparisons, though it requires careful management of test maintenance.

Characterization tests also support microservices migration by safeguarding service behaviors during the decomposition of monolithic applications, allowing teams to extract and isolate components while confirming identical inputs and outputs. This approach aligns with incremental migration patterns, where legacy monoliths are gradually replaced, preventing regressions in service contracts and enabling scalable, independent deployments.
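For the API endpoint case, a characterization test can pin the response's keys and value types rather than its volatile values; a small Python sketch, with the hypothetical legacy_orders_view standing in for an HTTP call to the real service:

```python
import json

def legacy_orders_view(order_id):
    # Stand-in for a legacy endpoint handler with an undocumented schema;
    # in a real test this would be an HTTP request to the running service.
    return json.dumps({"id": order_id, "status": "submitted", "items": ["widget"]})

def test_orders_response_schema_is_unchanged():
    body = json.loads(legacy_orders_view(123))
    # Pin the keys and value types observed today, not the volatile values.
    assert set(body) == {"id", "status", "items"}
    assert isinstance(body["items"], list)
    assert body["status"] == "submitted"
```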

Practical Examples

A simple characterization test can be applied to a legacy function that processes strings, such as one that converts input to uppercase and appends exclamation marks. Consider the function def process(text): return text.upper() + '!!'. A test might assert that process("hello") yields "HELLO!!", capturing the current behavior to prevent unintended changes during refactoring.

For a more complex scenario, characterization tests can simulate API calls to legacy endpoints to verify response structures. For instance, a test for a submitAssignment endpoint might mock a database update and verify the response, such as { "status": "submitted", "id": 123 }, ensuring the endpoint's behavior remains consistent across refactors.

In refactoring demonstrations, characterization tests help verify behaviors during modifications without breaking surrounding code. For example, tests can be written to characterize a repository update method, sabotaging parts of the code to ensure assertions fail as expected, then confirming they pass after reverting, allowing safe refactoring of dependencies.

Handling output differences often involves approval testing tools that generate diffs for review. Using libraries like Verify in C# or similar in other languages, a test might capture serialized results from a validation function into a snapshot file; if a refactor alters the output (e.g., adding a new field to a result object), the tool launches a diff viewer to compare received vs. approved files, allowing developers to approve intentional changes.

For edge cases involving non-deterministic behavior, such as functions using randomness, characterization tests adapt by seeding the random generator before execution to ensure reproducible outputs. For example, with Python's random module, setting random.seed(42) prior to calling a function that shuffles a list allows the test to assert against a consistent shuffled result, like [3, 1, 4, 1, 5] for input [1, 3, 4, 1, 5], capturing the seeded behavior reliably.
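The first and last examples above might look like this in Python; shuffle_copy is a hypothetical helper, and rather than hard-coding one particular shuffled sequence the sketch pins reproducibility under a fixed seed:

```python
import random

def process(text):
    # Legacy string-processing function from the first example.
    return text.upper() + "!!"

def test_process_characterization():
    # Pins the behavior observed today: uppercase plus two exclamation marks.
    assert process("hello") == "HELLO!!"

def shuffle_copy(items):
    # Hypothetical legacy helper that relies on the global random module.
    shuffled = list(items)
    random.shuffle(shuffled)
    return shuffled

def test_shuffle_characterization_with_seed():
    random.seed(42)                       # fix the non-deterministic input
    first = shuffle_copy([1, 3, 4, 1, 5])
    random.seed(42)                       # same seed, so the same observed behavior
    assert shuffle_copy([1, 3, 4, 1, 5]) == first
```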
