ReDoS
from Wikipedia

A regular expression denial of service (ReDoS)[1] is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression and/or an input that takes a long time to evaluate. The attack exploits the fact that many[2] regular expression implementations have super-linear worst-case complexity; on certain regex-input pairs, the time taken can grow polynomially or exponentially in relation to the input size. An attacker can thus cause a program to spend substantial time by providing a specially crafted regular expression and/or input. The program will then slow down or become unresponsive.[3][4]

Description

Regular expression ("regex") matching can be done by building a finite-state automaton. Regex can be easily converted to nondeterministic automata (NFAs), in which for each state and input symbol, there may be several possible next states. After building the automaton, several possibilities exist:

  • the engine may convert it to a deterministic finite-state automaton (DFA) and run the input through the result;
  • the engine may try one by one all the possible paths until a match is found or all the paths are tried and fail ("backtracking");[5][6]
  • the engine may consider all possible paths through the nondeterministic automaton in parallel;
  • the engine may convert the nondeterministic automaton to a DFA lazily (i.e., on the fly, during the match).

Of the above algorithms, the first two are problematic. The first is problematic because a deterministic automaton could have up to 2^m states, where m is the number of states in the nondeterministic automaton; thus, the conversion from NFA to DFA may take exponential time. The second is problematic because a nondeterministic automaton could have an exponential number of paths of length n, so that walking through an input of length n will also take exponential time.[7] The last two algorithms, however, do not exhibit pathological behavior.

Note that for non-pathological regular expressions, the problematic algorithms are usually fast, and in practice one can expect them to "compile" a regex in O(m) time and match it in O(n) time; by contrast, simulation of an NFA and lazy computation of the DFA have O(m·n) worst-case complexity.[a] Regex denial of service occurs when these expectations are applied to regexes provided by the user, and a malicious regular expression triggers the worst-case complexity of the regex matcher.

While regex algorithms can be written in an efficient way, most regex engines in existence extend the regex language with additional constructs that cannot always be solved efficiently. Such extended patterns essentially force regex implementations in most programming languages to use backtracking.
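
For example, a backreference forces the engine to remember and re-match previously captured text, which no finite automaton can do. A minimal illustration in Python (whose re module is one such backtracking engine):

import re

# \1 must repeat exactly what group 1 captured, a feature beyond regular
# languages, so the pattern cannot be compiled to a finite automaton.
print(re.fullmatch(r'(\w+) \1', 'hey hey'))  # matches
print(re.fullmatch(r'(\w+) \1', 'hey you'))  # None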

Examples

Exponential backtracking

The most severe type of problem happens with backtracking regular expression matches, where some patterns have a runtime that is exponential in the length of the input string.[8] For strings of n characters, the runtime is O(2^n). This happens when a regular expression has three properties:

  • the regular expression applies repetition (+, *) to a subexpression;
  • the subexpression can match the same input in multiple ways, or the subexpression can match an input string which is a prefix of a longer possible match;
  • and after the repeated subexpression, there is an expression that matches something which the subexpression does not match.

The second condition is best explained with two examples:

  • in (a|a)+$, repetition is applied to the subexpression a|a, which can match a in two ways on each side of the alternation.
  • in (a+)*$, repetition is applied to the subexpression a+, which can match a or aa, etc.

In both of these examples we used $ to match the end of the string, satisfying the third condition, but it is also possible to use another character for this. For example (a|aa)*c has the same problematic structure.

All three of the above regular expressions will exhibit exponential runtime when applied to strings of the form a^n x, that is, n repetitions of a followed by an x. For example, if you try to match them against aaaaaaaaaaaaaaaaaaaaaaaax on a backtracking expression engine, it will take a very long time to complete, and the runtime will approximately double for each extra a before the x.
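
This doubling can be observed directly with a small timing experiment; the following sketch (illustrative, using Python's backtracking re module) times the first pattern against growing inputs:

import re
import time

evil = re.compile(r'(a|a)+$')  # exponential backtracking on failure
for n in range(18, 25):
    s = 'a' * n + 'x'  # n repetitions of 'a' followed by a non-matching 'x'
    start = time.perf_counter()
    evil.match(s)
    # The elapsed time roughly doubles with each extra 'a'.
    print(f'n={n}: {time.perf_counter() - start:.2f}s')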

It is also possible to have backtracking which is polynomial time O(n²), instead of exponential. This can also cause problems for long enough inputs, though less attention has been paid to this problem because malicious input must be much longer to have a significant effect. An example of such a pattern is a*b?a*c, when the input is an arbitrarily long sequence of "a"s.

Vulnerable regexes in online repositories

So-called "evil" or vulnerable regexes have been found in online regular expression repositories. Note that it is enough to find a vulnerable subexpression in order to attack the full regex:

  1. RegExLib, id=1757 (email validation) – the vulnerable part is the starred group (([\-.]|[_]+)?([a-zA-Z0-9]+))*
    ^([a-zA-Z0-9])(([\-.]|[_]+)?([a-zA-Z0-9]+))*(@){1}[a-z0-9]+[.]{1}(([a-z]{2,3})|([a-z]{2,3}[.]{1}[a-z]{2,3}))$
  2. OWASP Validation Regex Repository, Java Classname – the vulnerable part is the repeated group (([a-z])+.)+
    ^(([a-z])+.)+[A-Z]([a-z])+$

These two examples are also susceptible to the input aaaaaaaaaaaaaaaaaaaaaaaa!.
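
This susceptibility can be checked with a short, illustrative timing loop (Python shown; any backtracking engine behaves similarly):

import re
import time

# The Java-classname pattern from example 2 above.
classname = re.compile(r'^(([a-z])+.)+[A-Z]([a-z])+$')
for n in range(20, 29):
    start = time.perf_counter()
    classname.match('a' * n + '!')
    # The runtime grows exponentially with n.
    print(f'n={n}: {time.perf_counter() - start:.2f}s')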

Attacks

If the regex itself is affected by user input, such as a web service permitting clients to provide a search pattern, then an attacker can inject a malicious regex to consume the server's resources. Therefore, in most cases, regular expression denial of service can be avoided by removing the possibility for the user to execute arbitrary patterns on the server. In this case, web applications and databases are the main vulnerable applications. Alternatively, a malicious page could hang the user's web browser or cause it to use arbitrary amounts of memory.

However, if a vulnerable regex exists on the server-side already, then an attacker may instead be able to provide an input that triggers its worst-case behavior. In this case, e-mail scanners and intrusion detection systems could also be vulnerable.

In the case of a web application, the programmer may use the same regular expression to validate input on both the client and the server side of the system. An attacker could inspect the client code, looking for evil regular expressions, and send crafted input directly to the web server in order to hang it.[9]

Mitigation

ReDoS can be mitigated without changes to the regular expression engine, simply by setting a time limit for the execution of regular expressions when untrusted input is involved.[10]

ReDoS can be avoided entirely by using a non-vulnerable regular expression implementation. After Cloudflare's web application firewall (WAF) was brought down by a PCRE ReDoS in 2019, the company rewrote its WAF to use the non-backtracking Rust regex library, which uses an algorithm similar to RE2.[11][12]

Vulnerable regular expressions can be detected programmatically by a linter.[13] Methods range from pure static analysis[14][15] to fuzzing.[16] In most cases, the problematic regular expressions can be rewritten as "non-evil" patterns. For example, (.*a)+ can be rewritten to ([^a]*a)+. Possessive matching and atomic grouping, which disable backtracking for parts of the expression,[17] can also be used to "pacify" vulnerable parts.[18][19]
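
The effect of such a rewrite can be measured directly. A minimal sketch in Python (timings are illustrative; the atomic-group variant requires Python 3.11 or later):

import re
import time

def time_fullmatch(pattern, text):
    start = time.perf_counter()
    re.fullmatch(pattern, text)
    return time.perf_counter() - start

text = 'a' * 25 + 'b'  # forces the match to fail
print(time_fullmatch(r'(.*a)+', text))      # ambiguous: very slow
print(time_fullmatch(r'([^a]*a)+', text))   # rewritten: effectively instant
print(time_fullmatch(r'((?>.*a))+', text))  # atomic group "pacifies" the original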

Linear-time (finite automata) regex

While some regex libraries have no built-in defence against ReDoS attacks—such as the C++ standard library <regex>, the C POSIX library <regex.h>[20] and Boost's boost.regex, which use backtracking and can take exponential time—other regex libraries are engineered to prevent regex denial of service attacks. This is done using deterministic finite automata, which run in linear time relative to the input size.

Using the RE2 library by Google for C++:[21]

import <re2/re2.h>;

import std;

using std::string;
using re2::RE2;

int main(int argc, char* argv[]) {
    string text = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
    string pattern = "(a+)+$";
    bool match = RE2::FullMatch(text, pattern);
    std::println("Match result: {}", match);
}

Using the regex crate for Rust:[22]

use regex::Regex;

fn main() {
    // Regex::new() returns Result<Regex, Error> and must be unwrapped
    let re: Regex = Regex::new(r"^(a+)+$").unwrap();
    let matches: bool = re.is_match("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!");
    println!("Match result: {}", matches);
}

Regex match timeout

Timeouts can be implemented to cancel regex tasks if they take too long.

package org.wikipedia.examples;

import java.util.concurrent.*;
import java.util.regex.*;

public class Example {
    public static boolean matchesWithTimeout(String regex, String input, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<Boolean> future = executor.submit(() -> {
            Pattern pattern = Pattern.compile(regex);
            Matcher matcher = pattern.matcher(input);
            return matcher.matches();
        });

        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.err.println("Regex evaluation timed out!");
            return false;
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
            return false;
        } finally {
            future.cancel(true); // Interrupt the worker thread; note that java.util.regex does not check the interrupt flag, so a runaway match may keep consuming CPU in the background
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String regex = "(a+)+$";
        String input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
        boolean result = matchesWithTimeout(regex, input, 100); // 100 ms timeout
        System.out.printf("Match result: %s%n", result);
    }
}

Timeouts are built into the .NET standard library: the class System.Text.RegularExpressions.Regex supports a MatchTimeout property.[23] The following is an example in C#:

namespace Wikipedia.Examples;

using System;
using System.Text.RegularExpressions;

public class Example
{
    static void Main(string[] args)
    {
        string pattern = @"(a+)+$";
        string input = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX";
    
        try
        {
            Regex regex = new(pattern, RegexOptions.None, TimeSpan.FromMilliseconds(100));
            bool match = regex.IsMatch(input);
            Console.WriteLine($"Match result: {match}");
        }
        catch (RegexMatchTimeoutException ex)
        {
            Console.WriteLine($"Regex operation timed out! {ex.Message}");
        }
    }
}

from Grokipedia
Regular expression denial of service (ReDoS) is a form of algorithmic complexity attack that exploits vulnerabilities in the implementation of regular expressions, causing software to consume excessive computational resources—often exponentially more than expected—and leading to denial-of-service conditions such as system hangs or crashes. This vulnerability arises primarily from backtracking mechanisms in nondeterministic finite automaton (NFA) regex engines, which are common in languages and platforms such as Python and .NET, where certain input strings trigger an explosion of evaluation paths. Attackers craft short, malicious inputs to force this behavior, making ReDoS a low-bandwidth attack that can disrupt web applications, servers, and even client-side processing without requiring large payloads. The broader concept of algorithmic denial-of-service attacks traces back to 2003, while ReDoS specifically was first formally outlined in 2009 at the OWASP Israel Conference, where researchers highlighted its practical risks in web applications.

In practice, ReDoS manifests through catastrophic backtracking, where nested quantifiers or overlapping alternatives in a regex pattern—such as (a+)+$—lead to repeated failures and retries for inputs like repeated 'a's followed by a non-matching character (e.g., "aaaaaaaaaaaaaaaa!"). For instance, a seemingly simple pattern like ^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*\.)+[a-zA-Z]{2,9})$ for email validation can take seconds or hours to process crafted strings due to the engine exploring thousands of paths.

As of 2018, empirical studies revealed ReDoS as a widespread issue, with super-linear regex patterns appearing in thousands of software modules across ecosystems like JavaScript (e.g., npm) and Python (e.g., PyPI), affecting over 10,000 packages and potentially enabling attacks on diverse applications from web servers to input validators. More recent analyses, as of 2024, show continued widespread impact, with a 143% increase in reported ReDoS vulnerabilities in the npm ecosystem and approximately 400 related CVEs since 2014. The impact includes resource exhaustion at multiple layers—client-side (e.g., browser hangs), server-side (e.g., CPU spikes in Node.js), and even in security tools like firewalls—often without detection until exploited.

Mitigation strategies emphasize avoiding vulnerable patterns, such as nested repetitions or ambiguous quantifiers, and adopting safer engines based on deterministic finite automata (DFA) where possible, alongside techniques like input length limits, timeouts, and automated tools to test for risks. Despite advances in detection tools, developers often revise regexes manually rather than overhaul parsing logic, underscoring ongoing challenges in preventing ReDoS at scale.

Fundamentals

Definition

Regular expressions, often abbreviated as regex, are formal patterns used in programming languages and tools to match, search, and manipulate strings of text through operations like concatenation, alternation, and repetition. These patterns enable efficient text processing in applications such as input validation, data extraction, and search functionality across various software systems.

Regular expression denial of service (ReDoS) is an algorithmic complexity attack that exploits vulnerabilities in the processing of regular expressions by causing excessive computational resource consumption, effectively denying service to legitimate users. In ReDoS, attackers craft malicious inputs that trigger inefficient evaluation in regex engines, particularly those relying on backtracking, resulting in catastrophic slowdowns where processing time escalates exponentially with input size. This leads to worst-case time complexities as high as O(2^n), where n is the input length, in contrast to the typical linear or polynomial time for well-behaved patterns.

Unlike traditional denial-of-service attacks that overwhelm systems through high-volume traffic or resource flooding, ReDoS targets the inherent inefficiencies of regex implementations in specific application components, such as web servers or validators, allowing a single, small input to monopolize CPU cycles and halt operations. This makes ReDoS particularly insidious in software relying on user-supplied input for validation, as it exploits the gap between average-case efficiency and pathological worst-case behavior in backtracking-based engines.

Backtracking Mechanism

Backtracking regex engines simulate the execution of a non-deterministic finite automaton (NFA) through an input-directed depth-first search, allowing the engine to handle non-determinism in the pattern by exploring multiple matching paths. The process begins at the NFA's initial state, where the engine attempts to consume the input string symbol by symbol. For each state, if a direct match (e.g., a literal character) succeeds, it advances to the next state; however, when non-determinism arises—such as with alternatives (denoted by |) or quantifiers—the engine prioritizes one path (typically the left alternative or the maximum repetition) and proceeds recursively. If the chosen path reaches a dead end (i.e., no further match is possible for the remaining input), the engine backtracks to the most recent choice point, restores the previous state, and tries the next alternative, continuing until an accepting path is found or all possibilities are exhausted. This mechanism relies on maintaining a stack of states to enable backtracking, effectively turning the NFA simulation into a recursive traversal that retries failed branches. For instance, in handling unions, the engine first explores the initial alternative fully; upon failure, it unwinds the stack and shifts to the subsequent one. Similarly, for closures like Kleene stars (*), the engine recursively applies the body of the repetition, backtracking through iteration counts when the overall pattern cannot proceed. To avoid infinite loops from epsilon transitions (empty moves), many engines track visited states within a cycle before consuming input, ensuring progress.

Greedy and lazy quantifiers play a key role in shaping the backtracking paths by dictating the order of repetition attempts. A greedy quantifier, such as *, instructs the engine to match the maximum possible repetitions of its preceding element before advancing, committing to longer matches first and backing off one repetition at a time if later parts of the pattern fail. In contrast, a lazy quantifier, like *?, starts with the minimum (often zero) repetitions and incrementally increases them only when required for the overall match to succeed, reversing the sequence to favor shorter initial commitments. This difference in prioritization can significantly alter the depth and breadth of the search tree explored, though both ultimately rely on the same retry mechanism.

Mathematically, backtracking manifests as a search tree over the NFA's state space, where each internal node represents a partial match state with pending choices, and edges denote transitions triggered by input consumption or alternatives. The root is the starting state, with branches for each non-deterministic option (e.g., quantifier iterations from maximum downward for greedy cases), leading to leaves that are either accepting matches or failures. For patterns with overlapping ambiguities, such as nested repetitions or alternatives, the tree's node count can expand exponentially with the repetition bounds in the worst case, as the engine exhaustively enumerates path combinations before concluding. This growth underscores the non-linear complexity inherent in backtracking simulations of NFAs.

The backtracking approach traces its origins to Henry Spencer's public-domain regex library developed in the 1980s, which introduced this NFA-based recursive method as an alternative to earlier deterministic finite automaton (DFA) implementations. This library was integrated into Perl's regex engine around 1991, marking a pivotal adoption that emphasized flexibility for extended regex features over strict linear performance. The mechanism has since proliferated, forming the core of engines in languages like JavaScript, which enumerates NFA paths in priority order per the ECMAScript specification, and Python, whose re module employs a similar depth-first traversal for matching.
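
The recursive try-and-fall-back structure described above can be made concrete with a toy matcher. The following Python sketch (illustrative; it supports only literals, '.', and '*') mirrors the greedy-then-backtrack handling of closures:

def match_here(pattern, text):
    # Anchored at the left: does pattern match some prefix of text?
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == '*':
        return match_star(pattern[0], pattern[2:], text)
    if text and pattern[0] in ('.', text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c, rest, text):
    # Greedy: first consume as many occurrences of c as possible...
    i = 0
    while i < len(text) and c in ('.', text[i]):
        i += 1
    # ...then backtrack, giving back one character at a time and retrying.
    while i >= 0:
        if match_here(rest, text[i:]):
            return True
        i -= 1
    return False

print(match_here('ab*c', 'abbbc'))  # True
print(match_here('a.c', 'axc'))     # True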

Vulnerabilities

Common Patterns

Common patterns vulnerable to ReDoS typically involve syntactic constructs in regular expressions that introduce ambiguity, allowing regex engines to explore an exponential number of matching paths on certain inputs. These patterns exploit the nondeterministic nature of the expressions, where multiple ways to consume the input string lead to redundant computations during the matching process. Among the most prevalent vulnerable patterns are nested quantifiers, such as (a+)+, where inner repetitions can be grouped in various ways, creating overlapping possibilities for matching sequences of 'a' characters. Another common construct is optional groups with alternations, exemplified by (a|aa)+?, which permits non-greedy matching that branches into multiple partial matches for strings with repeated substrings. Overlapping subexpressions, such as quantified overlapping disjunctions like (\w|\d)+, further amplify this by allowing the engine to try numerous combinations of word characters and digits in ambiguous sequences. These patterns are characterized by their star height greater than one or quantified overlaps, which inherently support multiple valid parses of the input.

The ambiguity in these patterns causes path explosion because backtracking engines, which attempt all possible ways to advance through the expression, generate a combinatorial number of execution paths. For instance, in a pattern like (a|a)* matching a string of 'a's followed by a non-match, the engine may explore up to Θ(2^n) paths for an input of length n, as each position can be attributed to either branch of the alternation. This can be illustrated through a simplified pseudocode representation of a backtracking matcher:

function match(pattern, input, pos, state):
    if pos == len(input) and state is accepting:
        return true
    if no more transitions from state:
        return false
    for each possible transition from state:
        new_pos = pos + consumed_length(transition)
        if new_pos <= len(input) and match(pattern, input, new_pos, next_state(transition)):
            return true
    // Backtrack: try alternatives or rewind
    return false

In this recursive procedure, ambiguous patterns cause the function to branch repeatedly, with each level potentially doubling the number of calls, leading to exponential running time in the worst case.

Studies indicate that such vulnerable patterns are widespread in open-source codebases. For example, an analysis of nearly 4,000 Python projects on GitHub found that while 42% of projects use regular expressions, a significant portion incorporate potentially vulnerable ones, with broader surveys estimating 10-20% of regexes across ecosystems like JavaScript and Python as susceptible to super-linear behavior. In the npm ecosystem, approximately 1% of unique regexes exhibit super-linear complexity, yet they appear in over 13,000 modules, highlighting their prevalence despite low per-regex incidence. Similarly, examinations of other programs have revealed that about 20% of regexes are vulnerable.

Differences in regex engine implementations exacerbate these risks. PCRE, a widely used backtracking engine compatible with Perl syntax, fully supports features like nested quantifiers and alternations, enabling the path explosion described above but at the cost of potential exponential runtime. In contrast, RE2 employs a linear-time automata-based approach without backtracking, rejecting constructs like backreferences in order to guarantee O(n) matching time, though this limits its expressiveness compared to PCRE.

Exploitation Examples

A classic illustration of a ReDoS vulnerability involves the regular expression ^(a+)+$, which intends to match one or more repetitions of one or more 'a' characters at the start and end of a string. When tested against an input like "aaaa!", the backtracking engine explores numerous failure paths due to the nested quantifiers (one + inside another). For a smaller input such as "aa!", the trace begins with the engine greedily matching "aa" as a single a+ group within the outer (a+)+, then failing at the $ anchor upon encountering "!". It backtracks by reducing the inner a+ to "a" and attempting a second group with the remaining "a", but fails again at $. Further backtracking tries splitting into single "a" groups or other combinations, resulting in 5 total attempts; scaling to "aaaa!" expands this to over 15 attempts, demonstrating the exponential blowup in computation.

In practical code, such vulnerabilities often appear in input validation functions using nested or overlapping groups. For instance, an email validator employing the pattern /("[^"]*"|[^@])*@[^@]*/ can hang on crafted inputs with repeated quotes, as the engine backtracks across the alternatives to find non-matches before the @. The following snippet exemplifies this:

const emailRegex = /("[^"]*"|[^@])*@[^@]*/;

function isValidEmail(email) {
    return emailRegex.test(email);
}

// Vulnerable: this input triggers extensive backtracking,
// potentially delaying execution by seconds or more.
isValidEmail('""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""@example.com');

This pattern, drawn from real-world usage, highlights how seemingly simple validation logic exposes applications to denial-of-service via user-supplied strings.

Analysis of open-source repositories reveals the widespread presence of such vulnerable patterns. A study of 865 popular modules on npm identified 161 (19%) containing super-linear regexes susceptible to ReDoS, often in parsing or validation routines. More recent systematic reviews confirm ReDoS as the fourth most common server-side vulnerability in npm packages from 2014 to 2023, with approximately 400 related CVEs disclosed across ecosystems, underscoring persistent risks in dependency chains despite growing awareness. Tools like safe-regex and ReDoSHunter have been applied in scans to flag these, revealing nested quantifiers as a frequent culprit in project codebases.

In production environments, attackers deliver malicious inputs through common web channels to exploit these flaws. For example, crafted strings can be injected into HTTP headers like User-Agent or Content-Type, where modules such as ua-parser-js or charset process them with vulnerable regexes, causing server delays of up to minutes per request. Similarly, form submissions or request payloads targeting input sanitization—such as email fields in forms or message bodies—allow probes like repeated characters to trigger catastrophic backtracking; an analysis of 475 API-publishing domains found 6 vulnerable to such exploits, leading to patches in services including AWS. These vectors enable low-effort denial-of-service by overwhelming single-threaded runtimes like Node.js with minimal payload sizes, often as small as 38 characters.

Impacts

Performance Effects

Regular expression denial of service (ReDoS) exploits the backtracking mechanism in many regex engines, leading to exponential running time in the worst case. In backtracking implementations, the engine explores multiple possible matches by trying different combinations of quantifiers and alternatives, which can result in an explosion of computation paths. For patterns with nested quantifiers, such as (a+)+b matched against a long string of 'a's without a trailing 'b', the number of steps grows exponentially with the input length n. This can be modeled as T(n) ≈ c · 2^d, where c is a constant representing the base cost per step and d approximates the number of decision points (often proportional to n), yielding an overall O(2^n) complexity.

Benchmarks illustrate this impact on modest mid-2010s hardware. For instance, a 30-character invalid input to a vulnerable regex like ^(.*?,){n}P can take 1.8 seconds to process, spiking CPU usage to 99%, while a valid input completes in 0.05 seconds. Extending the input by one character often doubles the runtime, escalating to several seconds for 31 characters and potentially hours for longer strings due to billions of steps—for a 20-character input, one engine requires over 3.6 million steps, and for 100 characters, the theoretical number of steps exceeds 10^18. Memory consumption also rises from stack frames for each backtrack level, potentially leading to stack overflows in languages like pre-5.10 Perl. In event-loop runtimes such as Node.js, a single ReDoS-triggering input blocks the event loop, causing widespread delays and service unavailability as other requests queue up. Performance graphs typically show runtime versus input size as an exponential curve: near-linear for small n (e.g., milliseconds), but escalating sharply beyond a threshold (e.g., seconds to minutes at n = 20), contrasting sharply with safe patterns that remain flat.

In comparison, linear-time alternatives like Thompson's NFA construction avoid backtracking altogether, simulating the NFA by tracking sets of states as the input is processed once, achieving O(m·n) complexity where m is the pattern length. This makes NFA-based engines, such as RE2, immune to ReDoS while handling the same inputs in microseconds even for large n, orders of magnitude faster than vulnerable backtracking engines in worst cases.
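
A minimal sketch of this set-of-states simulation (illustrative; epsilon transitions are omitted, and the hand-built NFA below recognizes "one or more a's", the language of (a+)+):

def simulate(nfa, start, accepting, text):
    # Track the SET of reachable states and scan the input exactly once,
    # so the running time is O(states * input length) instead of exponential.
    states = {start}
    for ch in text:
        states = {t for s in states for t in nfa.get((s, ch), ())}
        if not states:
            return False  # no path can continue
    return bool(states & accepting)

# NFA for "one or more a's": state 0 --a--> 1, state 1 --a--> 1.
nfa = {(0, 'a'): {1}, (1, 'a'): {1}}
print(simulate(nfa, 0, {1}, 'a' * 100_000))        # True, and fast
print(simulate(nfa, 0, {1}, 'a' * 100_000 + '!'))  # False, still fast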

Real-World Incidents

One of the earliest prominent real-world ReDoS incidents occurred in July 2016, when Stack Overflow experienced a 34-minute outage affecting its entire platform. The disruption was triggered by a user-submitted post containing a malicious string that exploited a vulnerable regular expression designed to trim whitespace, specifically ^[\s\u200c]+|[\s\u200c]+$, leading to excessive backtracking and CPU exhaustion on the servers. This event highlighted the risks of processing untrusted user input with complex regex patterns in high-traffic web applications, resulting in temporary unavailability for millions of users worldwide.

In July 2019, Cloudflare encountered a major global outage lasting approximately 27 minutes, impacting a significant portion of the internet's traffic routed through its network. The incident stemmed from a newly deployed web application firewall (WAF) rule incorporating a regex with catastrophic backtracking behavior, such as .*.*=, which caused full CPU utilization on edge servers handling HTTP/HTTPS requests. This affected millions of websites and services, underscoring the potential for ReDoS to cascade into widespread downtime when introduced in critical infrastructure components like content delivery networks.

Following these high-profile cases, awareness of ReDoS has grown, with automated security scanners increasingly identifying vulnerabilities in open-source supply chains. Security analyses indicate ReDoS ranks as the fourth most common server-side vulnerability class in JavaScript ecosystems such as npm. Hundreds of CVEs have been assigned to ReDoS across various libraries and frameworks as of 2023, reflecting broader adoption of static analysis for regex auditing in software development pipelines.

In 2024, ReDoS vulnerabilities continued to emerge in AI-related tools, exemplified by CVE-2024-12720 in the Transformers library, versions up to 4.46.3. This flaw in the tokenizer's post_process_single() function in tokenization_nougat_fast.py allowed crafted inputs to trigger exponential backtracking, potentially causing denial-of-service in AI inference workflows. The vulnerability affected deployments relying on the library for document understanding tasks, emphasizing ReDoS risks in emerging AI supply chains. As of 2025, ReDoS remains prevalent, with ongoing reports of vulnerabilities in diverse software ecosystems.

Mitigation

Design Strategies

To prevent regular expression denial of service (ReDoS) vulnerabilities, developers should follow design principles that minimize or eliminate ambiguity in regex patterns, particularly by avoiding constructs that lead to exponential backtracking. Key strategies include using atomic groups, which prevent backtracking within a subpattern once it has matched, and possessive quantifiers, which apply greedy matching without allowing subsequent backtracking to release characters. These features, supported in engines like PCRE and Java's regex, ensure that quantified elements do not revisit prior matches, thereby bounding the execution time to linear in the input length for supported patterns. Additionally, nested unbounded repeats, such as (a+)+ or (a*)*, should be avoided, as they introduce ambiguity and catastrophic backtracking; these are identified as high-risk anti-patterns (such as "Star-3") in theoretical analyses of regex complexity.

Refactoring techniques transform ambiguous patterns into deterministic equivalents that execute in linear time without altering the intended matches. For instance, the nested repeat (a+)+ can be refactored to the simpler a+, eliminating the outer quantifier's potential for backtracking while preserving behavior for strings of 'a' characters. Similarly, patterns with overlapping concatenations, such as a*(aa)*, can be rewritten to a* to remove redundant ambiguity where multiple subpatterns generate overlapping languages. These refactors rely on equivalence checks via finite automata, ensuring no loss of expressiveness while preventing super-linear runtime; empirical studies show such revisions resolve over 70% of identified vulnerabilities in practice.

Selecting regex engines with guaranteed linear-time performance is a foundational choice for ReDoS prevention. Google's RE2 library implements an automata-based approach that evaluates alternatives in parallel without backtracking, achieving O(m·n) complexity where m is the regex size and n is the input length, and it explicitly omits features like backreferences that enable exponential behavior. Similarly, Rust's regex crate compiles patterns to a Thompson NFA with lazy DFA evaluation, providing worst-case O(m·n) guarantees for searches and supporting size limits on untrusted patterns to cap compilation overhead. These engines prioritize safety for production use, contrasting with backtracking-based libraries like PCRE, and are recommended for applications handling untrusted input.

Testing practices should be integrated into the software development lifecycle (SDLC) to verify regex security proactively. Developers should include unit tests with large, adversarial inputs—such as strings exceeding 1,000 characters designed to trigger backtracking—to detect slowdowns exceeding linear scaling, often using timeouts to fail slow tests. Input length limits, enforced before regex evaluation (e.g., rejecting strings longer than a fixed threshold), serve as a simple first-line defense, proven effective in over 20% of historical vulnerability fixes by bounding resource usage. Embedding these tests in continuous integration pipelines ensures ongoing validation, with benchmarks on diverse inputs confirming linear performance.
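
A sketch of both practices in Python (the pattern, length bound, and time budget are illustrative assumptions, not prescribed values):

import re
import time

MAX_INPUT_LENGTH = 1000  # assumed application-specific bound
TIME_BUDGET = 0.1        # seconds; assumed per-match budget

def safe_fullmatch(pattern, text):
    # First-line defense: reject oversized input before the regex runs.
    if len(text) > MAX_INPUT_LENGTH:
        return None
    return re.fullmatch(pattern, text)

def test_regex_stays_fast():
    # Adversarial unit test: a long, almost-matching string must still be
    # processed within the time budget, or the test fails.
    adversarial = 'a' * (MAX_INPUT_LENGTH - 1) + '!'
    start = time.perf_counter()
    safe_fullmatch(r'[^@]+@[^@]+', adversarial)  # hypothetical pattern under test
    assert time.perf_counter() - start < TIME_BUDGET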

Detection Tools

Static analyzers identify potential ReDoS vulnerabilities by examining patterns without executing them, often modeling backtracking depth through analysis of nested quantifiers, ambiguity in alternatives, and polynomial-time complexity bounds. Some linters use pattern-matching rules to flag vulnerable regexes, such as those with repeated capturing groups containing optional elements, supporting multiple languages including Python and JavaScript. RegexStaticAnalysis employs static techniques to detect catastrophic backtracking in expressions like (ht|f)tp(s?)://[^\s]*, evaluating structural risks in real-world codebases. Similarly, RXXR2, an extension of earlier regex analyzers, uses pumping lemma-based analysis to bound execution time and identify exponential blowups.

Dynamic testing tools complement static methods by executing regexes against crafted inputs to measure runtime behavior and confirm vulnerabilities. Fuzzers such as regexploit generate adversarial strings heuristically to trigger excessive backtracking, providing proof-of-concept exploits for patterns like ^((mailto:)?[\w.%+-]+@[\w.-]+\.[\w]{2,20})?$. ReScue applies genetic algorithms to evolve input strings that maximize processing time, effectively discovering ReDoS triggers in non-trivial cases by iteratively mutating candidates based on fitness scores derived from execution duration.

Benchmarks evaluating these tools show varying accuracy, with advanced analyzers like RENGAR achieving near-complete detection of known vulnerabilities—detecting all those found by nine prior tools plus 3 to 197 times more in large datasets—while maintaining low false positives through principled modeling of backtracking states. In comparative tests on benchmark sets of vulnerable expressions, static tools like RegexStaticAnalysis identified up to 46% of cases, though hybrid approaches often exceed 90% recall in recent studies. Many tools, including CodeQL, integrate with CI/CD pipelines via plugins for GitHub Actions or Jenkins, enabling automated scanning during code reviews to catch ReDoS risks in legacy codebases early.
