URI normalization

URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.
Search engines employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce the indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached. Web servers may also perform normalization for many reasons, e.g. to more easily detect security risks in client requests, or to use a single absolute file name for each resource in their caches and log files.
Normalization process
There are several types of normalization that may be performed. Some always preserve semantics, while others may not.
Normalizations that preserve semantics
The following normalizations are described in RFC 3986[1] as resulting in equivalent URIs:
- Converting percent-encoded triplets to uppercase. The hexadecimal digits within a percent-encoding triplet of the URI (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F.[2] Example:
http://example.com/foo%2a → http://example.com/foo%2A
- Converting the scheme and host to lowercase. The scheme and host components of the URI are case-insensitive and therefore should be normalized to lowercase.[3] Example:
HTTP://User@Example.COM/Foo → http://User@example.com/Foo
- Decoding percent-encoded triplets of unreserved characters. Percent-encoded triplets of the URI in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) do not require percent-encoding and should be decoded to their corresponding unreserved characters.[4] Example:
http://example.com/%7Efoo → http://example.com/~foo
- Removing dot-segments. The dot-segments "." and ".." in the path component of the URI should be removed by applying the remove_dot_segments algorithm[5] to the path, as described in RFC 3986.[6] Example:
http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux
- Converting an empty path to a "/" path. In the presence of an authority component, an empty path component should be normalized to a path component of "/".[7] Example:
http://example.com → http://example.com/
- Removing the default port. An empty or default port component of the URI (port 80 for the http scheme) with its ":" delimiter should be removed.[7] Example:
http://example.com:80/ → http://example.com/
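Taken together, the rules above can be sketched in Python with only the standard library. This is a minimal illustration of the RFC 3986 rules, not a production normalizer: it knows only the http/https default ports, and its dot-segment removal is a simplified form of the RFC algorithm. The function name `normalize` is illustrative.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 "unreserved" characters: safe to percent-decode.
_UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz"
                  "0123456789-._~")
_DEFAULT_PORTS = {"http": 80, "https": 443}

def _fix_percent(text):
    """Uppercase %XX hex digits; decode triplets that encode unreserved chars."""
    def repl(m):
        char = chr(int(m.group(1), 16))
        return char if char in _UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)

def _remove_dot_segments(path):
    """Simplified form of the RFC 3986 remove_dot_segments algorithm."""
    segs = path.split("/")
    out = []
    for seg in segs:
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:
                out.pop()
        else:
            out.append(seg)
    # A trailing "." or ".." leaves a trailing slash.
    if segs and segs[-1] in (".", "..") and (not out or out[-1] != ""):
        out.append("")
    return "/".join(out)

def normalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    netloc = (parts.hostname or "").lower()
    if parts.username:                      # keep userinfo as-is (case-sensitive)
        cred = parts.username
        if parts.password:
            cred += ":" + parts.password
        netloc = cred + "@" + netloc
    if parts.port is not None and parts.port != _DEFAULT_PORTS.get(scheme):
        netloc += ":%d" % parts.port        # drop only the scheme's default port
    path = _remove_dot_segments(_fix_percent(parts.path))
    if netloc and not path:
        path = "/"                          # empty path with authority -> "/"
    return urlunsplit((scheme, netloc, path, _fix_percent(parts.query),
                       parts.fragment))
```

For example, `normalize("HTTP://Example.COM:80/%7efoo/./bar/../baz")` combines all six rules in one pass and yields `http://example.com/~foo/baz`.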
Normalizations that usually preserve semantics
For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but this is not guaranteed by the standards:
- Adding a trailing "/" to a non-empty path. Directories (folders) are indicated with a trailing slash and should be included in URIs. Example:
http://example.com/foo → http://example.com/foo/
However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.
Normalizations that change semantics
Applying the following normalizations results in a semantically different URI, although it may refer to the same resource:
- Removing the directory index. Default directory indexes are generally not needed in URIs. Examples:
http://example.com/a/index.html → http://example.com/a/
http://example.com/default.asp → http://example.com/
- Removing the fragment. The fragment component of a URI is never seen by the server and can sometimes be removed. Example:
http://example.com/bar.html#section1 → http://example.com/bar.html
However, AJAX applications frequently use the value in the fragment.
- Replacing an IP address with a domain name. Check whether the IP address maps to a domain name. Example:
http://208.77.188.166/ → http://example.com/
The reverse replacement is rarely safe due to virtual web servers.
- Limiting protocols. Restricting the application-layer protocols that may be used; for example, the “https” scheme could be replaced with “http”. Example:
https://example.com/ → http://example.com/
- Removing duplicate slashes. Paths which include two adjacent slashes could be converted to one. Example:
http://example.com/foo//bar.html → http://example.com/foo/bar.html
- Removing or adding “www” as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a naked domain. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example:
http://www.example.com/ → http://example.com/
- Sorting the query parameters. Some web pages use more than one query parameter in the URI. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URI. Example:
http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en
However, the order of parameters in a URI may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.[8]
- Removing unused query variables. A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:
http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123
Note that a parameter without a value is not necessarily an unused parameter.
- Removing default query parameters. A default value in the query string may render identically whether it is there or not. Example:
http://example.com/display?id=&sort=ascending → http://example.com/display
- Removing the "?" when the query is empty. When the query is empty, there may be no need for the "?". Example:
http://example.com/display? → http://example.com/display
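Two of the transformations above, sorting query parameters and dropping an empty "?", can be sketched with `urllib.parse`. As the text warns, reordering parameters is only safe when the server treats their order as insignificant; the helper names are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urldefrag

def sort_query(uri):
    """Sort query parameters by name (then value) and reassemble the URI.
    An empty query also loses its "?" delimiter on reassembly."""
    parts = urlsplit(uri)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))

def strip_fragment(uri):
    """Drop the #fragment, which is never sent to the server."""
    return urldefrag(uri).url
```

For example, `sort_query("http://example.com/display?lang=en&article=fred")` returns `http://example.com/display?article=fred&lang=en`, matching the example above.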
Normalization based on URI lists
Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI
http://example.com/story?id=xyz
appears in a crawl log several times along with
http://example.com/story_xyz
we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.
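As a toy illustration of this idea (not the published DustBuster algorithm), one can measure how often a candidate rewrite rule maps URIs in a log onto other URIs that also appear in the log; high support suggests an alias rule. The pattern and log below are made-up examples:

```python
import re

def rule_support(uris, pattern, replacement):
    """Fraction of matching URIs whose rewritten form also appears in the
    list -- evidence that pattern -> replacement is an alias (DUST) rule."""
    uriset = set(uris)
    matched = hits = 0
    for u in uris:
        rewritten = re.sub(pattern, replacement, u)
        if rewritten != u:
            matched += 1
            if rewritten in uriset:
                hits += 1
    return hits / matched if matched else 0.0

log = ["http://example.com/story?id=xyz", "http://example.com/story_xyz",
       "http://example.com/story?id=abc", "http://example.com/story_abc",
       "http://example.com/about"]
support = rule_support(log, r"story\?id=(\w+)$", r"story_\1")
```

Here every URI matching the candidate rule has its rewritten twin in the log, so the support is 1.0; a rule with low support would be discarded.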
Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.
See also
- URL (Uniform Resource Locator)
- URI fragment
- Web crawler
References
- ^ RFC 3986, Section 6. Normalization and Comparison
- ^ RFC 3986, Section 6.2.2.1. Case Normalization
- ^ RFC 3986, Section 6.2.2.1. Case Normalization
- ^ RFC 3986, Section 6.2.2.3. Path Segment Normalization
- ^ RFC 3986, 5.2.4. Remove Dot Segments
- ^ RFC 3986, 6.2.2.3. Path Segment Normalization
- ^ a b RFC 3986, Section 6.2.3. Scheme-Based Normalization
- ^ "jQuery 1.4 $.param demystified". Ben Alman. December 20, 2009. Retrieved August 24, 2013.
- RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
- Sang Ho Lee; Sung Jin Kim & Seok Hoo Hong (2005). On URL normalization (PDF). Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005). pp. 1076–1085. Archived from the original (PDF) on September 18, 2006.
- Uri Schonfeld; Ziv Bar-Yossef & Idit Keidar (2006). Do not crawl in the dust: different URLs with similar text. Proceedings of the 15th international conference on World Wide Web. pp. 1015–1016.
- Uri Schonfeld; Ziv Bar-Yossef & Idit Keidar (2007). Do not crawl in the dust: different URLs with similar text. Proceedings of the 16th international conference on World Wide Web. pp. 111–120.
Fundamentals
Definition and Purpose
URI normalization is the process of syntactically transforming a Uniform Resource Identifier (URI) into a canonical form that preserves its semantic meaning, thereby resolving variations in representation that could otherwise lead to inconsistent handling. This involves applying standardized rules to eliminate syntactic ambiguities, such as case differences in certain components or redundant path segments, ensuring that equivalent URIs are treated as identical.[5][6]

The primary purpose of URI normalization is to facilitate reliable comparison, storage, deduplication, and processing of URIs across diverse systems, including web caches, databases, and security filters. By standardizing URI representations, normalization minimizes discrepancies that arise from different encoding practices or input methods, enabling applications to accurately determine when two URIs refer to the same resource without false distinctions. This is particularly essential in distributed environments like the World Wide Web, where URIs serve as fundamental identifiers for resources.[7][4]

Historically, URI normalization emerged from the architectural needs of the web to handle resource identification consistently, with its principles first formalized in RFC 2396 in August 1998, which defined generic syntax and equivalence rules for URIs. These concepts were subsequently refined and expanded in RFC 3986 in January 2005, providing a more comprehensive framework for normalization while obsoleting the earlier specification. The evolution addressed growing complexities in URI usage, such as relative references and percent-encoding variations.[8][1]

Key benefits of URI normalization include reducing false negatives in URI matching (where equivalent resources might otherwise be treated as distinct) and enhancing overall efficiency in distributed systems by promoting uniformity and reducing storage overhead from duplicate representations. This standardization supports critical web operations, such as caching and linking, by ensuring semantic equivalence is verifiable through syntactic comparison.[4][9]

URI Syntax Overview
A Uniform Resource Identifier (URI) follows a generic syntax defined as URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ], where the hierarchical part (hier-part) may include an authority component prefixed by "//".[10] This structure allows URIs to reference resources across various protocols and systems, with components that can introduce syntactic variations affecting equivalence comparisons.[11]
The scheme component identifies the protocol or resource type, such as "http" or "ftp", and is followed by a colon; it is case-insensitive and conventionally represented in lowercase.[12] For schemes requiring an authority, the hier-part begins with "//" followed by the authority, which consists of optional user information (userinfo), a host, and an optional port: [ userinfo "@" ] host [ ":" port ].[13] The userinfo subcomponent, typically a username and password separated by a colon (e.g., "user:pass@"), is deprecated due to security concerns.[14] The host represents the domain name, IP literal, or IPv4 address (e.g., "example.com" or "[2001:db8::1]"), and is case-insensitive.[15] The port, if present, is a decimal integer indicating the server's port number (e.g., ":8080").[16]
Following the authority (if any), the path component denotes the hierarchical location of the resource, consisting of one or more segments separated by slashes ("/"); it can be absolute (starting with "/") or relative.[17] For example, "/documents/resource.txt" represents a file path. The optional query component, introduced by "?", carries non-hierarchical data often as key-value pairs (e.g., "?id=123&name=example"), with its format scheme-dependent.[18] The fragment identifier, starting with "#", references a secondary resource or section within the primary one (e.g., "#section1"), and its interpretation is document-specific.[19]
Sources of variability in URI syntax include case sensitivity rules: the scheme and host are case-insensitive, while the path, query, and fragment are case-sensitive unless specified otherwise by the scheme.[20] Percent-encoding allows reserved or unreserved characters to be encoded as "%" followed by two hexadecimal digits (e.g., "%20" for a space), providing flexibility but multiple encoding options for the same octet sequence.[21] Path delimiters are strictly forward slashes ("/") in generic URI syntax, though some schemes or implementations may tolerate backslashes, leading to inconsistencies.[17] URIs are constrained to US-ASCII characters, with Internationalized Resource Identifiers (IRIs) extending this to Unicode by allowing non-ASCII characters that must then be percent-encoded when mapping to URIs.[22]
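The components described above can be inspected with a standard parser; as a quick sketch, `urllib.parse.urlsplit` follows the generic syntax (the example URI is made up):

```python
from urllib.parse import urlsplit

# scheme "://" [userinfo "@"] host [":" port] path ["?" query] ["#" fragment]
p = urlsplit("https://user@example.com:8080/documents/resource.txt"
             "?id=123&name=example#section1")

assert p.scheme == "https"
assert p.username == "user"
assert p.hostname == "example.com"   # case-insensitive, reported lowercase
assert p.port == 8080
assert p.path == "/documents/resource.txt"
assert p.query == "id=123&name=example"
assert p.fragment == "section1"
```

Note that the parser already reflects the case-sensitivity rules above: the host is reported lowercase, while the path, query, and fragment are returned verbatim.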
Normalization Principles
Semantic Equivalence
Semantic equivalence in URIs refers to the principle that two URIs identify the same resource if they resolve to it, irrespective of syntactic variations in their representation.[7] This determination relies on scheme-specific rules for comparison, as equivalence is assessed through normalized string matching rather than full semantic analysis of the underlying resource.[4] For instance, under the HTTP scheme, URIs such as http://example.com and http://example.com/ are equivalent because the trailing slash does not alter the identified resource, treating an empty path as equivalent to a root path "/".[9]
Equivalence classes group URIs that, after applying normalization, produce identical strings under the governing scheme's rules.[2] These classes account for differences like case insensitivity in hostnames (e.g., "Example.com" vs. "example.com" for most schemes) or default port omissions (e.g., http://example.com:80/ equivalent to http://example.com/).[9] Scheme-specific behaviors are crucial here; for example, the "mailto" scheme may permit case variations in local parts, while generic syntax emphasizes syntactic normalization to define these classes without runtime resolution in many cases.[9]
In URI normalization, semantic equivalence ensures that transformations preserve resource identification, avoiding alterations that could lead to false negatives in comparisons.[7] Per RFC 3986 Section 6, normalization applies syntax-based rules (e.g., percent-encoding consistency) and scheme-based adjustments before equivalence checks, enabling reliable deduplication in applications like caching or linking.[2]
However, semantic equivalence is not purely syntactic; some determinations require runtime mechanisms, such as DNS resolution for hostnames or HTTP redirects, which fall outside static normalization.[4] Relative URIs, for example, depend on a base URI for resolution, making their equivalence context-dependent and not fully resolvable without execution.[4] This limitation means normalization cannot eliminate all potential false negatives, as exhaustive equivalence testing would impose undue computational costs.[4]
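A sketch of equivalence classes in code: the key function below applies a few of the rules just described (lowercase scheme and host, default-port removal, empty path to "/") so that equivalent http URIs collapse onto one key. It deliberately ignores userinfo, query, and fragment, so it is an approximation for illustration only:

```python
from urllib.parse import urlsplit

_DEFAULTS = {"http": 80, "https": 443}

def equivalence_key(uri):
    """Map a URI to a representative of its (approximate) equivalence class."""
    p = urlsplit(uri)
    scheme = p.scheme.lower()
    host = (p.hostname or "").lower()
    port = "" if p.port in (None, _DEFAULTS.get(scheme)) else ":%d" % p.port
    return "%s://%s%s%s" % (scheme, host, port, p.path or "/")

uris = ["http://example.com", "HTTP://Example.com/",
        "http://example.com:80/", "http://example.com/other"]
classes = {}
for u in uris:
    classes.setdefault(equivalence_key(u), []).append(u)
# The first three URIs share the key "http://example.com/".
```

Grouping by such a key is the usual way deduplication is done in practice: equivalence is decided by comparing normalized strings, not by dereferencing the resources.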
Canonicalization Goals
Canonicalization in URI processing refers to the transformation of a URI into its simplest, unique representation within an equivalence class, ensuring that equivalent URIs map to the same standardized form.[2] This process aims to eliminate syntactic variations that do not alter the resource identified, producing a form suitable for consistent comparison and storage.[23]

The primary goals of canonicalization include enhancing interoperability across diverse systems such as web servers, browsers, and APIs, where inconsistent URI representations can lead to failed resolutions or mismatched resources.[7] For security, it prevents bypasses of access controls by exploiting variant forms, such as those with redundant encodings or default ports, thereby reducing vulnerabilities in authentication and authorization mechanisms.[24] Efficiency is another key objective, enabling optimized operations like hash-based storage, caching, and indexing by minimizing duplicate entries and facilitating faster equivalence checks.[25]

These goals align with established standards, including RFC 3986, which specifies syntax-based normalization to reduce aliases while preserving semantics, and the WHATWG URL Standard (initially published in 2014 with ongoing updates), which emphasizes idempotent parsing and serialization for web contexts.[2][26] A representative practice is the omission of default ports in canonical forms, such as excluding :80 for HTTP URIs (e.g., http://example.com instead of http://example.com:80), to promote uniformity without changing the identified resource.[16][27]
Challenges in achieving these goals involve balancing thoroughness with computational performance, as exhaustive normalization can introduce overhead in high-volume systems.[9] Additionally, not all URIs admit a single canonical form due to scheme-specific rules, such as varying interpretations of empty hosts or paths across protocols like HTTP and FTP.[9][28]
Normalization Techniques
Semantics-Preserving Transformations
Semantics-preserving transformations in URI normalization refer to syntactic adjustments that standardize the representation of a URI without altering its underlying meaning or resolution behavior. These operations are scheme-independent and apply universally to generic URI syntax, ensuring that the transformed URI resolves to the exact same resource as the original. According to RFC 3986, such transformations are designed to eliminate syntactic variations that do not affect equivalence, thereby facilitating consistent comparison and processing across systems.

Case normalization is a fundamental semantics-preserving transformation that involves converting certain URI components to lowercase, as uppercase and lowercase forms are semantically equivalent in those contexts. Specifically, the scheme (e.g., "HTTP" to "http") and the host (e.g., "Example.com" to "example.com") are lowercased, while other components like paths, queries, and fragments remain case-sensitive and unchanged. For instance, the URI HTTP://Example.com/ normalizes to http://example.com/, preserving the reference to the same origin server. This rule stems from the ASCII-insensitive nature of scheme and host matching in URI resolution, as defined in the generic syntax.[20]
Percent-encoding normalization addresses variations in how characters are encoded using the "%" followed by hexadecimal digits. It requires uppercasing all hexadecimal digits in percent-encodings (e.g., "%3f" to "%3F") and, where applicable, decoding unnecessary percent-encoded unreserved characters (such as ALPHA, DIGIT, hyphen, period, underscore, or tilde) to their literal forms. This ensures that equivalent encodings, like "%41" for 'A' and "%61" for 'a', are standardized without changing the octet sequence interpreted by the target resource. An example is the transformation of http://example.com/foo%2fbar to http://example.com/foo%2Fbar, where the lowercase hex digits of the percent-encoding are uppercased. RFC 3986 recommends these steps as safe for all URIs, as they operate solely on the syntactic layer without relying on scheme-specific decoding rules.[29]
Path segment normalization removes redundant or ambiguous elements in the path component to produce a canonical form. This includes replacing "/./" with "/" (removing current-directory references), resolving "/../" by navigating upward and removing the parent directory (with adjustments to avoid exceeding the root), and eliminating empty path segments like consecutive slashes "//". For example, http://example.com/a/b/../c/./d normalizes to http://example.com/a/c/d, as the "/../" removes "b" and "/./" collapses to nothing, without altering the resource hierarchy. These operations are purely syntactic and preserve semantics because they mirror the standard path resolution algorithm in URI handling, applicable across schemes like HTTP and file.[30]
A practical illustration of combined semantics-preserving transformations is the normalization of http://exAMPle.com/A%2fb to http://example.com/A%2Fb, where case normalization lowercases the scheme and host, percent-encoding normalization uppercases the hex digits in "%2f" to "%2F", and path normalization has no effect here (path case is preserved as case-sensitive, and "%2F" remains encoded because "/" is reserved). The result, http://example.com/A%2Fb, is fully equivalent to the original in resolution. RFC 3986 explicitly endorses these as the core safe transformations for achieving syntactic canonicalization, confirming their preservation of URI equivalence through scheme-independent rules that avoid any interpretation of the resource's content or context.[2]
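The remove_dot_segments procedure referenced above can be written directly from the five rules of RFC 3986 section 5.2.4; the sketch below uses the iterative two-buffer form of the specification:

```python
def remove_dot_segments(path):
    """Iterative remove_dot_segments (RFC 3986, section 5.2.4)."""
    inp, out = path, ""
    while inp:
        if inp.startswith("../"):            # rule A: leading ../
            inp = inp[3:]
        elif inp.startswith("./"):           # rule A: leading ./
            inp = inp[2:]
        elif inp.startswith("/./"):          # rule B: /./ -> /
            inp = "/" + inp[3:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):         # rule C: /../ pops a segment
            inp = "/" + inp[4:]
            out = out[:out.rfind("/")] if "/" in out else ""
        elif inp == "/..":
            inp = "/"
            out = out[:out.rfind("/")] if "/" in out else ""
        elif inp in (".", ".."):             # rule D: bare dot segments
            inp = ""
        else:                                # rule E: move one segment across
            cut = inp.find("/", 1)
            if cut == -1:
                out, inp = out + inp, ""
            else:
                out, inp = out + inp[:cut], inp[cut:]
    return out
```

For example, `remove_dot_segments("/a/b/../c/./d")` returns `"/a/c/d"`, and excess ".." segments at the root are discarded rather than escaping it.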
Conditional Transformations
Conditional transformations in URI normalization involve modifications that typically preserve the semantic equivalence of the URI but are applied only under specific conditions, such as the URI scheme, environmental context, or application requirements. These transformations are not universally safe like syntax-only changes but are commonly used in protocols like HTTP and HTTPS to standardize representations while avoiding unintended alterations in other schemes, such as file: or mailto:. Misapplication of these transformations can lead to equivalence errors or security risks, such as exposing sensitive information or resolving to incorrect resources.[9]

Host normalization often includes lowercasing the host component and handling internationalized domain names (IDNs) through Punycode encoding, but this is conditional on schemes that support domain name resolution, primarily HTTP and HTTPS. For IDNs, the host is converted to its ASCII-compatible encoding (ACE) form using the Internationalizing Domain Names in Applications (IDNA) protocol, which employs Punycode to represent Unicode characters in a DNS-compatible format; for example, "faß.example" becomes "xn--fa-hia.example". This step ensures consistent resolution but is not applied to opaque hosts or non-resolvable schemes like file:, where such encoding could alter the intended local path semantics.[15][31]

Default port removal is another scheme-dependent transformation, where the port component and its delimiter are omitted if the port matches the scheme's default value, such as :80 for HTTP or :443 for HTTPS. For instance, "http://example.com:80/" normalizes to "http://example.com/", preserving access to the same resource while simplifying the URI. This is not applied to schemes without defined defaults, like generic URI references or custom protocols, to avoid assuming connectivity details that might change the URI's interpretation.[16] For HTTP URIs, an empty path is normalized to "/".
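A minimal sketch of these scheme-conditional steps using Python's standard library follows; note that the stdlib `idna` codec implements IDNA 2003 rather than the newer IDNA 2008, the function name is illustrative, and the rebuilt authority also silently drops any (deprecated) userinfo:

```python
from urllib.parse import urlsplit, urlunsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}

def conditional_normalize(uri):
    """Apply scheme-conditional rules to http/https URIs only:
    encode an IDN host to Punycode, drop a default port, and
    normalize an empty path to "/". Other schemes pass through."""
    p = urlsplit(uri)
    if p.scheme not in _DEFAULT_PORTS:
        return uri                            # not http/https: leave untouched
    host = p.hostname or ""
    try:
        host = host.encode("idna").decode("ascii")   # IDNA 2003 via the stdlib
    except UnicodeError:
        host = host.lower()                   # fall back for unusual hosts
    netloc = host
    if p.port is not None and p.port != _DEFAULT_PORTS[p.scheme]:
        netloc += ":%d" % p.port
    return urlunsplit((p.scheme, netloc, p.path or "/", p.query, p.fragment))
```

For instance, `conditional_normalize("https://bücher.example/")` yields the Punycode form `https://xn--bcher-kva.example/`, while a mailto: URI is returned unchanged.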
For example, "http://example.com" becomes "http://example.com/".

Query parameter sorting is an optional technique used in some canonical comparison scenarios, where parameters are reordered alphabetically by key (with values) to facilitate equivalence checks when the application treats parameter order as insignificant. This is common in web security contexts, such as signature validation in APIs, but skipped if order matters, as in certain RESTful services where sequence affects processing (e.g., multi-step form submissions). For example, "?b=2&a=1" might normalize to "?a=1&b=2" for deduplication, yet this risks semantic loss in order-dependent APIs. This practice is application-specific and not part of the standard URI normalization defined in RFC 3986.

Additional examples include IPv6 address handling, where literal IPv6 hosts must be enclosed in square brackets during normalization to distinguish the address from port delimiters, such as normalizing "[2001:db8::1]:8080" while ensuring brackets for the host part in schemes like HTTP. Similarly, in non-authentication contexts, the userinfo subcomponent (username:password) may be removed entirely, as it is deprecated and poses privacy risks when logged or shared; for instance, "http://user:pass@example.com/" becomes "http://example.com/" for HTTP URIs without basic-auth needs. These removals are applied conditionally to avoid breaking schemes that rely on embedded credentials, such as certain proxy configurations, and underscore the importance of scheme awareness to prevent misresolution.[15][14][32]

Semantics-Altering Transformations
Semantics-altering transformations in URI normalization involve modifications that change the intended meaning or scope of the URI, diverging from the standard equivalence rules outlined in RFC 3986. These are applied selectively in contexts such as resource comparison, security analysis, or content archiving, where preserving the exact reference is secondary to broader operational needs. Unlike syntax-preserving normalizations, these steps can introduce ambiguities or unintended behaviors if misapplied, as they alter how the URI identifies a resource.[7]

Fragment removal is a common semantics-altering transformation used when comparing URIs for resource retrieval, as the fragment identifier ("#fragment") targets a secondary resource within the primary one and is not transmitted to the server during dereferencing. According to RFC 3986, fragments are processed client-side based on the representation's media type, making URIs differing only in fragments non-equivalent for full reference comparison but equivalent when excluding fragments for primary resource identification. Removing the fragment, such as transforming "http://example.com/page#section" to "http://example.com/page", shifts the semantics from a specific intra-document location to the entire resource, which is useful in caching or indexing but risks losing navigation intent. This practice is explicitly noted in the RFC as altering equivalence for dereference operations.[19][4]

Relative URI resolution converts a relative reference to an absolute URI by merging it with a specified base URI, inherently assuming a contextual scope that can alter the target resource if the base is incorrect or contextually mismatched. The resolution algorithm in RFC 3986 involves parsing the base into components, inheriting the scheme and authority if absent in the relative form, and appending or merging paths while removing dot-segments, which embeds assumptions about the environment and may expand the URI's scope beyond its original relative intent. For instance, resolving "./path" against "http://example.com/dir/" yields "http://example.com/dir/path", but using a different base like "https://other.com/" changes the authority and security context entirely. This transformation is essential for processing relative links in documents but can lead to semantic shifts in cross-context applications.[33]

Scheme conversion, such as changing "http" to "https", modifies the protocol's inherent semantics, particularly regarding security and transport, and is not part of standard URI normalization but appears in redirect scenarios or enforcement policies. RFC 3986 defines schemes as case-insensitive but distinct in their operational rules, so altering the scheme redefines the URI's access method and potential encryption, potentially violating equivalence. This is employed in web security to enforce secure connections but introduces risks like open redirects, where attackers manipulate redirect parameters to bypass intended domains, as documented in CWE-601. For example, a redirect from "http://example.com/redirect?url=http://malicious.com" to HTTPS might still enable phishing if validation is lax. Such conversions are cautioned against in generic normalization due to their impact on resource accessibility.[12][34]

These transformations carry risks in security contexts, such as enabling open redirects or semantic attacks where crafted URIs exploit normalization to mislead users or servers, as highlighted in RFC 3986's security considerations.
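Two of the operations discussed here, fragment removal and base-dependent relative resolution, are available directly in `urllib.parse`; this sketch shows how the choice of base URI changes the resolved target:

```python
from urllib.parse import urldefrag, urljoin

# Fragment removal: the fragment never reaches the server.
url, frag = urldefrag("http://example.com/page#section")
assert url == "http://example.com/page" and frag == "section"

# Relative resolution: the result depends entirely on the base URI.
assert urljoin("http://example.com/dir/", "./path") == "http://example.com/dir/path"
assert urljoin("https://other.example/", "./path") == "https://other.example/path"
```

The second pair of calls makes the risk concrete: the same relative reference "./path" resolves to entirely different origins under different bases.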
These transformations are used judiciously in archiving to standardize references, or in security tools to simulate retrieval without fragments, but RFC 3986 emphasizes that non-equivalent changes like these should be avoided in equivalence testing to prevent false positives in resource identification. In practice, applications must document such alterations to mitigate unintended semantic shifts.[24][7]

Implementation Approaches
Step-by-Step Normalization Process
The step-by-step normalization process for URIs, as outlined in RFC 3986, transforms a given URI reference into a canonical form suitable for equivalence comparisons by applying a sequence of syntax-based and scheme-based transformations while preserving semantics. This process begins with parsing the URI into its components (scheme, authority, path, query, and fragment) using the regular expression provided in Appendix B of RFC 3986, which ensures accurate disambiguation of elements through a "first-match-wins" greedy algorithm.[35] For Internationalized Resource Identifiers (IRIs), an initial conversion to URI form is required per RFC 3987, involving Unicode Normalization Form C (NFC) on the character sequence followed by UTF-8 encoding and percent-encoding of non-ASCII characters in components outside the authority's ireg-name (with optional Punycode for internationalized domain names via ToASCII).[36] The resulting URI then undergoes the core normalization steps to produce a comparison-ready form that minimizes syntactic variations without altering the resource reference.

The algorithm proceeds as follows:

- Parse the URI into components: Decompose the input string into scheme, authority (userinfo, host, port), path, query, and fragment using the ABNF grammar or the equivalent regular expression from Appendix B. This step identifies boundaries, such as authority delimiters (//) and path separators (/), handling relative references by resolving against a base URI if needed via the algorithm in Section 5.2 of RFC 3986.[33][35]
- Apply case normalization: Convert the scheme to lowercase (e.g., "HTTP" becomes "http"). For the host component within authority, perform case normalization to lowercase, except for IPv6 addresses which remain unchanged. Userinfo and port are left as-is unless scheme-specific rules apply. This ensures equivalence for schemes and hosts that are case-insensitive per Section 3.1 of RFC 3986.[20]
- Normalize percent-encodings: Decode all percent-encoded octets that correspond to unreserved characters (ALPHA, DIGIT, hyphen, period, underscore, tilde) back to their literal form, as these encodings are semantically equivalent per Section 2.3 of RFC 3986 (e.g., "%7E" becomes "~"). Retain encodings for reserved characters (gen-delims, sub-delims) unless they represent unreserved ones. Use uppercase hexadecimal digits (A-F) for any remaining percent-encodings to standardize representation. Malformed percent-encodings (e.g., incomplete %HH or invalid hex) should trigger error handling, such as rejecting the URI or applying partial normalization by skipping invalid segments.[29][37]
- Resolve path segments: Apply the remove_dot_segments algorithm from Section 5.2.4 of RFC 3986 to the path component to eliminate "." and ".." segments, merging adjacent slashes and handling empty paths scheme-dependently (e.g., normalize empty HTTP paths to "/"). This step uses an iterative buffer-based approach: initialize an output buffer, process input path segments one by one—skipping "." segments, popping the last output segment for ".." (if non-empty), and appending other segments—until the input is exhausted, then join with "/". An iterative implementation is preferred over recursive for performance, especially with deeply nested paths, as recursion risks stack overflow in long URIs while iteration processes in O(n) time with constant space.[3][30]
- Handle query and fragment conditionally: For the query, apply percent-decoding of unreserved characters as in step 3, but avoid re-encoding unless necessary for scheme-specific equivalence (e.g., HTTP query parameters). The fragment is typically not normalized beyond percent-decoding, as it is opaque and client-side, though sorting query parameters by name can be a protocol-based extension for canonicalization in contexts like digital signatures. Invalid queries (e.g., unbalanced delimiters) may warrant partial normalization by truncating or flagging.[9]
- Reassemble the normalized URI: Concatenate the processed components per the generic syntax in Section 3 of RFC 3986: scheme + ":" + "//" (if authority present) + authority + path + "?" + query (if present) + "#" + fragment (if present). Omit default elements like port 80 for HTTP to achieve scheme-based normalization. This yields a canonical form ready for comparison, where two URIs are equivalent if their normalized strings match exactly.[9][10]
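Steps 1 and 6 of the process above can be sketched with the Appendix B regular expression and the Section 5.3 recomposition rules; the helper names are illustrative:

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    m = _URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

def recompose(scheme, authority, path, query, fragment):
    """Reassemble components per RFC 3986, Section 5.3."""
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if authority is not None:
        out += "//" + authority
    out += path
    if query is not None:
        out += "?" + query
    if fragment is not None:
        out += "#" + fragment
    return out
```

For a well-formed URI such as "http://example.com:80/a/./b?x=1#top", `split_uri` yields ("http", "example.com:80", "/a/./b", "x=1", "top"), and `recompose(*split_uri(u))` round-trips the string; the normalization steps 2 through 5 operate on the components in between.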
