Uniform Resource Identifier
| Uniform Resource Identifier | |
|---|---|
| Abbreviation | URI |
| Native name | |
| Status | Active |
| Organization | Internet Engineering Task Force |
| Authors | |
| Domain | World Wide Web |
| Website | datatracker |
A Uniform Resource Identifier (URI), formerly Universal Resource Identifier, is a unique sequence of characters that identifies an abstract or physical resource,[1]: 1 such as resources on a webpage, email addresses, phone numbers,[1]: 7 books, real-world objects such as people and places, and concepts.[1]: 5
URIs which provide a means of locating and retrieving information resources on a network (either on the Internet or on another private network, such as a computer file system or an Intranet) are Uniform Resource Locators (URLs). Therefore, URLs are a subset of URIs, i.e. every URL is a URI (and not necessarily the other way around).[1]: 7 Other URIs provide only a unique name, without a means of locating or retrieving the resource or information about it; these are Uniform Resource Names (URNs). The web technologies that use URIs are not limited to web browsers.
History
Conception
URIs and URLs have a shared history. In 1990, Tim Berners-Lee's proposals for hypertext implicitly introduced the idea of a URL as a short string representing a resource that is the target of a hyperlink.[2] At the time, people referred to it as a "hypertext name"[3] or "document name".
Over the next three and a half years, as the World Wide Web's core technologies of HTML, HTTP, and web browsers developed, a need to distinguish a string that provided an address for a resource from a string that merely named a resource emerged. Although not yet formally defined, the term Uniform Resource Locator came to represent the former, and the more contentious Uniform Resource Name came to represent the latter. In July 1992 Berners-Lee's report on the Internet Engineering Task Force (IETF) "UDI (Universal Document Identifiers) BOF" mentions URLs (as Uniform Resource Locators), URNs (originally, as Unique Resource Numbers), and the need to charter a new working group.[4] In November 1992 the IETF "URI Working Group" met for the first time.[5]
During the debate over defining URLs and URNs, it became evident that the concepts embodied by the two terms were merely aspects of the fundamental, overarching notion of resource identification. In June 1994, the IETF published RFC 1630, Berners-Lee's first Request for Comments that acknowledged the existence of URLs and URNs. Most importantly, it defined a formal syntax for Universal Resource Identifiers (i.e. URL-like strings whose precise syntaxes and semantics depended on their schemes). It also attempted to summarize the syntaxes of URL schemes in use at the time. It acknowledged, but did not standardize, the existence of relative URLs and fragment identifiers.[6]
Refinement
In December 1994, RFC 1738[7] formally defined relative and absolute URLs, refined the general URL syntax, defined how to resolve relative URLs to absolute form, and better enumerated the URL schemes then in use. The agreed definition and syntax of URNs had to wait until the publication of IETF RFC 2141[8] in May 1997.
The publication of IETF RFC 2396[9] in August 1998 saw the URI syntax become a separate specification[9] and most of the parts of RFCs 1630 and 1738 relating to URIs and URLs in general were revised and expanded by the IETF. The new RFC changed the meaning of U in URI from "Universal" to "Uniform."
In December 1999, RFC 2732[10] provided a minor update to RFC 2396, allowing URIs to accommodate IPv6 addresses. A number of shortcomings discovered in the two specifications led to a community effort, coordinated by RFC 2396 co-author Roy Fielding, that culminated in the publication of IETF RFC 3986[1] in January 2005. While obsoleting the prior standard, it did not render the details of existing URL schemes obsolete; RFC 1738 continues to govern such schemes except where otherwise superseded. IETF RFC 2616[11] for example, refines the http scheme. Simultaneously, the IETF published the content of RFC 3986 as the full standard STD 66, reflecting the establishment of the URI generic syntax as an official Internet protocol.
In 2001, the World Wide Web Consortium's (W3C) Technical Architecture Group (TAG) published a guide to best practices and canonical URIs for publishing multiple versions of a given resource.[12] For example, content might differ by language or by size to adjust for capacity or settings of the device used to access that content.
In August 2002, IETF RFC 3305[13] pointed out that the term "URL" had, despite widespread public use, faded into near obsolescence, and serves only as a reminder that some URIs act as addresses by having schemes implying network accessibility, regardless of any such actual use. As URI-based standards such as Resource Description Framework make evident, resource identification need not suggest the retrieval of resource representations over the Internet, nor need they imply network-based resources at all.
The Semantic Web uses the HTTP URI scheme to identify both documents and concepts for practical uses, a distinction which has caused confusion as to how to distinguish the two. The TAG published an e-mail in 2005 with a solution of the problem, which became known as the httpRange-14 resolution.[14] The W3C subsequently published an Interest Group Note titled "Cool URIs for the Semantic Web", which explained the use of content negotiation and the HTTP 303 response code for redirections in more detail.[15]
Design
URLs and URNs
A Uniform Resource Name (URN) is a URI that identifies a resource by name in a particular namespace. A URN may be used to talk about a resource without implying its location or how to access it. For example, in the International Standard Book Number (ISBN) system, ISBN 0-486-27557-4 identifies a specific edition of the William Shakespeare play Romeo and Juliet. The URN for that edition would be urn:isbn:0-486-27557-4. However, it gives no information as to where to find a copy of that book.
A Uniform Resource Locator (URL) is a URI that specifies the means of acting upon or obtaining the representation of a resource, i.e. specifying both its primary access mechanism and network location. For example, the URL http://example.org/wiki/Main_Page refers to a resource identified as /wiki/Main_Page, whose representation is obtainable via the Hypertext Transfer Protocol (http:) from a network host whose domain name is example.org. (In this case, HTTP usually implies it to be in the form of HTML and related code. In practice, that is not necessarily the case, as HTTP allows specifying arbitrary formats in its header.)
A URN is analogous to a person's name, while a URL is analogous to their street address. In other words, a URN identifies an item and a URL provides a method for finding it.
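As a rough illustration of the difference, the two example identifiers above can be split with Python's standard `urllib.parse` module. This is only a sketch; the variable names are chosen for illustration. The URN yields a scheme and a path but no host, while the URL also carries a network location.

```python
from urllib.parse import urlsplit

# A URN names a resource; it has no authority (host) to contact.
urn = urlsplit("urn:isbn:0-486-27557-4")
# A URL additionally says how and where the resource can be reached.
url = urlsplit("http://example.org/wiki/Main_Page")

for label, parts in (("URN", urn), ("URL", url)):
    print(label, "| scheme:", parts.scheme, "| host:", parts.hostname, "| path:", parts.path)
```

For the URN, `hostname` is `None` because no authority component is present.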
Technical publications, especially standards produced by the IETF and by the W3C, normally reflect a view outlined in a W3C Recommendation of 30 July 2001, which acknowledges the precedence of the term URI rather than endorsing any formal subdivision into URL and URN.
"URL is a useful but informal concept: a URL is a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network "location"), rather than by some other attributes it may have."[16]
As such, a URL is simply a URI that happens to point to a resource over a network.[a][13] However, in non-technical contexts and in software for the World Wide Web, the term "URL" remains widely used. Additionally, the term "web address" (which has no formal definition) often occurs in non-technical publications as a synonym for a URI that uses the http or https schemes. Such assumptions can lead to confusion, for example, in the case of XML namespaces that have a visual similarity to resolvable URIs.
Specifications produced by the WHATWG prefer URL over URI, and so newer HTML5 APIs use URL over URI.[17]
"Standardize on the term URL. URI and IRI [Internationalized Resource Identifier] are just confusing. In practice a single algorithm is used for both so keeping them distinct is not helping anyone. URL also easily wins the search result popularity contest."[18]
While most URI schemes were originally designed to be used with a particular protocol, and often have the same name, they are semantically different from protocols. For example, the scheme http is generally used for interacting with web resources using HTTP, but the scheme file has no protocol.
Syntax
A URI has a scheme that refers to a specification for assigning identifiers within that scheme. As such, the URI syntax is a federated and extensible naming system wherein each scheme's specification may further restrict the syntax and semantics of identifiers using that scheme. The URI generic syntax is a superset of the syntax of all URI schemes. It was first defined in RFC 2396, published in August 1998,[9] and finalized in RFC 3986, published in January 2005.[19]
A URI is composed from an allowed set of ASCII characters consisting of reserved characters (gen-delims: :, /, ?, #, [, ], and @; sub-delims: !, $, &, ', (, ), *, +, ,, ;, and =),[1]: 13–14 unreserved characters (uppercase and lowercase letters, decimal digits, -, ., _, and ~),[1]: 13–14 and the character %.[1]: 12 Syntax components and subcomponents are separated by delimiters from the reserved characters (only from generic reserved characters for components) and define identifying data represented as unreserved characters, reserved characters that do not act as delimiters in the component and subcomponent respectively,[1]: §2 and percent-encodings when the corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoding of an identifying data octet is a sequence of three characters, consisting of the character % followed by the two hexadecimal digits representing that octet's numeric value.[1]: §2.1
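The percent-encoding rule can be tried out with `urllib.parse.quote` and `unquote` from the Python standard library. This is a minimal sketch: the helper name `percent_encode` is hypothetical, and passing `safe=""` is a choice made here so that every character outside the unreserved set is encoded, including `/`.

```python
from urllib.parse import quote, unquote

def percent_encode(text: str) -> str:
    """Encode every character that is not unreserved, using UTF-8 octets."""
    # safe="" stops quote() from leaving "/" unencoded, which is its default.
    return quote(text, safe="")

encoded = percent_encode("a b/c?d=é")
print(encoded)           # each encoded octet appears as "%" plus two hex digits
print(unquote(encoded))  # decoding restores the original text
```

Unreserved characters such as letters, digits, `-`, `.`, `_` and `~` pass through unchanged, and a non-ASCII character such as `é` becomes one `%XX` triplet per UTF-8 octet.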
The URI generic syntax consists of five components organized hierarchically in order of decreasing significance from left to right:[1]: §3
URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
A component is undefined if it has an associated delimiter and the delimiter does not appear in the URI; the scheme and path components are always defined.[1]: §5.2.1 A component is empty if it has no characters; the scheme component is always non-empty.[1]: §3
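The distinction between an undefined and an empty component can be made concrete with the regular expression given in RFC 3986, Appendix B. The following is a sketch, with `split_uri` as a hypothetical helper name: an undefined component is reported as `None` and an empty one as `""`. The expression also accepts relative references, in which case the scheme itself is `None`.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(ref: str) -> dict:
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(ref)  # always matches, since every group is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),      # never None: the path is always defined
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://example.com?"))      # query is defined but empty
print(split_uri("mailto:user@example.com"))  # authority, query, fragment undefined
print(split_uri("file:///etc/hosts"))        # authority is defined but empty
```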
The authority component consists of subcomponents:
authority = [userinfo "@"] host [":" port]
This is represented in a syntax diagram as:
The URI comprises:
- A non-empty scheme component followed by a colon (:), consisting of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (+), period (.), or hyphen (-). Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters. Examples of popular schemes include http, https, ftp, mailto, file, data and irc. URI schemes should be registered with the Internet Assigned Numbers Authority (IANA), although non-registered schemes are used in practice.[20]
- An optional authority component preceded by two slashes (//), comprising:
  - An optional userinfo subcomponent followed by an at symbol (@), that may consist of a user name and an optional password preceded by a colon (:). Use of the format username:password in the userinfo subcomponent is deprecated for security reasons. Applications should not render as clear text any data after the first colon (:) found within a userinfo subcomponent unless the data after the colon is the empty string (indicating no password).
  - A host subcomponent, consisting of either a registered name (including but not limited to a hostname) or an IP address. IPv4 addresses must be in dot-decimal notation, and IPv6 addresses must be enclosed in brackets ([]).[1]: §3.2.2 [b]
  - An optional port subcomponent preceded by a colon (:), consisting of decimal digits.
- A path component, consisting of a sequence of path segments separated by a slash (/). A path is always defined for a URI, though the defined path may be empty (zero length). A segment may also be empty, resulting in two consecutive slashes (//) in the path component. A path component may resemble or map exactly to a file system path but does not always imply a relation to one. If an authority component is defined, then the path component must either be empty or begin with a slash (/). If an authority component is undefined, then the path cannot begin with an empty segment (that is, with two slashes, //), since the following characters would be interpreted as an authority component.[9]: §3.3
  - By convention, in http and https URIs, the last part of a path is named pathinfo and it is optional. It is composed of zero or more path segments that do not refer to an existing physical resource name (e.g. a file, an internal module program or an executable program) but to a logical part (e.g. a command or a qualifier part) that has to be passed separately to the first part of the path that identifies an executable module or program managed by a web server; this is often used to select dynamic content (a document, etc.) or to tailor it as requested (see also: CGI and PATH_INFO, etc.).
    - Example:
      - URI: "http://www.example.com/questions/3456/my-document"
      - where: "/questions" is the first part of the path (an executable module or program) and "/3456/my-document" is the second part of the path named pathinfo, which is passed to the executable module or program named "/questions" to select the requested document.
  - An http or https URI containing a pathinfo part without a query part may also be referred to as a 'clean URL,' whose last part may be a 'slug.'
- An optional query component preceded by a question mark (?), consisting of a query string of non-hierarchical data. Its syntax is not well defined, but by convention is most often a sequence of attribute–value pairs separated by a delimiter.
- An optional fragment component preceded by a hash (#). The fragment contains a fragment identifier providing direction to a secondary resource, such as a section heading in an article identified by the remainder of the URI. When the primary resource is an HTML document, the fragment is often an id attribute of a specific element, and web browsers will scroll this element into view.

| Query delimiter | Example |
|---|---|
| Ampersand (&) | key1=value1&key2=value2 |
| Semicolon (;)[c] | key1=value1;key2=value2 |
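The authority subcomponents and the two query delimiter conventions described above can be explored with Python's `urllib.parse`. This is a sketch only: the URI is a made-up example with an IPv6 literal host, and the `separator` argument of `parse_qsl` is assumed to be available (it was added in Python 3.9.2 and backported to some earlier security releases).

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical URI: userinfo, a bracketed IPv6 host, a port, a query and a fragment.
uri = "https://john.doe@[2001:db8::7]:8042/over/there?key1=value1&key2=value2#nose"
parts = urlsplit(uri)

print(parts.username)  # userinfo before the "@"
print(parts.hostname)  # IPv6 literal with the enclosing brackets removed
print(parts.port)      # port as an integer

# The same attribute-value pairs written with either delimiter convention.
pairs_amp = parse_qsl(parts.query)
pairs_semi = parse_qsl("key1=value1;key2=value2", separator=";")
print(pairs_amp == pairs_semi)
```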
The scheme- or implementation-specific reserved character + may be used in the scheme, userinfo, host, path, query, and fragment, and the scheme- or implementation-specific reserved characters !, $, &, ', (, ), *, ,, ;, and = may be used in the userinfo, host, path, query, and fragment. Additionally, the generic reserved character : may be used in the userinfo, path, query and fragment, the generic reserved characters @ and / may be used in the path, query and fragment, and the generic reserved character ? may be used in the query and fragment.[1]: §A
Example URIs
The following figure displays example URIs and their component parts.

DOIs (digital object identifiers) fit within the Handle System and fit within the URI system, as facilitated by appropriate syntax.
URI references
A URI reference is either a URI or a relative reference; it is a relative reference when it does not begin with a scheme component followed by a colon (:).[1]: §4.1 A path segment that contains a colon character (e.g., foo:bar) cannot be used as the first path segment of a relative reference if its path component does not begin with a slash (/), as it would be mistaken for a scheme component. Such a path segment must be preceded by a dot path segment (e.g., ./foo:bar).[1]: §4.2
Web document markup languages frequently use URI references to point to other resources, such as external documents or specific portions of the same logical document:[1]: §4.4
- in HTML, the value of the src attribute of the img element provides a URI reference, as does the value of the href attribute of the a or link element;
- in XML, the system identifier appearing after the SYSTEM keyword in a DTD is a fragmentless URI reference;
- in XSLT, the value of the href attribute of the xsl:import element/instruction is a URI reference; likewise the first argument to the document() function.

Examples of URI references:
- https://example.com/path/resource.txt#fragment
- //example.com/path/resource.txt
- /path/resource.txt
- path/resource.txt
- ../resource.txt
- ./resource.txt
- resource.txt
- #fragment
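The rule that separates a URI from a relative reference can be sketched as a test for a leading scheme followed by a colon. The pattern below is transcribed from the scheme syntax described earlier (a letter followed by letters, digits, +, . or -), and the function name `is_relative_reference` is hypothetical. It also shows why foo:bar has to be written as ./foo:bar to be read as a path.

```python
import re

# A scheme: one letter, then letters, digits, "+", "." or "-", then ":".
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

def is_relative_reference(ref: str) -> bool:
    """Return True when the reference does not start with 'scheme:'."""
    return SCHEME_PREFIX.match(ref) is None

examples = [
    "https://example.com/path/resource.txt#fragment",
    "//example.com/path/resource.txt",
    "/path/resource.txt",
    "../resource.txt",
    "#fragment",
    "foo:bar",    # read as scheme "foo", so not a relative reference
    "./foo:bar",  # the dot segment keeps the colon inside the path
]
for ref in examples:
    kind = "relative reference" if is_relative_reference(ref) else "URI"
    print(f"{ref!r}: {kind}")
```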
Resolution
Resolving a URI reference against a base URI results in a target URI. This implies that the base URI exists and is an absolute URI (a URI with no fragment component). The base URI can be obtained, in order of precedence, from:[1]: §5.1
- the reference URI itself if it is a URI;
- the content of the representation;
- the entity encapsulating the representation;
- the URI used for the actual retrieval of the representation;
- the context of the application.
Within a representation with a well defined base URI of
http://a/b/c/d;p?q
a relative reference is resolved to its target URI as follows:[1]: §5.4

| Relative reference | Target URI |
|---|---|
| "g:h" | "g:h" |
| "g" | "http://a/b/c/g" |
| "./g" | "http://a/b/c/g" |
| "g/" | "http://a/b/c/g/" |
| "/g" | "http://a/g" |
| "//g" | "http://g" |
| "?y" | "http://a/b/c/d;p?y" |
| "g?y" | "http://a/b/c/g?y" |
| "#s" | "http://a/b/c/d;p?q#s" |
| "g#s" | "http://a/b/c/g#s" |
| "g?y#s" | "http://a/b/c/g?y#s" |
| ";x" | "http://a/b/c/;x" |
| "g;x" | "http://a/b/c/g;x" |
| "g;x?y#s" | "http://a/b/c/g;x?y#s" |
| "" | "http://a/b/c/d;p?q" |
| "." | "http://a/b/c/" |
| "./" | "http://a/b/c/" |
| ".." | "http://a/b/" |
| "../" | "http://a/b/" |
| "../g" | "http://a/b/g" |
| "../.." | "http://a/" |
| "../../" | "http://a/" |
| "../../g" | "http://a/g" |
URL munging
URL munging is a technique by which a command is appended to a URL, usually at the end, after a "?" token. It is commonly used in WebDAV as a mechanism of adding functionality to HTTP. In a versioning system, for example, to add a "checkout" command to a URL, it is written as http://editing.com/resource/file.php?command=checkout. This approach is easy for CGI parsers to handle, and in this case it also acts as an intermediary between HTTP and the underlying resource.[24]
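A minimal sketch of the idea, using Python's `urllib.parse`: the helper `add_command` is hypothetical and simply appends a `command` parameter to whatever query the URL already has, after which a server-side parser can read it back.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def add_command(url: str, command: str) -> str:
    """Append a 'command' parameter to the query component of a URL."""
    parts = urlsplit(url)
    extra = urlencode({"command": command})
    # Keep any existing query and join the new pair with "&".
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit(parts._replace(query=query))

munged = add_command("http://editing.com/resource/file.php", "checkout")
print(munged)

# What a CGI-style parser would recover from the query string.
print(parse_qs(urlsplit(munged).query))
```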
Relation to XML namespaces
In XML, a namespace is an abstract domain to which a collection of element and attribute names can be assigned. The namespace name is a character string which must adhere to the generic URI syntax.[25] However, the name is generally not considered to be a URI,[26] because the URI specification bases the decision not only on lexical components, but also on their intended use. A namespace name does not necessarily imply any of the semantics of URI schemes; for example, a namespace name beginning with http: may have no connection to the use of HTTP.
Originally, the namespace name could match the syntax of any non-empty URI reference, but the use of relative URI references was deprecated by the W3C.[27] A separate W3C specification for namespaces in XML 1.1 permits Internationalized Resource Identifier (IRI) references to serve as the basis for namespace names in addition to URI references.[28]
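As a small illustration that a namespace name behaves as an opaque string and not as something to be retrieved, the sketch below parses a made-up document with Python's `xml.etree.ElementTree`. The namespace name http://example.com/ns is hypothetical; parsing does not contact that host, and the name is simply embedded in each element's tag and compared character by character.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the namespace name only looks like a resolvable URL.
doc = ET.fromstring('<root xmlns="http://example.com/ns"><item/></root>')

print(doc.tag)     # the namespace name is carried inside the tag
print(doc[0].tag)

# Lookup is an exact string match, so a differently cased name finds nothing.
found = doc.find("{http://example.com/ns}item")
not_found = doc.find("{HTTP://EXAMPLE.COM/ns}item")
print(found is not None, not_found is None)
```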
See also
- CURIE
- Linked data
- Extensible Resource Identifier
- Internationalized Resource Identifier (IRI)
- Internet resource locator
- Persistent uniform resource locator
- Uniform Naming Convention
- Resource Directory Description Language
- Universally unique identifier
- List of URI schemes
- Resource Description Framework
Notes
- [a] A report published in 2002 by a joint W3C/IETF working group aimed to normalize the divergent views held within the IETF and W3C over the relationship between the various 'UR*' terms and standards. While not published as a full standard by either organization, it has become the basis for the above common understanding and has informed many standards since then.
- [b] For URIs relating to resources on the World Wide Web, some web browsers allow .0 portions of dot-decimal notation to be dropped or raw integer IP addresses to be used.[21]
- [c] Historic RFC 1866 (obsoleted by RFC 2854[22]) encourages CGI authors to support ';' in addition to '&'.[23]: §8.2.1
References
- [1] T. Berners-Lee; R. Fielding; L. Masinter (January 2005). Uniform Resource Identifier (URI): Generic Syntax. Network Working Group. doi:10.17487/RFC3986. STD 66. RFC 3986. Internet Standard 66. Obsoletes RFC 2732, 2396 and 1808. Updated by RFC 6874, 7320 and 8820. Updates RFC 1738.
- [2] Palmer, Sean. "The Early History of HTML". infomesh.net. Retrieved 2020-12-06.
- [3] "W3 Naming Schemes". W3C. 1992-02-24. Retrieved 2020-12-06.
- [4] "Proceedings of the Twenty-Fourth Internet Engineering Task Force" (PDF). IETF. Corporation for National Research Initiatives. July 1992. p. 193. Retrieved 2021-07-27.
- [5] "Proceedings of the Twenty-Fifth Internet Engineering Task Force" (PDF). IETF. Corporation for National Research Initiatives. November 1992. p. 501. Retrieved 2021-07-27.
- [6] Berners-Lee, Tim (June 1994). Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web. Network Working Group. doi:10.17487/RFC1630. RFC 1630. Informational.
- [7] T. Berners-Lee; L. Masinter; M. McCahill (December 1994). Uniform Resource Locators (URL). Network Working Group. doi:10.17487/RFC1738. RFC 1738. Obsolete. Obsoleted by RFC 4248 and 4266. Updated by RFC 1808, 2368, 2396, 3986, 6196, 6270 and 8089.
- [8] R. Moats (May 1997). P. Vixie (ed.). URN Syntax. IETF Network Working Group. doi:10.17487/RFC2141. RFC 2141. Proposed Standard. Obsoleted by RFC 8141.
- [9] T. Berners-Lee; R. Fielding; L. Masinter (August 1998). Uniform Resource Identifiers (URI): Generic Syntax. Network Working Group. doi:10.17487/RFC2396. RFC 2396. Obsolete. Obsoleted by RFC 3986. Updated by RFC 2732. Updates RFC 1808 and 1738.
- [10] R. Hinden; B. Carpenter; L. Masinter (December 1999). Format for Literal IPv6 Addresses in URL's. Network Working Group. doi:10.17487/RFC2732. RFC 2732. Obsolete. Obsoleted by RFC 3986.
- [11] R. Fielding; J. Gettys; J. Mogul; H. Frystyk; L. Masinter; P. Leach; T. Berners-Lee (August 1999). Hypertext Transfer Protocol -- HTTP/1.1. Network Working Group. doi:10.17487/RFC2616. RFC 2616. Obsolete. Obsoleted by RFC 7230, 7231, 7232, 7233, 7234 and 7235. Obsoletes RFC 2068. Updated by RFC 2817, 5785, 6266 and 6585.
- [12] Raman, T.V. (2006-11-01). "On Linking Alternative Representations To Enable Discovery And Publishing". W3C. Retrieved 2020-12-06.
- [13] Mealling, Michael H.; Denenberg, Ray (August 2002). Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs, and Uniform Resource Names (URNs): Clarifications and Recommendations. Network Working Group. doi:10.17487/RFC3305. RFC 3305. Informational.
- [14] Fielding, Roy (2005-06-18). "[httpRange-14] Resolved". W3C Public mailing list archives. Retrieved 2020-12-06.
- [15] Ayers, Danny; Völkel, Max (2008-12-03). Sauermann, Leo; Cyganiak, Richard (eds.). "Cool URIs for the Semantic Web". W3C. Retrieved 2020-12-06.
- [16] URI Planning Interest Group, W3C/IETF (September 2001). "URIs, URLs, and URNs: Clarifications and Recommendations 1.0". www.w3.org. W3C/IETF. Retrieved 2020-12-08.
- [17] "6.3. URL APIs elsewhere". URL Standard. 2025-05-12.
- [18] "URL Standard: Goals".
- [19] Berners-Lee, Tim; Fielding, Roy T.; Masinter, Larry 2005, p. 46; "9. Acknowledgements"
- [20] Hansen, Tony; Hardie, Ted (June 2015). Thaler, Dave (ed.). Guidelines and Registration Procedures for URI Schemes. Internet Engineering Task Force. doi:10.17487/RFC7595. ISSN 2070-1721. BCP 35. RFC 7595. Best Current Practice 35. Updated by RFC 8615. Obsoletes RFC 4395.
- [21] Lawrence (2014).
- [22] D. Connolly; L. Masinter (June 2000). The 'text/html' Media Type. Network Working Group. doi:10.17487/RFC2854. RFC 2854. Informational / Legacy. Obsoletes RFC 1980, 1867, 1942, 1866 and 2070. Not endorsed by the IETF.
- [23] Berners-Lee, Tim; Connolly, Daniel W. (November 1995). Hypertext Markup Language - 2.0. Network Working Group. doi:10.17487/RFC1866. RFC 1866. Historic. Obsoleted by RFC 2854.
- [24] Whitehead 1998, p. 38.
- [25] Morrison (2006).
- [26] Harold (2004).
- [27] W3C (2009).
- [28] W3C (2006).
Works cited
- Bray, Tim; Hollander, Dave; Layman, Andrew; Tobin, Richard, eds. (2006-08-16). "Namespaces in XML 1.1 (Second Edition)". World Wide Web Consortium. 2.2 Use of URIs as Namespace Names. Retrieved 2015-08-31.
- Bray, Tim; Hollander, Dave; Layman, Andrew; Tobin, Richard; Thompson, Henry S., eds. (2009-12-08). "Namespaces in XML 1.0 (Third Edition)". World Wide Web Consortium. 2.2 Use of URIs as Namespace Names. Retrieved 2015-08-31.
- Harold, Elliotte Rusty (2004). XML 1.1 Bible (Third ed.). Wiley Publishing. p. 291. ISBN 978-0-7645-4986-1.
- Lawrence, Eric (2014-03-06). "Browser Arcana: IP Literals in URLs". IEInternals. Microsoft. Retrieved 2016-04-25.
- Morrison, Michael Wayne (2006). "Hour 5: Putting Namespaces to Use". Sams Teach Yourself XML. Sams Publishing. p. 91.
- Whitehead, E.J (1998). "WebDAV: IETF standard for collaborative authoring on the Web". IEEE Internet Computing. 2 (5): 34–40. doi:10.1109/4236.722228. ISSN 1941-0131.
Further reading
- URI Planning Interest Group, W3C/IETF (2001-09-21). "URIs, URLs, and URNs: Clarifications and Recommendations 1.0". Retrieved 2009-07-27.
- "On Linking Alternative Representations To Enable Discovery And Publishing". World Wide Web Consortium. 2006 [2001]. Retrieved 2012-04-03.
External links
- URI Schemes – IANA-maintained registry of URI Schemes
- URI schemes on the W3C wiki
- Architecture of the World Wide Web, Volume One, §2: Identification – by W3C
- W3C URI Clarification
Uniform Resource Identifier
Fundamentals
Definition and Purpose
A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.[2] This standardized string enables the unique referencing of entities such as documents, services, or concepts within networked systems, without necessarily implying direct access or location.[2] The primary purpose of a URI is to facilitate interoperability across diverse information systems by providing a simple, universal mechanism for naming and referencing resources unambiguously.[3] It supports a federated naming approach, allowing different protocols and schemes to coexist while ensuring consistent identification.[4] Key characteristics include its compactness, which aids in easy transcription and memorization; extensibility, permitting scheme-specific extensions without disrupting the overall framework; and scheme-based identification, where a leading scheme (e.g., "http") dictates the syntax and semantics for the remainder of the identifier.[5][4] URIs originated in the early 1990s to address naming inconsistencies arising from the proliferation of protocols and systems for document retrieval on the nascent internet.[6] For instance, the URI "http://example.com" identifies a specific web resource, distinguishing it from other identifiers by its scheme and path components.[2] While URIs form the basis for subtypes like Uniform Resource Locators (URLs) and Uniform Resource Names (URNs), they provide a general framework for resource identification.[7]
Components and Syntax Overview
A Uniform Resource Identifier (URI) follows a generic syntax that structures its components to enable uniform identification of resources across different schemes. The overall form is scheme : hier-part [ ? query ] [ # fragment ], where the scheme specifies the protocol or naming system, the hierarchical part often includes an authority and path, and optional query and fragment components provide additional data or references. This syntax ensures interoperability by defining how components delimit and encode information.[8]
The scheme component identifies the URI's naming or protocol scheme, such as http or mailto, and consists of a sequence starting with an alphabetic character followed by alphanumeric characters, plus, period, or hyphen. It is followed by a colon (:) and determines how the rest of the URI is interpreted. The authority component, when present, begins with two slashes (//) and represents a hierarchical addressing authority; it includes an optional userinfo (credentials like username and password, in the form userinfo@), a required host (domain name, IP address, or literal), and an optional port (a decimal number for service identification). For example, in example.com:8080, example.com is the host and 8080 the port. The path follows the authority (or scheme if no authority) and denotes the resource's hierarchical location, composed of segments separated by slashes (/), such as /documents/file.txt. The optional query component, introduced by a question mark (?), carries non-hierarchical parameters in key-value pairs, like key=value&other=param. Finally, the fragment identifier, starting with a hash (#), points to a secondary resource or internal section within the primary resource, such as #summary.[8][9]
The generic syntax is formally defined using Augmented Backus-Naur Form (ABNF) in RFC 3986. A simplified excerpt of the ABNF grammar for a URI is as follows:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty
authority = [ userinfo "@" ] host [ ":" port ]
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
pchar represents path characters, including unreserved, percent-encoded, sub-delims, :, and @. This notation specifies the allowable structure and characters for each part.[8][10]
URIs distinguish between reserved and unreserved characters to separate delimiters from data. Reserved characters include generic delimiters like : / ? # [ ] @ and sub-delimiters like ! $ & ' ( ) * + , ; =, which may have special meanings in certain components and must be percent-encoded if used as data. Unreserved characters, such as alphanumeric letters, digits, hyphen (-), period (.), underscore (_), and tilde (~), can appear without encoding. Percent-encoding represents characters outside this set (or reserved ones used as data) as a percent sign (%) followed by two hexadecimal digits, e.g., space as %20 or non-ASCII characters via UTF-8 octet sequences. This ensures safe transmission across systems, with encoded forms equivalent to their decoded counterparts when unreserved.[11][12][13]
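The equivalence between a percent-encoded unreserved character and its literal form suggests a simple normalization step. The following sketch decodes only those percent-encoded octets that correspond to unreserved characters and uppercases the hexadecimal digits of the rest. The function name `normalize_pct` is hypothetical, and this is an illustration of the idea, not a full URI normalizer.

```python
import re
import string

# Unreserved characters: letters, digits, hyphen, period, underscore, tilde.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_pct(uri: str) -> str:
    """Decode percent-encoded unreserved characters; uppercase other triplets."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                      # e.g. %7E becomes ~
        return "%" + match.group(1).upper()  # e.g. %2f becomes %2F
    return PCT_TRIPLET.sub(fix, uri)

print(normalize_pct("http://example.com/%7Euser/a%2fb%20c"))
```

Reserved characters such as `/` stay encoded, because decoding them could change how the URI is split into components.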
To illustrate, consider the URI https://user:pass@example.com:8080/path?key=value#section:
- Scheme: https (specifies secure HTTP protocol).[4]
- Authority: user:pass@example.com:8080 (userinfo user:pass, host example.com, port 8080).[9]
- Path: /path (hierarchical resource location).[14]
- Query: key=value (parameters for the request).[15]
- Fragment: section (internal reference within the resource).[16]
Any space within a component would be percent-encoded as %20 to comply with syntax rules.[11]
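The same breakdown can be reproduced with Python's `urllib.parse.urlsplit`. This sketch uses the example URI with userinfo user:pass, host example.com and port 8080 as listed above, and the dictionary keys are chosen to mirror the component names.

```python
from urllib.parse import urlsplit

uri = "https://user:pass@example.com:8080/path?key=value#section"
parts = urlsplit(uri)

components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,  # userinfo, host and port together
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}
for name, value in components.items():
    print(f"{name:9} {value}")

# For this input, reassembling the parts gives back the original string.
print(parts.geturl() == uri)
```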
History
Conception
The foundational concepts for Uniform Resource Identifiers (URIs), including the addressing system now known as URLs, were developed by Tim Berners-Lee in late 1990 as part of his implementation of the first World Wide Web prototype at CERN.[17] This work proposed a unified naming system to reference resources across the internet, influenced by hierarchical naming conventions in earlier systems such as the X.500 directory services and the Domain Name System (DNS).[18] The public conception of the URI syntax emerged in early 1992 through Berners-Lee's Universal Document Identifier (UDI) proposal, which outlined a generic structure to address the growing need for consistent resource referencing.[19] The primary motivation was the fragmentation in internet addressing schemes during the early 1990s, hindering seamless hypertext linking in the World Wide Web project. Protocols like FTP, Gopher, WAIS, and news groups employed incompatible formats, such as FTP's host-relative paths versus Gopher's menu-based selectors, creating barriers to a cohesive "information universe."[19] The UDI addressed these by introducing a canonical, scheme-based syntax that abstracted protocol-specific details, exemplified by file://info.cern.ch/pub/www/doc/udi1.ps.[19] This enabled dynamic linking regardless of retrieval mechanisms, fostering interoperability.[20]
Key early documents include the February 1992 UDI draft, which solicited feedback and highlighted integrations with WAIS and X.500, and the November 1992 HTTP draft, which embedded URI-like addressing for hypertext retrieval.[19][21] By March 1992, at an IETF BOF, these ideas had evolved into foundational web proposals, with UDI serving as the basis for unified naming across protocols.[20]
Standardization and Evolution
The standardization of Uniform Resource Identifiers (URIs) began with RFC 1630, published in June 1994 by Tim Berners-Lee, which provided an informal definition of URI syntax and its role in enabling a global information infrastructure.[22] This document outlined the basic structure of URIs, including schemes, hierarchical components, and the use of percent-encoding for non-ASCII characters, laying the groundwork for uniform naming and addressing on the World Wide Web without enforcing strict parsing rules.
A significant refinement came with RFC 2396 in August 1998, authored by Tim Berners-Lee, Roy Fielding, and Larry Masinter, which introduced a more precise syntax specification and formalized the handling of relative URI references. This update addressed ambiguities in the original syntax, defined equivalence rules for URI comparison, and emphasized the separation of scheme-specific processing, making URIs more robust for internet protocols. The IETF URI Working Group, which first met in November 1992, played a central role in the early stages of this work, coordinating input from the broader internet community to ensure interoperability.
The current standard, RFC 3986 from January 2005, also authored by Fielding, Masinter, and Berners-Lee, obsoleted RFC 2396 and provided a comprehensive, ABNF-based syntax definition with enhanced clarity on internationalization aspects, such as reserved characters and fragment identifiers. This revision incorporated lessons from widespread URI deployment, including better support for secure schemes and normalization procedures to reduce variant representations.
URI evolution has continued through integrations with related protocols, such as RFC 7230 (June 2014), which defines HTTP/1.1 message syntax and routing and specifies how URIs are processed in HTTP messages, ensuring consistency in web transfers.[23] Additionally, RFC 6874 (February 2013) extended the bracketed IPv6 literal notation of the host component to carry zone identifiers, accommodating modern networking needs.[24] Support for internationalization advanced with RFC 3987 (January 2005), which introduced Internationalized Resource Identifiers (IRIs) as a superset of URIs, allowing Unicode characters in international contexts while maintaining compatibility through UTF-8 encoding and mapping rules.[25] In the 2020s, discussions within IETF and W3C working groups have explored URI adaptations for decentralized systems, such as Decentralized Identifiers (DIDs), a W3C Recommendation since July 2022, which leverage URI syntax for self-sovereign identity without central authority.[26]
Key contributions to URI standardization stem from Tim Berners-Lee's foundational vision, Roy Fielding's architectural refinements in his dissertation and RFCs, and collaborative efforts by IETF working groups like URI and APPSAWG, which have sustained updates amid evolving web technologies.
URI Structure
General Syntax
The general syntax of a Uniform Resource Identifier (URI) is formally defined in RFC 3986 as URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ], where the scheme identifies the URI's namespace and syntax rules, the hierarchical part provides the location or name, the query adds parameters, and the fragment identifies a secondary resource within the primary one.[1]
This syntax is specified using Augmented Backus-Naur Form (ABNF) grammar, which outlines the production rules for each component. The complete relevant ABNF for URI production is as follows:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
authority = [ userinfo "@" ] host [ ":" port ]
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
The grammar relies on the core ABNF rules, with HEXDIG representing hexadecimal digits (0-9, A-F, a-f) and ALPHA and DIGIT denoting standard alphabetic and numeric characters.[10]
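As a rough companion to the grammar, the sketch below uses the non-validating regular expression that RFC 3986 gives in its Appendix B for splitting a reference into the five generic components. The function name `split_reference` is an illustrative choice; the expression only separates components and does not check them against the ABNF.

```python
import re

# Component-splitting expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref: str) -> dict:
    """Split a URI reference into its generic components (None = absent)."""
    m = URI_REFERENCE.match(ref)  # every sub-pattern is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

An absent component is reported as `None`, which keeps it distinct from a component that is present but empty.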
URI references are classified as absolute or relative based on the presence of a scheme. An absolute URI begins with a scheme followed by a colon and includes a hierarchical part, as in absolute-URI = scheme ":" hier-part [ "?" query ], providing a complete reference independent of context. In contrast, a relative reference lacks a scheme and is resolved against a base URI; its form is relative-ref = relative-part [ "?" query ] [ "#" fragment ], where the relative part can be "//" authority path-abempty (network-path), path-absolute (starting with a single "/"), path-noscheme (no leading "/" and no colon in the first segment, so that the segment cannot be mistaken for a scheme), or path-empty. Path-absolute forms denote an absolute path from the root, while path-noscheme forms, like "resource" without a leading slash, indicate a relative path starting from the current level.[27]
Percent-encoding represents characters that fall outside the unreserved set (ALPHA, DIGIT, "-", ".", "_", "~"), as well as characters in the reserved set (gen-delims: ":", "/", "?", "#", "[", "]", "@"; sub-delims: "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "=") when they appear as data rather than delimiters. Non-ASCII characters are first encoded into UTF-8 octet sequences, then each octet is percent-encoded as "%" followed by two hexadecimal digits, conventionally uppercase; for example, the space character (U+0020) becomes "%20", the forward slash "/" (when used as data) becomes "%2F", and the question mark "?" becomes "%3F". Reserved characters must be encoded if their literal interpretation would alter parsing, such as encoding "/" in a path segment to prevent it from being treated as a delimiter.[11]
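The encoding rules above can be tried out with `urllib.parse.quote` and `unquote` from Python's standard library. This is a small sketch; `encode_as_data` is a hypothetical helper name, and passing `safe=""` is an assumption that every reserved character in the input is meant as data.

```python
from urllib.parse import quote, unquote


def encode_as_data(text: str) -> str:
    """Percent-encode everything except unreserved characters."""
    # safe="" means no reserved character is exempt; UTF-8 is quote()'s default.
    return quote(text, safe="")


print(encode_as_data("a/b c"))       # slash and space are both encoded
print(encode_as_data("é"))           # non-ASCII goes through UTF-8 first
print(encode_as_data("a-b.c_d~e"))   # unreserved characters pass through
print(unquote("%E3%83%9A%E3%83%BC%E3%82%B8"))
```

The tilde is left unencoded on Python 3.7 and later, where `quote` follows the RFC 3986 unreserved set.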
Invalid URIs violate these syntax rules and may lead to parsing failures or security issues. Common errors include unencoded spaces, which are not allowed in any component and must be percent-encoded as "%20"; mismatched brackets, such as an unclosed "[" or "]" in the authority (e.g., in IPv6 addresses), rendering the URI syntactically invalid; or malformed percent-encoding, such as a "%" that is not followed by two hexadecimal digits (lowercase hexadecimal digits are valid, although normalization converts them to uppercase). Implementations should reject or normalize such cases to ensure interoperability, as older systems might mishandle sequences like "/../" in queries as path traversals.[8]
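A few of the errors listed above can be detected with simple string checks. The sketch below is deliberately partial and is not a validator for the full grammar; the function name `syntax_problems` and the wording of its messages are illustrative assumptions.

```python
import re

# A "%" that is not followed by exactly two hexadecimal digits.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def syntax_problems(uri: str) -> list[str]:
    """Report a few common syntax errors; an empty list does not prove validity."""
    problems = []
    if " " in uri:
        problems.append("unencoded space")
    if BAD_ESCAPE.search(uri):
        problems.append("malformed percent-encoding")
    if uri.count("[") != uri.count("]"):
        problems.append("mismatched brackets")
    return problems


print(syntax_problems("http://example.com/a b"))
print(syntax_problems("http://example.com/a%2fb"))
```

Lowercase hexadecimal digits pass the check, since they are valid and only differ from the normalized form.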
Scheme-Specific Elements
The scheme component of a URI serves as the initial identifier that specifies the protocol, namespace, or access method for the resource, enabling a federated and extensible naming system across different applications and environments.[28] Each URI scheme defines its own syntax and semantics, which may impose restrictions or extensions on the generic URI structure defined in RFC 3986, while adhering to the overall absolute-URI grammar.[29] For instance, schemes like "http" indicate the Hypertext Transfer Protocol, while "urn" denotes a Uniform Resource Name namespace for persistent identifiers.[30]
Common schemes exhibit distinct syntactic requirements. The "http" scheme mandates an authority component with a host and optional port, where the host identifies the target server and the port defaults to 80 if omitted; for example, http://example.com is equivalent to http://example.com:80.[30] In contrast, the "file" scheme primarily uses a path-only structure for local file access, such as file:/etc/hosts. Its form is platform-dependent (POSIX systems start with a root slash, while Windows supports drive letters like file:c:/path/file.txt), and an empty or "localhost" authority is treated as referring to the local host.[31] The "data" scheme embeds inline data directly, following the syntax data:[<mediatype>][;base64],<data>, where the media type (defaulting to text/plain;charset=US-ASCII) specifies the content format, and the data is either URL-encoded or base64-encoded; an example is data:text/plain;base64,SGVsbG8gd29ybGQ=.[32] The "mailto" scheme, used for email addresses, consists of an email address optionally followed by headers in the query-like portion, such as mailto:user@example.com?subject=Hello.[33]
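To make the "data" scheme syntax concrete, the following sketch decodes a data URI into its media type and payload. The helper name `decode_data_uri` is hypothetical, and the parsing is simplified: it assumes the `;base64` marker, if present, is the last item before the comma.

```python
import base64
from urllib.parse import unquote_to_bytes

DEFAULT_MEDIA_TYPE = "text/plain;charset=US-ASCII"


def decode_data_uri(uri: str) -> tuple[str, bytes]:
    """Return (media type, decoded bytes) for a data: URI."""
    if uri[:5].lower() != "data:":
        raise ValueError("not a data URI")
    # Everything before the first comma describes the payload that follows it.
    header, comma, payload = uri[5:].partition(",")
    if not comma:
        raise ValueError("data URI has no comma separator")
    is_base64 = header.endswith(";base64")
    media_type = header[: -len(";base64")] if is_base64 else header
    data = base64.b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    return media_type or DEFAULT_MEDIA_TYPE, data


print(decode_data_uri("data:text/plain;base64,SGVsbG8gd29ybGQ="))
```

When the media type is omitted, the sketch falls back to the default stated in the text.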
Authority components vary by scheme, reflecting security and usability considerations. In the "http" scheme, the userinfo subcomponent (e.g., username:password@host) is deprecated due to risks of exposing credentials in logs or referrals, and implementations should treat its presence as an error.[34] Port defaults are scheme-specific: 80 for "http", 443 for "https", and none for schemes like "file" that do not use network authorities.[30] Schemes without an authority, such as "data" or "mailto", omit the double slash (//) and proceed directly to the path or data.[32][33]
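The scheme-specific port defaults can be expressed as a small lookup. This is an illustrative sketch: the `DEFAULT_PORTS` table covers only the two schemes named above, and `effective_port` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Only the defaults mentioned in the text; other schemes are left out on purpose.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(uri: str):
    """Return the explicit port if given, else the scheme default, else None."""
    parts = urlsplit(uri)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://example.com"))
print(effective_port("http://example.com:8080/"))
print(effective_port("file:///etc/hosts"))
```

Because `urlsplit` lowercases the scheme, the lookup also works for references written as `HTTP://...`.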
Query and fragment handling also adapts to scheme semantics. For "http", the query component (?key=value) carries non-hierarchical parameters for resource selection, such as http://example.com/search?q=uri, while the fragment (#anchor) identifies a secondary resource or location within the primary one, like a document section, processed client-side without server transmission.[30] In the "file" scheme, queries are not used, and fragments may reference byte ranges or other file-specific anchors if supported by the implementation.[35] The "data" scheme treats any post-comma content as opaque data without separate query or fragment support, though fragments can be appended for media-type-specific dereferencing.[36] For "mailto", the query-like part holds email headers (e.g., subject=Hello in mailto:user@example.com?subject=Hello), but true fragments are not defined.[37]
URI schemes are registered with the Internet Assigned Numbers Authority (IANA) to ensure uniqueness and interoperability, following procedures outlined in BCP 35 (RFC 7595) for expert review or first-come-first-served allocation.[38] The registry includes permanent, provisional, and historical entries, with 349 schemes documented as of November 2025. Common registered schemes encompass ftp (File Transfer Protocol), ldap (Lightweight Directory Access Protocol), tel (Telephone), coap (Constrained Application Protocol), sip (Session Initiation Protocol), and the examples noted above, each referencing a defining RFC for precise syntax.[38]
URI Variants
Uniform Resource Locators (URLs)
A Uniform Resource Locator (URL) is a Uniform Resource Identifier (URI) that not only identifies a resource but also provides a specific mechanism for locating and accessing it, typically over a network such as the Internet.[39] Unlike more general URIs, URLs incorporate scheme-specific details that enable retrieval, such as network protocols like HTTP or FTP.[40] This focus on location makes URLs essential for web addressing and resource fetching in distributed systems.[39]
The URL was specified in RFC 1738, published in December 1994, which formalized the syntax and semantics for locating resources available via the Internet as part of the World Wide Web initiative.[40] This specification built on earlier concepts from RFC 1630 and established URLs as compact string representations for Internet-accessible resources.[40] Over time, URLs have become synonymous with web addresses, evolving alongside web technologies while maintaining their core role in resource location.[39]
In terms of structure, URLs for network schemes, such as those using HTTP or HTTPS, require a mandatory authority component, which includes the host (e.g., a domain name or IP address) and optionally a port and user information, prefixed by "//".[39] This is followed by a path that specifies the resource within the host, along with optional query parameters for additional data and a fragment identifier for intra-document navigation.[40] The general form adheres to the URI syntax but emphasizes locatability through the scheme's access method.[39] For example, consider the URL https://www.example.com/page?query=1#fragment:
- Scheme: https indicates a secure HTTP connection.[40]
- Authority: www.example.com specifies the host.[39]
- Path: /page identifies the resource.[39]
- Query: ?query=1 passes parameters to the resource.[40]
- Fragment: #fragment targets a section within the resource.[39]
This breakdown illustrates how URLs encode both location and access details hierarchically.[40]
Uniform Resource Names (URNs)
A Uniform Resource Name (URN) is a Uniform Resource Identifier (URI) that uses the "urn" scheme to provide a persistent, location-independent name for a resource.[43] Originally specified in 1997, URNs serve as abstract identifiers that remain stable over time, enabling the naming of entities such as documents, books, or individuals without reference to their current location.[44] Unlike locators, URNs focus on identification rather than retrieval, supporting long-term reference in systems where resources may migrate or change access points.[43]
The syntax of a URN follows the form urn:<NID>:<NSS>, where <NID> is the Namespace Identifier (a registered string of alphanumeric characters and hyphens that defines the naming authority) and <NSS> is the Namespace-Specific String, which carries the unique identifier within that namespace.[44] The <NID> is case-insensitive and limited to 1-32 characters, while the <NSS> may include percent-encoded characters to handle reserved or non-ASCII data.[43] This structure ensures global uniqueness and compatibility with URI parsing rules. For instance, urn:isbn:0-306-40615-2 identifies a specific book using the ISBN namespace.[44]
Namespace Identifiers (NIDs) are formally registered with the Internet Assigned Numbers Authority (IANA) to prevent collisions and maintain interoperability; examples include "isbn" for International Standard Book Numbers and "oid" for Object Identifiers used in standards like ASN.1.[45] Registration follows an expert review process outlined in RFC 8141, ensuring each namespace has a defined assignment and resolution policy.[43] URNs can be resolved through dedicated resolvers that map the identifier to metadata, alternative representations, or locators as per the namespace's rules.[43]
Common examples illustrate URN applications: urn:ietf:rfc:2141 names the original URN syntax document itself, providing a stable reference for IETF standards, while namespaces like "mpeg" enable URNs for multimedia objects, such as urn:mpeg:url:abc123 for an MPEG-encoded resource.[44][45] These demonstrate how URNs support diverse, enduring naming needs across digital ecosystems.
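The urn:<NID>:<NSS> form described in this section can be taken apart with plain string operations. The sketch below is a simplification that does not enforce the character or length limits on the NID; `parse_urn` is a hypothetical helper name.

```python
def parse_urn(urn: str) -> tuple[str, str]:
    """Split a URN into (NID, NSS); the NID is lowercased because it is case-insensitive."""
    scheme, _, rest = urn.partition(":")
    nid, colon, nss = rest.partition(":")
    if scheme.lower() != "urn" or not colon or not nid or not nss:
        raise ValueError(f"not a URN of the form urn:<NID>:<NSS>: {urn!r}")
    # The NSS is returned unchanged: its interpretation belongs to the namespace.
    return nid.lower(), nss


print(parse_urn("urn:isbn:0-306-40615-2"))
print(parse_urn("urn:ietf:rfc:2141"))
```

Only the first two colons are treated as delimiters, so any further colons stay inside the NSS.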
References and Resolution
URI References
A URI reference is a string that can represent either an absolute URI, a relative reference, or an empty string, serving as a compact means to identify resources relative to a base URI.[46] This form allows for flexible referencing in documents and protocols without requiring full absolute paths.[46]
Relative references follow the syntax relative-ref = relative-part [ "?" query ] [ "#" fragment ], where the relative-part can be a network-path (starting with "//"), an absolute-path (starting with "/"), a rootless path (starting with a segment but no "/"), or an empty path.[47] Absolute paths begin with a slash and denote a path from the root, rootless paths start directly with a non-empty segment for subdirectories, and empty paths indicate the base URI itself without modification.[14] The optional query and fragment components append parameters or internal anchors as in absolute URIs.[47]
To resolve a relative reference into an absolute URI, the process merges it with a base URI through a defined algorithm.[48] First, the base URI is parsed into its components: scheme, authority, path, query, and fragment.[49] If the relative reference includes a scheme, it is treated as absolute; otherwise, the base scheme and authority are retained unless the reference starts with "//", in which case the reference's authority replaces the base's.[50] Paths are then merged by appending the relative path to the base path (after removing the last segment if necessary) and resolving dot-segments: "." represents the current directory and is removed, while ".." ascends to the parent directory, with an input buffer and an output buffer used to process these iteratively.[51][52] The base's query is inherited only when the reference has an empty path and no query of its own; otherwise the reference's query is used, and the fragment always comes from the reference.[50]
For example, given a base URI of http://a/b/c/d;p?q, the relative reference g resolves to http://a/b/c/g by appending to the base path; ../g resolves to http://a/b/g by removing the last two segments before appending; and /g resolves to http://a/g by replacing the entire path.[53] Another common case is ./image.jpg relative to http://example.com/dir/, which resolves to http://example.com/dir/image.jpg after removing the "." segment.[52]
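These resolution examples can be checked with `urllib.parse.urljoin` from Python's standard library, which follows the RFC 3986 resolution rules for cases like these. The sketch only reruns the examples from the text; the `BASE` and `resolved` names are illustrative.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# Resolve each relative reference from the text against the base URI.
resolved = {ref: urljoin(BASE, ref) for ref in ("g", "../g", "/g")}
for ref, target in resolved.items():
    print(f"{ref:5} -> {target}")

# The "." segment is dropped during dot-segment removal.
print(urljoin("http://example.com/dir/", "./image.jpg"))
```

The trailing slash on a base such as `http://example.com/dir/` matters: without it, the last segment would be replaced rather than extended.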
URI references are widely used in markup languages for hyperlinks and resource inclusion, such as in HTML's <a href=""> and <img src=""> attributes, where they resolve against the document's base URI set by the <base> element. In XML, the xml:base attribute establishes a base URI for resolving relative references within elements, processing instructions, or entity content.[54] This enables modular document structures, like linking to local images or stylesheets without absolute paths.[54]
Resolution Mechanisms
Resolution of a Uniform Resource Identifier (URI) refers to the process of mapping the identifier to the corresponding resource through dereferencing, which involves determining the access mechanism and parameters based on the URI's scheme and components.[2] This mechanism enables applications to locate and interact with resources without requiring prior knowledge of their exact representation or location.[55]
The resolution process begins with parsing the URI into its components: scheme, authority (including host and port), path, query, and fragment, as defined by the generic syntax.[8] The scheme dictates the protocol or handler to use, such as TCP/IP for hierarchical schemes.[4] Next, the authority component is contacted: for hostnames, this typically involves Domain Name System (DNS) resolution to obtain an IP address, followed by establishing a connection to the specified port (defaulting to scheme-specific values, like port 80 for HTTP).[56] The path and query components then guide the request to the specific resource within the authority's namespace.[14]
Delegation in URI resolution allows hierarchical administration of the namespace, where the authority component enables a central registry to assign sub-namespaces to delegated entities.[9] For instance, in schemes using registered names, DNS provides a distributed delegation model, resolving hostnames through a tree of authoritative servers.[56] This structure supports scalable resource location without a single point of control.
In the HTTP scheme, resolution occurs over TCP/IP: after DNS resolves the host to an IP, a client connects to the port, sends a GET request with the path and query, and receives the resource or a response code. For the URN scheme, resolution is namespace-specific, often involving dedicated resolvers that map the URN to locators via protocols like NAPTR DNS records or HTTP-based services. Unlike location-based schemes, URN resolution emphasizes persistence and may not yield direct access but rather equivalent URIs.
Error handling during resolution is scheme-dependent; for example, in HTTP, if the resource is unavailable, the server returns a 404 Not Found status code. Redirects are managed through 3xx status codes, instructing the client to follow an alternative URI for the resource. Invalid URIs or unreachable authorities may result in connection failures or protocol-specific errors, prompting applications to flag or retry as appropriate.[57]
Applications and Extensions
Use in Web Technologies
URIs play a foundational role in the Hypertext Transfer Protocol (HTTP), where they form the request-target that identifies the primary resource upon which an HTTP method is applied, such as GET for retrieval or POST for submission in RESTful architectures.[23] This usage enables precise addressing of resources on the server, supporting stateless interactions where the URI alone suffices to locate and operate on the target without additional session state.[23] In API design, URIs delineate endpoints that embody REST principles, allowing clients to manipulate resources through standardized methods while facilitating scalability and interoperability across distributed systems.
In markup languages like HTML and XML, URIs integrate seamlessly to enable linking and resource embedding. The href attribute in HTML's <a> element specifies a URI reference for hyperlinks, directing users or agents to connected documents or sections, while the src attribute in elements like <img> or <script> denotes a URI for loading external media or code.[58] Similarly, XML's XLink specification employs the href attribute to embed URI-based locators within elements, supporting bidirectional, multi-ended, and out-of-line links that extend beyond simple anchors to complex traversals in XML documents.[59]
Within the Semantic Web, URIs function as unique, global identifiers for abstract resources in RDF and OWL ontologies, where HTTP URIs are preferred for their dereferenceability—allowing retrieval of machine-readable descriptions (e.g., RDF/XML) via standard HTTP GET requests when accessed. This design promotes linked data principles, enabling automated discovery and integration of knowledge across the web by resolving identifiers to informative representations.
Contemporary web technologies extend URI applications to interactive and service-oriented protocols. WebSockets leverage the ws:// and wss:// URI schemes to initiate bidirectional communication channels over HTTP, with the URI specifying the server host, port, and resource path for the upgrade handshake.[60] Service workers register via a script URL and define an associated scope URL, intercepting fetch requests within that scope to enable offline functionality and caching.[61] GraphQL APIs typically expose a single HTTP endpoint URI (e.g., /graphql) for POST requests containing queries, allowing flexible data retrieval without multiple resource-specific URIs.[62]
URIs also underpin authentication and state management in web ecosystems. In HTTP cookies, the Domain and Path attributes derive from the request URI to scope cookie applicability, ensuring state is tied to specific origins.[63] OAuth 2.0 employs URIs for critical parameters like redirect_uri, which specifies the client endpoint for returning authorization codes or tokens, and pairs them with client_id, a unique identifier for the client application that is itself an opaque string rather than a URI.[64] Additionally, content negotiation in HTTP uses the request URI in conjunction with Accept headers to select resource variants, such as different media types or languages, based on client preferences.
Internationalization and IRIs
Internationalized Resource Identifiers (IRIs) extend the URI framework to support characters from the Universal Character Set (UCS), also known as Unicode or ISO 10646, enabling the use of non-ASCII scripts in resource identifiers.[25] Defined in RFC 3987 published in January 2005, an IRI is a sequence of characters that allows internationalized text while maintaining compatibility with existing URI infrastructure.[25] This extension addresses the limitations of URIs, which are restricted to ASCII characters, by permitting native representation of scripts such as Chinese, Arabic, or Cyrillic directly in the identifier.[25]
The syntax of an IRI closely mirrors that of a URI, as outlined in RFC 3986, but replaces the unreserved character set with an expanded set that includes UCS characters (denoted as UCSCHAR in the Augmented Backus-Naur Form or ABNF grammar).[25] Specifically, IRI components like the scheme, authority, path, query, and fragment follow the same hierarchical structure, but non-ASCII characters are allowed in positions where URIs permit unreserved characters, with reserved characters (such as /, ?, and #) retaining their roles as delimiters.[25] For instance, the authority component can include internationalized domain names via Internationalizing Domain Names in Applications (IDNA), while path and query segments support UCS characters without immediate encoding.[25]
To ensure interoperability with URI-based systems, IRIs are mapped to URIs through a process involving UTF-8 encoding followed by percent-encoding of non-ASCII octets.[25] The conversion algorithm first transforms the IRI's UCS characters (excluding those in the authority's ireg-name) into UTF-8 byte sequences, then applies percent-encoding to any bytes outside the US-ASCII range, producing a valid URI.[25] For the domain name portion (ireg-name), the ToASCII operation from RFC 3490 (IDNA), which relies on Punycode, is applied to convert internationalized labels to ASCII Compatible Encoding (ACE) form, prefixed with "xn--".[25] Conversely, the ToUnicode operation reverses this process, decoding percent-encoded sequences back to UTF-8 and interpreting ACE domains as Unicode labels where supported.[25] These mappings ensure that IRIs can be processed in legacy URI environments without loss of information.
IRIs have seen widespread adoption in web standards and implementations, particularly for global accessibility. The HTML Living Standard requires support for IRI semantics in URL handling, including parsing and serialization, to accommodate internationalized content in attributes like href. Modern web browsers handle IRIs by converting internationalized domain names to Punycode for DNS resolution while displaying the native script to users, as per IDNA guidelines; for example, Chrome and Firefox apply these conversions transparently in the address bar and link processing.[65] Protocols like HTTP/1.1 and HTML5 further integrate IRI support, allowing non-ASCII characters in headers and document references when encoded appropriately.[25]
Despite these advancements, IRIs present challenges related to text rendering and equivalence. Bidirectional text in scripts like Arabic or Hebrew requires logical storage order and application of the Unicode Bidirectional Algorithm, with restrictions prohibiting mixed-direction components within a single IRI to avoid visual confusion or security risks.[25] Normalization is another key issue; IRIs should be represented in Unicode Normalization Form C (NFC) to mitigate variations from different normalization forms, ensuring consistent comparison across systems, so that simple string matching or syntax-based normalization can then determine equivalence.[25]
For example, the IRI http://例.com/ページ maps to the URI http://xn--fsq.com/%E3%83%9A%E3%83%BC%E3%82%B8, where the domain is Punycode-encoded and the path segment is UTF-8 percent-encoded.[25] Similarly, http://résumé.example.org as an IRI becomes http://xn--rsum-bpad.example.org in URI form, showing Punycode applied to the domain with no further encoding needed because the rest of the IRI is ASCII.[25] These conversions highlight how IRIs facilitate multilingual web navigation while preserving URI compatibility.[25]
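The IRI-to-URI mapping can be approximated in Python with the built-in `idna` codec (which implements the IDNA 2003 rules of RFC 3490) and `urllib.parse.quote`. This is a rough sketch under stated assumptions: `iri_to_uri` is a hypothetical helper, it drops any userinfo, and it does not perform NFC normalization or bidirectional checks.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Left untouched by quote(): existing escapes ("%") and reserved delimiters.
SAFE = "%/:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI: IDNA for the host, UTF-8 percent-encoding elsewhere."""
    parts = urlsplit(iri)
    # Convert each host label to its ASCII Compatible Encoding ("xn--...") form.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),          # quote() encodes str input as UTF-8
        quote(parts.query, safe=SAFE + "?"),
        quote(parts.fragment, safe=SAFE + "?"),
    ))


print(iri_to_uri("http://例.com/ページ"))
print(iri_to_uri("http://résumé.example.org"))
```

Keeping "%" in the safe set avoids double-encoding sequences that are already percent-encoded, at the cost of not flagging a stray "%" in the input.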
Considerations
Normalization and Munging
Normalization standardizes URI representations to enable accurate comparison and determination of equivalence without accessing the referenced resource. The process, outlined in RFC 3986, involves syntax-based adjustments to eliminate variations that do not affect the identified resource.[66] These adjustments ensure that equivalent URIs, such as those differing only in case or encoding, are transformed into identical forms for syntactic equivalence.[67]
Case normalization converts the scheme and host components to lowercase, as they are case-insensitive. For example, the URI "HTTP://www.EXAMPLE.com/" normalizes to "http://www.example.com/". Hexadecimal digits within percent-encoded octets are also normalized to uppercase for consistency, treating "%3a" and "%3A" as equivalent.[68] Percent-encoding normalization decodes any percent-encoded octets that represent unreserved characters (such as A-Z, a-z, 0-9, hyphen, period, underscore, and tilde), removing unnecessary encodings like "%7E" for a tilde; encodings of other characters, such as "%20" for a space, must be left in place.[69] Path segment normalization applies the remove_dot_segments algorithm to eliminate "." and ".." segments, simplifying paths like "/docs/./../docs" to "/docs".[70]
After these transformations, syntactic equivalence is assessed by character-by-character comparison of the normalized strings; identical results indicate the URIs reference the same resource syntactically.[71] Semantic equivalence builds on this by incorporating scheme-specific rules, such as treating an empty path in HTTP URIs as equivalent to a path of "/". For instance, "http://example.com", "http://example.com/", and "http://example.com:80/" are semantically equivalent under HTTP rules.[72]
URL munging involves unauthorized or ad-hoc modifications to URIs that can alter their equivalence or cause resolution failures. Common practices include prepending "www." to the host component, such as changing "example.com" to "www.example.com", which may lead to errors if the server does not configure the subdomain equivalently. Another frequent alteration is appending or removing trailing slashes from paths, potentially creating duplicate content or triggering unintended redirects; for example, "http://example.com/page" and "http://example.com/page/" might resolve differently depending on server configuration.[73] Such changes disrupt canonical forms and can result in broken links or inconsistent resource access.
Best practices for handling normalization include building on established URI-handling functions in programming libraries. Python's urllib.parse module, for instance, provides functions like urlsplit and urlunsplit that lowercase the scheme and expose a lowercased hostname, but they do not decode percent-encodings or remove dot-segments on their own, so the remaining RFC 3986 normalization steps must be applied separately.[74] Implementations should apply full syntax-based normalization before comparison to avoid false non-equivalences, prioritizing these steps over scheme-specific adjustments unless required for the application context.
Security Implications
Uniform Resource Identifiers (URIs) introduce several security risks because they direct resource access, particularly when they are parsed or resolved without proper safeguards. Open redirects occur when applications accept untrusted URI inputs for redirection without validation, allowing attackers to steer users to malicious sites, often as a precursor to phishing or credential harvesting.[75] Similarly, injection attacks exploit query parameters or fragments in URIs; for instance, unescaped inputs in query strings can lead to cross-site scripting (XSS) if reflected into web pages, while fragments may trigger client-side script execution in vulnerable browsers.[76] Scheme-specific threats amplify these vulnerabilities. The javascript: URI scheme enables direct execution of JavaScript code in the context of the current page, facilitating XSS attacks when users click or navigate to links carrying malicious scripts; browsers historically allowed this for backward compatibility. The data: URI scheme, which embeds data directly into the URI, poses phishing risks by allowing attackers to craft self-contained pages that mimic legitimate sites, bypassing external hosting and evading some URL filters.
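The query-string injection risk described above can be illustrated with a minimal sketch, assuming a page that reflects a user-supplied search term into a link. The function name build_search_link and the base address are hypothetical; the sketch percent-encodes the value for the URI and HTML-escapes everything written into the page, which is one common defence rather than a complete one.

```python
from html import escape
from urllib.parse import urlencode


def build_search_link(base: str, term: str) -> str:
    """Return an HTML anchor whose query value and label are both escaped.

    `base` is assumed to be a trusted, application-controlled address;
    only `term` is treated as untrusted input.
    """
    # Percent-encode the untrusted value so it cannot add URI syntax.
    query = urlencode({"q": term})
    href = f"{base}?{query}"
    # HTML-escape both the attribute value and the visible label.
    return f'<a href="{escape(href, quote=True)}">{escape(term)}</a>'


link = build_search_link("https://example.com/search", "<script>alert(1)</script>")
print(link)
```

The raw input never appears unescaped in the output: it is percent-encoded inside the href and entity-escaped in the link text.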
Historical incidents highlight the real-world impact of URI-related exploits. In the late 2000s and early 2010s, URL shortening services like bit.ly were abused in campaigns such as the Koobface worm, which used shortened URIs to redirect users to malware downloads and spread via social media.[77] These exploits often combined redirection with obfuscated malicious payloads, demonstrating how URI opacity can facilitate large-scale attacks.
Mitigations focus on defensive handling of URIs during parsing and resolution. URI validation involves checking schemes, hosts, and parameters against whitelists to block untrusted inputs, while browser sandboxing isolates URI processing to prevent privilege escalation from malicious schemes.[78] Content-Security-Policy (CSP) headers provide an additional layer by restricting executable scripts and navigations, effectively blocking javascript: and certain data: executions in modern browsers.[79]
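A minimal sketch of the validation step, assuming a redirect target supplied by a user and a small allowlist, is shown here. The names ALLOWED_SCHEMES, ALLOWED_HOSTS, and is_allowed_redirect, and the host values, are hypothetical. The check relies on urllib.parse.urlsplit, which lowercases the scheme and exposes a lowercased host through its hostname attribute, so differences in letter case do not bypass the comparison.

```python
from urllib.parse import urlsplit

# Hypothetical policy values, chosen only for illustration.
ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_allowed_redirect(target: str) -> bool:
    """Return True only for absolute URIs with an allowed scheme and host."""
    try:
        parts = urlsplit(target)
        scheme = parts.scheme      # already lowercased by urlsplit
        host = parts.hostname      # lowercased; None when no host is present
    except ValueError:
        # Malformed authority (for example, an unterminated IPv6 literal).
        return False
    if scheme not in ALLOWED_SCHEMES:
        return False
    return host in ALLOWED_HOSTS


for candidate in (
    "HTTPS://WWW.EXAMPLE.COM/account",
    "javascript:alert(1)",
    "https://example.com@evil.test/",
    "//example.com/path",
):
    print(candidate, is_allowed_redirect(candidate))
```

Relative and scheme-relative references are rejected in this sketch because they carry no scheme; an application that needs to permit same-site relative paths would have to handle them as a separate case.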
Best practices emphasize proactive design to minimize exposure. Developers should avoid placing credentials in the userinfo component (the "username:password@host" form, which RFC 3986 deprecates), as doing so exposes credentials in logs and browser histories; instead, they should use secure alternatives such as HTTPS with authentication headers.[80] Applications should always validate allowed schemes (e.g., restricting to https:) and enforce HTTPS so that the path and query of a URI are encrypted in transit, preventing interception of sensitive parameters during resolution.[81]
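As an illustration of the userinfo concern, the following sketch removes any "user:password@" prefix from a URI before it is logged or stored. The function name strip_userinfo is hypothetical, and the approach (rebuilding the authority from the parsed host and port) is one possible technique under the assumption that the input parses with urllib.parse; it also lowercases the host as a side effect.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_userinfo(uri: str) -> str:
    """Return `uri` without its userinfo component, if it has one."""
    parts = urlsplit(uri)
    # URIs without an authority (e.g. mailto:) or without "@" are left as-is.
    if parts.hostname is None or "@" not in parts.netloc:
        return uri
    host = parts.hostname
    if ":" in host:
        # IPv6 literals must be re-wrapped in brackets.
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(strip_userinfo("https://alice:s3cret@example.com:8443/path?q=1#top"))
```

A stricter application could reject such URIs outright instead of rewriting them.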