| URL | |
|---|---|
| Uniform resource locator | |
| Abbreviation | URL |
| Status | Published |
| First published | 1994 |
| Latest version | Living Standard 2023 |
| Organization | Internet Engineering Task Force (IETF) |
| Committee | Web Hypertext Application Technology Working Group (WHATWG) |
| Series | Request for Comments (RFC) |
| Editors | Anne van Kesteren |
| Authors | Tim Berners-Lee |
| Base standards | |
| Related standards | URI, URN |
| Domain | World Wide Web |
| License | CC BY 4.0 |
| Website | url |
A uniform resource locator (URL), colloquially known as an address on the Web,[6] is a reference to a resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI),[7][1] although many people use the two terms interchangeably.[8][a] URLs occur most commonly to reference web pages (HTTP/HTTPS) but are also used for file transfer (FTP), email (mailto), database access (JDBC), and many other applications.
Most web browsers display the URL of a web page above the page in an address bar. A typical URL could have the form http://www.example.com/index.html, which indicates a protocol (http), a hostname (www.example.com), and a file name (index.html).
History
Uniform Resource Locators were defined in RFC 1738[10] in 1994 by Tim Berners-Lee, the inventor of the World Wide Web, and the URI working group of the Internet Engineering Task Force (IETF),[11] as an outcome of collaboration started at the IETF Living Documents birds of a feather session in 1992.[11][12]
The format combines the pre-existing system of domain names (created in 1985) with file path syntax, where slashes are used to separate directory and filenames. Conventions already existed where server names could be prefixed to complete file paths, preceded by a double slash (//).[13]
Berners-Lee later expressed regret at the use of dots to separate the parts of the domain name within URIs, wishing he had used slashes throughout,[13] and also said that, given the colon following the first component of a URI, the two slashes before the domain name were unnecessary.[14]
Early WorldWideWeb collaborators, including Berners-Lee, originally proposed the use of UDIs: Universal Document Identifiers. An early (1993) draft of the HTML Specification[15] referred to "Universal" Resource Locators. This was dropped some time between June 1994[16] and October 1994.[17] In his book Weaving the Web, Berners-Lee emphasizes his preference for the original inclusion of "universal" in the expansion rather than the word "uniform", to which it was later changed, and he gives a brief account of the contention that led to the change.
Syntax
Every HTTP URL conforms to the syntax of a generic URI. The URI generic syntax consists of five components organized hierarchically in order of decreasing significance from left to right:[1]: §3
URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
A component is undefined if it has an associated delimiter and the delimiter does not appear in the URI; the scheme and path components are always defined.[1]: §5.2.1 A component is empty if it has no characters; the scheme component is always non-empty.[1]: §3
The authority component consists of subcomponents:
authority = [userinfo "@"] host [":" port]
The URI comprises:
- A non-empty scheme component followed by a colon (:), consisting of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (+), period (.), or hyphen (-). Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters. Examples of popular schemes include http, https, ftp, mailto, file, data and irc. URI schemes should be registered with the Internet Assigned Numbers Authority (IANA), although non-registered schemes are used in practice.[18]
- An optional authority component preceded by two slashes (//), comprising:
  - An optional userinfo subcomponent followed by an at symbol (@), that may consist of a user name and an optional password preceded by a colon (:). Use of the format username:password in the userinfo subcomponent is deprecated for security reasons. Applications should not render as clear text any data after the first colon (:) found within a userinfo subcomponent unless the data after the colon is the empty string (indicating no password).
  - A host subcomponent, consisting of either a registered name (including but not limited to a hostname) or an IP address. IPv4 addresses must be in dot-decimal notation, and IPv6 addresses must be enclosed in brackets ([]).[1]: §3.2.2 [b]
  - An optional port subcomponent preceded by a colon (:), consisting of decimal digits.
- A path component, consisting of a sequence of path segments separated by a slash (/). A path is always defined for a URI, though the defined path may be empty (zero length). A segment may also be empty, resulting in two consecutive slashes (//) in the path component. A path component may resemble or map exactly to a file system path but does not always imply a relation to one. If an authority component is defined, then the path component must either be empty or begin with a slash (/). If an authority component is undefined, then the path cannot begin with an empty segment (that is, with two slashes, //), since the following characters would be interpreted as an authority component.[20]: §3.3
  - By convention, in http and https URIs, the last part of a path is named pathinfo and it is optional. It is composed of zero or more path segments that do not refer to an existing physical resource name (e.g. a file, an internal module program or an executable program) but to a logical part (e.g. a command or a qualifier part) that has to be passed separately to the first part of the path that identifies an executable module or program managed by a web server; this is often used to select dynamic content (a document, etc.) or to tailor it as requested (see also: CGI and PATH_INFO, etc.).
    - Example:
      - URI: "http://www.example.com/questions/3456/my-document"
      - where: "/questions" is the first part of the path (an executable module or program) and "/3456/my-document" is the second part of the path named pathinfo, which is passed to the executable module or program named "/questions" to select the requested document.
  - An http or https URI containing a pathinfo part without a query part may also be referred to as a 'clean URL,' whose last part may be a 'slug.'
- An optional query component preceded by a question mark (?), consisting of a query string of non-hierarchical data. Its syntax is not well defined, but by convention is most often a sequence of attribute–value pairs separated by a delimiter.

| Query delimiter | Example |
|---|---|
| Ampersand (&) | key1=value1&key2=value2 |
| Semicolon (;)[c] | key1=value1;key2=value2 |

- An optional fragment component preceded by a hash (#). The fragment contains a fragment identifier providing direction to a secondary resource, such as a section heading in an article identified by the remainder of the URI. When the primary resource is an HTML document, the fragment is often an id attribute of a specific element, and web browsers will scroll this element into view.
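The five components listed above, and the difference between an undefined and an empty component, can be illustrated with the regular expression given in Appendix B of RFC 3986. The following is a minimal sketch rather than a full parser; the helper name split_uri and the example URIs are assumptions made for illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B. It splits a URI reference
# into scheme, authority, path, query and fragment without validating them.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Hypothetical helper: None marks an undefined component, '' an empty one."""
    match = URI_PATTERN.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


# The "?" delimiter is present but followed by nothing: the query is empty.
with_empty_query = split_uri("http://www.example.com/index.html?#top")

# No "//", "?" or "#" delimiters: authority, query and fragment are undefined.
without_authority = split_uri("mailto:someone@example.com")
```

In this sketch a component whose delimiter is absent comes back as None, while a delimiter followed by no characters yields an empty string, which mirrors the undefined/empty distinction drawn in the text.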
A web browser will usually dereference a URL by performing an HTTP request to the specified host, by default on port number 80 for the http scheme and port number 443 for the https scheme. URLs using the https scheme require that requests and responses be made over a secure connection to the website.
Internationalized URL
Internet users are distributed throughout the world using a wide variety of languages and alphabets, and expect to be able to create URLs in their own local alphabets. An Internationalized Resource Identifier (IRI) is a form of URL that includes Unicode characters. All modern browsers support IRIs. The parts of the URL requiring special treatment for different alphabets are the domain name and path.[23][24]
The domain name in the IRI is known as an Internationalized Domain Name (IDN). Web and Internet software automatically convert the domain name into punycode usable by the Domain Name System; for example, the Chinese URL http://例子.卷筒纸 becomes http://xn--fsqu00a.xn--3lr804guic/. The xn-- indicates that the character was not originally ASCII.[25]
The URL path name can also be specified by the user in the local writing system. If not already encoded, it is converted to UTF-8, and any characters not part of the basic URL character set are escaped as hexadecimal using percent-encoding; for example, the Japanese URL http://example.com/引き割り.html becomes http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html. The target computer decodes the address and displays the page.[23]
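Both conversions can be sketched with the Python standard library. This is an illustration under stated assumptions, not a description of what any particular browser does: the helper names are hypothetical, and Python's built-in idna codec implements the older IDNA 2003 rules, which is sufficient for the example host used above.

```python
from urllib.parse import quote


def host_to_ascii(host: str) -> str:
    """Hypothetical helper: convert each label of a Unicode host to Punycode."""
    # The built-in "idna" codec (IDNA 2003) adds the "xn--" prefix per label.
    return host.encode("idna").decode("ascii")


def path_to_ascii(path: str) -> str:
    """Hypothetical helper: UTF-8 encode the path, then percent-encode it."""
    # quote() leaves "/" unescaped by default, so segment separators survive.
    return quote(path)


ascii_host = host_to_ascii("例子.卷筒纸")
ascii_path = path_to_ascii("/引き割り.html")

# Decoding reverses the host conversion.
round_trip_host = ascii_host.encode("ascii").decode("idna")
```

Here ascii_path is the percent-encoded form shown in the Japanese example, and ascii_host is the xn-- form that is handed to the Domain Name System.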
Protocol-relative URLs
Protocol-relative links (PRL), also known as protocol-relative URLs (PRURL), are URLs that have no protocol specified. For example, //example.com will use the protocol of the current page, typically HTTP or HTTPS.[26][27]
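As a brief sketch of how such a link inherits its scheme, Python's urllib.parse.urljoin resolves a reference that starts with // against the current page's URL. The page and script URLs below are made up for the example.

```python
from urllib.parse import urljoin

# A protocol-relative reference, as it might appear in a page's markup.
link = "//example.com/lib.js"

# The same link resolved from a page served over HTTPS and one served over HTTP.
from_https_page = urljoin("https://example.org/index.html", link)
from_http_page = urljoin("http://example.org/index.html", link)

print(from_https_page)  # https://example.com/lib.js
print(from_http_page)   # http://example.com/lib.js
```

Only the scheme is taken from the page; the host and path come entirely from the link.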
See also
- Hyperlink
- PURL – Persistent URL
- CURIE (Compact URI)
- URI fragment
- Internet resource locator (IRL)
- Internationalized Resource Identifier (IRI)
- Clean URL
- Typosquatting
- Uniform Resource Identifier
- URI normalization
- Use of slashes in networking
External links
- URL specification at WHATWG
- URL splitter that splits any URI into its parts
Notes
- ^ A URL implies the means to access an indicated resource and is denoted by a protocol or an access mechanism, which is not true of every URI.[1][8] Thus http://www.example.com is a URL, while www.example.com is not.[9]
- ^ For URIs relating to resources on the World Wide Web, some web browsers allow .0 portions of dot-decimal notation to be dropped or raw integer IP addresses to be used.[19]
- ^ Historic RFC 1866 (obsoleted by RFC 2854[21]) encourages CGI authors to support ';' in addition to '&'.[22]: §8.2.1
Citations
- ^ a b c d e f g T. Berners-Lee; R. Fielding; L. Masinter (January 2005). Uniform Resource Identifier (URI): Generic Syntax. Network Working Group. doi:10.17487/RFC3986. STD 66. RFC 3986. Internet Standard 66. Obsoletes RFC 2732, 2396 and 1808. Updated by RFC 6874, 7320 and 8820. Updates RFC 1738.
- ^ P. Hoffman (October 2005). The telnet URI Scheme. Network Working Group. doi:10.17487/RFC4248. RFC 4248. Proposed Standard. Obsoletes RFC 1738.
- ^ P. Hoffman (November 2005). The gopher URI Scheme. Network Working Group. doi:10.17487/RFC4266. RFC 4266. Proposed Standard. Obsoletes RFC 1738.
- ^ M. Duerst; L. Masinter; J. Zawinski (October 2010). The 'mailto' URI Scheme. Internet Engineering Task Force. doi:10.17487/RFC6068. ISSN 2070-1721. RFC 6068. Proposed Standard. Obsoletes RFC 2368.
- ^ M. Yevstifeyev (June 2011). The 'tn3270' URI Scheme. Internet Engineering Task Force. doi:10.17487/RFC6270. ISSN 2070-1721. RFC 6270. Proposed Standard. Updates RFC 1041, 1738 and 2355.
- ^ W3C (2009).
- ^ "Forward and Backslashes in URLs". zzz.buzz. Archived from the original on 2018-09-04. Retrieved 2018-09-19.
- ^ a b Mealling, Michael H.; Denenberg, Ray (August 2002). Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs, and Uniform Resource Names (URNs): Clarifications and Recommendations. Network Working Group. doi:10.17487/RFC3305. RFC 3305. Informational.
- ^ Miessler, Daniel. "The Difference Between URLs and URIs". Archived from the original on 2017-03-17. Retrieved 2017-03-16.
- ^ T. Berners-Lee; L. Masinter; M. McCahill (December 1994). Uniform Resource Locators (URL). Network Working Group. doi:10.17487/RFC1738. RFC 1738. Obsolete. Obsoleted by RFC 4248 and 4266. Updated by RFC 1808, 2368, 2396, 3986, 6196, 6270 and 8089.
- ^ a b W3C (1994).
- ^ IETF (1992).
- ^ a b Berners-Lee (2015).
- ^ BBC News (2009).
- ^ Berners-Lee, Tim; Connolly, Daniel "Dan" (March 1993). Hypertext Markup Language (draft RFCxxx) (Technical report). p. 28. Archived from the original on 2017-10-23. Retrieved 2017-10-23.
- ^ Berners-Lee, Tim (June 1994). Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web. Network Working Group. doi:10.17487/RFC1630. RFC 1630. Informational.
- ^ Berners-Lee, Tim; Masinter, Larry; McCahill, Mark Perry (October 1994). Uniform Resource Locators (URL) (Technical report). (This Internet-Draft was published later that year as RFC 1738) Cited in Ang, C. S.; Martin, D. C. (January 1995). Constituent Component Interface++ (Technical report). UCSF Library and Center for Knowledge Management. Archived from the original on 2017-10-23. Retrieved 2017-10-23.
- ^ Hansen, Tony; Hardie, Ted (June 2015). Thaler, Dave (ed.). Guidelines and Registration Procedures for URI Schemes. Internet Engineering Task Force. doi:10.17487/RFC7595. ISSN 2070-1721. BCP 35. RFC 7595. Best Current Practice 35. Updated by RFC 8615. Obsoletes RFC 4395.
- ^ Lawrence (2014).
- ^ T. Berners-Lee; R. Fielding; L. Masinter (August 1998). Uniform Resource Identifiers (URI): Generic Syntax. Network Working Group. doi:10.17487/RFC2396. RFC 2396. Obsolete. Obsoleted by RFC 3986. Updated by RFC 2732. Updates RFC 1808 and 1738.
- ^ D. Connolly; L. Masinter (June 2000). The 'text/html' Media Type. Network Working Group. doi:10.17487/RFC2854. RFC 2854. Informational / Legacy. Obsoletes RFC 1980, 1867, 1942, 1866 and 2070. Not endorsed by the IETF.
- ^ Berners-Lee, Tim; Connolly, Daniel W. (November 1995). Hypertext Markup Language - 2.0. Network Working Group. doi:10.17487/RFC1866. RFC 1866. Historic. Obsoleted by RFC 2854.
- ^ a b W3C (2008).
- ^ W3C (2014).
- ^ IANA (2003).
- ^ Glaser, J. D. (2014-03-10). Secure Development for Mobile Apps: How to Design and Code Secure Mobile Applications with PHP and JavaScript (1st ed.). CRC Press. p. 193. ISBN 978-1-48220903-7. Retrieved 2015-10-12.
- ^ Schafer, Steven M. (2011). HTML, XHTML, and CSS Bible (1st ed.). John Wiley & Sons. p. 124. ISBN 978-1-11808130-3. Retrieved 2015-10-12.
References
- "Berners-Lee "sorry" for slashes". BBC News. 2009-10-14. Archived from the original on 2020-06-05. Retrieved 2010-02-14.
- "Living Documents BoF Minutes". World Wide Web Consortium. 1992-03-18. Archived from the original on 2012-11-22. Retrieved 2011-12-26.
- Berners-Lee, Tim (1994-03-21). "Uniform Resource Locators (URL): A Syntax for the Expression of Access Information of Objects on the Network". World Wide Web Consortium. Archived from the original on 2015-09-09. Retrieved 2015-09-13.
- Berners-Lee, Tim (2015) [2000]. "Why the //, #, etc?". Frequently asked questions. World Wide Web Consortium. Archived from the original on 2020-05-14. Retrieved 2010-02-03.
- Connolly, Daniel "Dan"; Sperberg-McQueen, C. Michael, eds. (2009-05-21). "Web addresses in HTML 5". World Wide Web Consortium. Archived from the original on 2015-07-10. Retrieved 2015-09-13.
- IANA (2003-02-14). "Completion of IANA Selection of IDNA Prefix". IETF-Announce mailing list. Archived from the original on 2004-12-08. Retrieved 2015-09-03.
- "An Introduction to Multilingual Web Addresses". 2008-05-09. Archived from the original on 2015-01-05. Retrieved 2015-01-11.
- Phillip, A. (2014). "What is Happening with "International URLs"". World Wide Web Consortium. Archived from the original on 2015-02-17. Retrieved 2015-01-11.
- Lawrence, Eric (2014-03-06). "Browser Arcana: IP Literals in URLs". Microsoft Learn. Archived from the original on 2020-06-22. Retrieved 2020-06-22.
https) indicating the protocol, an optional authority part (including host and port), a path to the resource, an optional query string for parameters, and a fragment identifier for specific sections.[1] For example, in https://example.com/path?query=value#fragment, each element directs the retrieval process.[1] These elements must adhere to encoding rules, using percent-encoding for special characters to ensure safe transmission.[1]
URLs are foundational to the modern Web, powering hyperlinks, APIs, and data exchange, with approximately 1.2 billion websites relying on them as of 2025 for global resource access.[4] Their evolution continues through updates to URI standards, addressing issues like internationalization and security (e.g., via HTTPS).[1]
Fundamentals
Definition and Purpose
A Uniform Resource Locator (URL) is a specific type of Uniform Resource Identifier (URI) that not only identifies a resource but also specifies its primary access mechanism and network location, enabling retrieval over the internet.[5] This string-based reference follows a standardized format to denote both where a resource is located and how to access it, distinguishing it within the broader URI framework.[6] URLs were formally defined in 1994 through RFC 1738, authored by Tim Berners-Lee and colleagues as part of the early World Wide Web infrastructure.[6]
The core purpose of a URL is to provide a compact, precise means for addressing and retrieving diverse internet resources, such as web pages, downloadable files, or online services.[6] For instance, the URL http://www.example.com/path/to/resource indicates the Hypertext Transfer Protocol (HTTP) for access, the domain name www.example.com as the host, and /path/to/resource as the specific location within that host's namespace.[6] By standardizing this addressing, URLs facilitate seamless navigation and interaction across distributed networks, forming the foundational mechanism for hyperlink-based systems like the web.[7]
Key characteristics of URLs include their reliance on a consistent syntactic structure to ensure interoperability, while allowing for both absolute forms, which contain the complete address from protocol to resource path, and relative forms, which depend on a contextual base URL for resolution.[8] As a subset of URIs, URLs emphasize locatability alongside identification, prioritizing practical access over mere naming.[5]
Relation to URI and URN
A Uniform Resource Identifier (URI) serves as a generic framework for identifying abstract or physical resources on the Internet, encompassing both names and locations through a standardized syntax and semantics. This framework, formalized in RFC 3986 published in January 2005 by the Internet Engineering Task Force (IETF), defines URIs as compact strings that enable uniform identification without specifying how to access the resource, allowing for flexibility across various protocols and systems. URIs include subclasses such as Uniform Resource Locators (URLs) and Uniform Resource Names (URNs), forming a hierarchical taxonomy for resource referencing.
URLs represent a specific subset of URIs that not only identify a resource but also provide a mechanism for locating and accessing it, typically by specifying a protocol or scheme such as HTTP or FTP. In contrast to more abstract URIs, a URL's inclusion of an access method, often through its scheme component, enables direct retrieval, making it essential for web navigation and hypertext linking. This distinction was clarified in RFC 3986, which positions URLs as URIs with the additional attribute of denoting a resource's location and retrieval process.
Uniform Resource Names (URNs), another subset of URIs, focus on providing persistent, location-independent names for resources, without implying any specific retrieval mechanism. Defined in RFC 2141 from May 1997, URNs use a syntax starting with "urn:" followed by a namespace identifier and name, such as "urn:isbn:0451450523" for a book, ensuring long-term stability even if the resource's location changes. Unlike URLs, URNs do not specify an access mechanism, emphasizing naming over location to support applications like digital libraries and metadata systems.
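The contrast between a URL and a URN can be made concrete with a small sketch built on Python's urllib.parse.urlsplit. The classification rule used here (the urn scheme means URN, anything else with a scheme is treated as a URL) is a deliberate simplification, and the function name describe is hypothetical.

```python
from urllib.parse import urlsplit


def describe(identifier: str) -> dict:
    """Hypothetical helper: summarise an identifier as a URN or a URL."""
    parts = urlsplit(identifier)
    if parts.scheme == "urn":
        # After "urn:", the text up to the next colon is the namespace
        # identifier and the remainder is the namespace-specific name.
        namespace, _, name = parts.path.partition(":")
        return {"kind": "URN", "namespace": namespace, "name": name}
    # Simplifying assumption: any other scheme names an access mechanism.
    return {"kind": "URL", "scheme": parts.scheme, "host": parts.hostname}


book = describe("urn:isbn:0451450523")
page = describe("http://www.example.com/path/to/resource")
```

The URN yields a namespace and a name but no host, while the URL yields a scheme and a host that together say how and where to fetch the resource.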
Over time, the URI framework has evolved to address practical web implementation challenges, with the WHATWG URL Living Standard (last updated on 30 October 2025) refining URI syntax for better compatibility with modern browsers and web technologies.[9] This standard builds on RFC 3986 by incorporating parsing algorithms and handling edge cases specific to URL usage in HTML and JavaScript environments, while maintaining backward compatibility with the broader URI model. It underscores URLs' role in web addressing by aligning URI principles with real-world deployment needs, without altering the core distinctions between URIs, URLs, and URNs.
Historical Development
Origins and Early Concepts
The origins of Uniform Resource Locators (URLs) trace back to the addressing mechanisms prevalent in the pre-web era of computer networking during the 1980s. The Domain Name System (DNS), introduced in 1985, established a hierarchical structure for naming internet hosts, transitioning from numeric IP addresses to human-readable domain names like symbolics.com, the first registered second-level domain.[10] This system built upon earlier ARPANET conventions, where file paths in protocols such as File Transfer Protocol (FTP), formalized in the 1970s but extensively used in the 1980s, enabled users to specify locations of files on remote servers, forming a foundational model for resource identification.[11]
Tim Berners-Lee's 1989 proposal at CERN for a hypertext-based information management system indirectly influenced URL development by highlighting the need for interconnected document access across distributed environments.[12] This vision evolved into early prototypes that integrated the Hypertext Transfer Protocol (HTTP) with addressable hyperlinks in Hypertext Markup Language (HTML), allowing documents to reference each other via simple locators and paving the way for a cohesive web infrastructure.
A key event occurred on March 18, 1992, during a Birds of a Feather (BOF) session at the Internet Engineering Task Force (IETF) meeting, where Berners-Lee presented the World Wide Web and advocated for a unified addressing scheme to interlink diverse network information systems.[13] He proposed Universal Document Identifiers (UDIs) that prefixed protocol names (like HTTP or FTP) to resource handles, aiming to create a seamless naming convention. Initial challenges centered on the requirement for a universal locator capable of abstracting multiple protocols, including HTTP, FTP, and Gopher, while hiding implementation details from users to facilitate global resource discovery.[13]
Formal Standardization and Evolution
The formal standardization of URLs began with RFC 1738, published by the Internet Engineering Task Force (IETF) in December 1994, which provided the first official specification for Uniform Resource Locators as a compact string representation for locating and accessing resources on the Internet.[3] This document outlined the basic syntax, including schemes such as HTTP, FTP, and Gopher, along with rules for encoding unsafe characters to ensure interoperability across network protocols.[3]
In January 2005, RFC 3986 superseded earlier specifications by defining a generic syntax for Uniform Resource Identifiers (URIs), explicitly incorporating URLs as a subset focused on resource location via specific access methods.[1] This standard clarified the handling of percent-encoding for non-ASCII and reserved characters, distinguishing between unreserved characters that could remain literal and those requiring encoding to avoid delimiter conflicts, thereby improving precision in URI resolution.[14] Additionally, RFC 3986 incorporated support for IPv6 addresses within the host component of URLs (previously specified separately in RFC 2732, which it obsoletes), using square bracket enclosure for literals like [2001:db8::1] to accommodate the expanded addressing needs of modern networks.[15]
The Web Hypertext Application Technology Working Group (WHATWG) has driven ongoing evolution through its URL Living Standard, first developed in the mid-2000s and continuously updated to address practical web implementation challenges.[9] As of its latest revisions, this standard refines URL parsing to resolve inconsistencies among web browsers, providing detailed state-machine algorithms for decomposing URLs into components like scheme, host, and path while ensuring idempotent serialization.[9] It builds on RFC 3986 by prioritizing web-specific behaviors, such as robust handling of malformed inputs and enhanced JavaScript APIs for dynamic URL manipulation.[9]
Criticisms of early URL design have influenced refinements, notably Tim Berners-Lee's 2009 reflection that the double slash (//) after the scheme was an unnecessary artifact from programming conventions, adding redundancy without functional benefit.[16] Subsequent updates, including those in RFC 3986 and the WHATWG standard, have incorporated such feedback by streamlining syntax where possible and extending support for emerging technologies like IPv6 to mitigate address exhaustion issues from IPv4.[15]
Syntax and Components
Overall Structure
A Uniform Resource Locator (URL) adheres to the generic syntax of a Uniform Resource Identifier (URI), providing a structured format for identifying resources on the internet. The overall structure is defined as scheme ":" hier-part [ "?" query ] [ "#" fragment ], where the hier-part typically includes //authority followed by the path for network-based schemes. Delimiters such as : separate the scheme from the hierarchical part, // introduce the authority, ? precedes the query, and # denotes the fragment, ensuring unambiguous parsing of components.[1]
Absolute URLs include the full scheme and authority, enabling standalone resolution without additional context, as in https://example.com/path. In contrast, relative URLs omit the scheme (and usually the authority), relying on a base URL for resolution; for example, /path resolves against the root of the base URL's authority, path (with no leading slash) resolves relative to the directory of the base, and ../path navigates upward in the hierarchy. This distinction supports efficient referencing in documents like HTML, where relative forms reduce redundancy.[1]
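Relative resolution can be tried out with Python's urllib.parse.urljoin, which resolves a reference against a base URL. The base URL below is an assumption chosen so that each kind of reference produces a visibly different result.

```python
from urllib.parse import urljoin

# Assumed base URL: a document two directories below the root.
base = "https://example.com/docs/guide/intro.html"

# Leading slash: resolved against the root of the base URL's authority.
root_relative = urljoin(base, "/path")

# No leading slash: resolved against the directory containing the base document.
directory_relative = urljoin(base, "path")

# "..": climbs one level above the base document's directory.
parent_relative = urljoin(base, "../path")

# An absolute URL ignores the base entirely.
absolute = urljoin(base, "https://other.example/path")

print(root_relative)       # https://example.com/path
print(directory_relative)  # https://example.com/docs/guide/path
print(parent_relative)     # https://example.com/docs/path
print(absolute)            # https://other.example/path
```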
URLs consist of characters that are either unreserved or reserved, with the former usable directly in most positions. Unreserved characters include letters and digits (A-Z, a-z, 0-9) and the symbols -, ., _, and ~, which do not require encoding. Reserved characters, such as :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, and =, serve special syntactic roles and must be percent-encoded (e.g., %3A for :) when used as data rather than as delimiters to avoid misinterpretation.[1]
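The split between unreserved and reserved characters can be observed with urllib.parse.quote from the Python standard library. In this sketch the constant and function names are hypothetical, and passing safe="" is a choice made here so that every reserved character is treated as data.

```python
from urllib.parse import quote

# Character sets as listed in RFC 3986 (gen-delims followed by sub-delims).
RESERVED = ":/?#[]@" + "!$&'()*+,;="
UNRESERVED_SYMBOLS = "-._~"


def encode_data(text: str) -> str:
    """Hypothetical helper: percent-encode everything except unreserved characters."""
    # safe="" overrides quote()'s default of leaving "/" unescaped.
    return quote(text, safe="")


colon = encode_data(":")                       # "%3A"
symbols = encode_data(UNRESERVED_SYMBOLS)      # unchanged
pair_as_data = encode_data("key=a&b")          # "=" and "&" lose their delimiter role

# Every reserved character is replaced by a %HH triplet when treated as data.
all_reserved_encoded = all(encode_data(ch).startswith("%") for ch in RESERVED)
```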
For instance, the URL http://user:pass@example.com:80/path?key=value#section decomposes into the scheme http, authority user:pass@example.com:80, path /path, query key=value, and fragment section, with delimiters clearly separating each part for resolution by clients like web browsers.[1]
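A decomposition of this kind can be reproduced with urllib.parse.urlsplit. The URL used in the sketch is an assumed example with a user:pass userinfo part; it is illustrative only, since embedding credentials in URLs is discouraged.

```python
from urllib.parse import urlsplit

# Assumed example URL containing all five components plus userinfo and port.
parts = urlsplit("http://user:pass@example.com:80/path?key=value#section")

print(parts.scheme)    # http
print(parts.netloc)    # user:pass@example.com:80  (the authority)
print(parts.path)      # /path
print(parts.query)     # key=value
print(parts.fragment)  # section

# The authority can be broken down further into its subcomponents.
print(parts.username, parts.password, parts.hostname, parts.port)
```

urlsplit returns the port as an integer and the remaining parts as strings, with the delimiters themselves (:, //, ?, #) stripped from the component values.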
Scheme and Authority
The scheme, also known as the protocol identifier, specifies the protocol or access method used to interact with the resource identified by the URL. According to RFC 3986, the scheme consists of a letter (A-Z, a-z) followed by zero or more letters, digits, plus signs (+), periods (.), or hyphens (-), and it is case-insensitive, though it is recommended to express schemes in lowercase letters.[17] The scheme is immediately followed by a colon (:); when an authority component is present, two forward slashes (//) follow the colon and mark its beginning.[18] Common schemes include "http" for Hypertext Transfer Protocol, "https" for secure HTTP, "ftp" for File Transfer Protocol, and "mailto" for email addresses. Each scheme may define a default port for network communication; for instance, the "http" scheme defaults to port 80 on TCP, while "https" defaults to port 443.[19]
The authority component follows the double slash and provides the location of the resource server; it is optional in the generic syntax but required for hierarchical network schemes like HTTP. It is structured as [userinfo "@"] host [":" port], where the userinfo subcomponent (if present) contains authentication credentials in the form of a username and optional password separated by a colon (e.g., user:pass@), though its use is discouraged due to security risks in modern implementations.[20][21] The host subcomponent identifies the server, either as a registered name (domain) resolved via the Domain Name System (DNS) or as an IP address literal.[15] For IPv4 addresses, the host is written in dotted-decimal notation (e.g., 192.0.2.1), while IPv6 addresses must be enclosed in square brackets to distinguish them from port numbers (e.g., [2001:db8::1]).[15] The port subcomponent, if specified, is a decimal integer following a colon (e.g., :8080), indicating the network port; it may be omitted when the default port for the scheme is used.[19]
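The interplay of scheme, host, and default port can be sketched as follows. The DEFAULT_PORTS table covers only the two schemes whose defaults are stated above, and the function name effective_endpoint is hypothetical.

```python
from urllib.parse import urlsplit

# Assumed lookup table limited to the defaults mentioned in the text.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_endpoint(url: str) -> tuple:
    """Hypothetical helper: return (scheme, host, port) a client would connect to."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme and hostname, reflecting case-insensitivity.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port


# Port omitted: the scheme's default applies.
default_port = effective_endpoint("HTTPS://Example.COM/index.html")

# Bracketed IPv6 literal: the brackets keep the address apart from the port.
ipv6_host = effective_endpoint("http://[2001:db8::1]:8080/")

print(default_port)  # ('https', 'example.com', 443)
print(ipv6_host)     # ('http', '2001:db8::1', 8080)
```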
Within the authority, characters are restricted to avoid ambiguity, with percent-encoding used to represent reserved or non-ASCII characters. Percent-encoding converts an octet (byte) to a percent sign (%) followed by two hexadecimal digits (e.g., space as %20), based on UTF-8 encoding for international characters outside the allowed set of unreserved characters (A-Z, a-z, 0-9, -, ., _, ~), sub-delimiters (!, $, &, ', (, ), *, +, ,, ;, =), and colon in specific contexts.[14][20] In the host's registered name, percent-encoding applies to non-ASCII characters after UTF-8 conversion, ensuring compatibility with ASCII-based systems like DNS.[22] For example, a domain with a space might appear as example%20host.com, though spaces are invalid in valid hostnames and should be avoided.[15] This encoding mechanism maintains the structural integrity of the authority during transmission and parsing.[14]
Path, Query, and Fragment
The path component of a URL specifies the hierarchical location of a resource within the scope defined by the scheme and authority, consisting of a sequence of path segments separated by forward slashes (/).[23] It may be absolute (starting with /), rootless (starting with a segment without leading /), or empty, where an empty path implies the root resource when an authority is present.[23] For example, in the URL https://example.com/wiki/Uniform_Resource_Locator, the path /wiki/Uniform_Resource_Locator identifies a resource hierarchically under the "wiki" directory.[23] Path segments can include dot-segments like "." (current directory) or ".." (parent directory), which are resolved and removed during URI normalization to avoid redundancy.[24]
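Dot-segment removal can be sketched in a few lines. The function below is a simplified illustration that assumes an absolute path (one starting with /); it follows the behaviour described in RFC 3986 for such paths but is not a transcription of the RFC's algorithm, and its name is chosen for this example.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch: resolve "." and ".." in an absolute path (assumed to start with "/")."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # "." names the current level; keep a trailing slash if it ends the path.
            if is_last:
                output.append("")
        elif segment == "..":
            # ".." drops the previous segment but never climbs above the root.
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
print(remove_dot_segments("/a/b/.."))           # /a/
print(remove_dot_segments("/../a"))             # /a
```

The first output element is the empty string before the leading slash, which is why the function only pops when more than one element is present.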
The query component follows the path, delimited by a question mark (?), and provides optional, non-hierarchical parameters to further specify the resource or modify the request.[25] It is typically structured as key-value pairs separated by ampersands (&), though no universal format is mandated and implementations often define application-specific conventions, such as ?search=URL&sort=asc in https://example.com/search?search=URL&sort=asc.[25] The query may contain any character allowed in a path segment (pchar), as well as slashes (/) and question marks (?) as data, enabling flexible data transmission without implying hierarchy.[25]
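The conventional key-value reading of a query string is what urllib.parse.parse_qsl and urlencode implement in Python. The sketch below uses the example URL from the paragraph above; it reflects the ampersand convention, not a rule of the URI syntax itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

url = "https://example.com/search?search=URL&sort=asc"

# Split the query into ordered (key, value) pairs using "&" and "=".
pairs = parse_qsl(urlsplit(url).query)
print(pairs)  # [('search', 'URL'), ('sort', 'asc')]

# Building a query string from pairs is the inverse operation.
rebuilt = urlencode(pairs)
print(rebuilt)  # search=URL&sort=asc
```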
The fragment identifier, introduced by a hash (#) after the query (or path if no query), serves as an intra-document reference to a secondary resource or specific portion of the primary resource retrieved by the URL.[26] It is processed client-side and not transmitted to the server during resource retrieval, facilitating navigation within documents, such as #introduction in https://example.com/document.html#introduction to jump to a named section.[26] The fragment's interpretation depends on the media type of the resource, allowing formats like element IDs in HTML or byte offsets in other media.[26]
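Because the fragment stays on the client, a program that fetches a resource typically separates it from the rest of the URL first. A minimal sketch with urllib.parse.urldefrag, using an assumed document URL:

```python
from urllib.parse import urldefrag

# Assumed example: a document with a named section.
result = urldefrag("https://example.com/document.html#introduction")

# The part that would be requested from the server.
print(result.url)       # https://example.com/document.html

# The part that the client interprets after retrieval.
print(result.fragment)  # introduction
```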
In the path and query components, reserved characters—such as /, ?, #, and others like :, @, and sub-delimiters (!, $, &, etc.)—must be percent-encoded (e.g., / as %2F) when used as data rather than delimiters to preserve structural integrity.[14] Percent-encoding represents octets as % followed by two hexadecimal digits (e.g., space as %20), while unreserved characters (letters, digits, -, ., _, ~) remain unencoded.[27] The fragment follows similar encoding rules, permitting / and ? as data, but decoding occurs after retrieval based on the resource's syntax.[26] These rules ensure unambiguous parsing across diverse systems.[28]
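These encoding rules matter most when data that happens to contain delimiter characters is placed into a URL. The sketch below builds a URL from a path segment containing a slash and a query value containing an ampersand and a question mark, then recovers both; the host, path prefix, and values are invented for the example.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

segment = "AC/DC"        # data containing the path delimiter "/"
value = "rock & roll?"   # data containing "&" and "?"

# safe="" forces "/" in the segment to be encoded as %2F; quote_via=quote makes
# urlencode emit %20 for spaces instead of the "+" used by form encoding.
url = (
    "https://example.com/artists/"
    + quote(segment, safe="")
    + "?"
    + urlencode({"q": value}, quote_via=quote)
)
print(url)  # https://example.com/artists/AC%2FDC?q=rock%20%26%20roll%3F

# Parsing first and decoding afterwards recovers the original data intact.
parts = urlsplit(url)
decoded_segment = unquote(parts.path.split("/")[-1])
decoded_value = parse_qs(parts.query)["q"][0]
```

Splitting on delimiters before decoding is the important design choice: decoding first would turn %2F back into a slash and change the number of path segments.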
Variations and Extensions
Internationalized Resource Identifiers
Internationalized Resource Identifiers (IRIs) extend the Uniform Resource Identifier (URI) framework, including URLs, to support Unicode characters from natural languages beyond the limited US-ASCII set, enabling more intuitive resource identification in global contexts.[29] Defined in RFC 3987 (2005), an IRI is a sequence of Unicode characters that follows a syntax similar to URIs but allows non-ASCII characters in most components, with a bidirectional mapping to URIs for compatibility with existing protocols.[29] This extension addresses the limitations of ASCII-only URIs by permitting international scripts in identifiers while maintaining interoperability through standardized encoding.[29]
For domain names within the authority component, IRIs incorporate Internationalized Domain Names (IDNs) using the Internationalizing Domain Names in Applications (IDNA) protocol, which maps Unicode domain labels to ASCII-compatible encodings for DNS resolution.[30] IDNA employs Punycode (RFC 3492), a bootstring encoding that transforms non-ASCII Unicode strings into ASCII strings prefixed with "xn--", preserving the original characters' order and allowing reversible decoding.[31] For example, the domain "café.com" is encoded as "xn--caf-dma.com", where "é" (U+00E9) becomes "dma" via delta-based encoding in base-36 representation.[31] The updated IDNA2008 specification (RFC 5890) refines these rules by rejecting unassigned code points and bypassing earlier string preparation steps, but retains Punycode for encoding U-labels into A-labels.[30]
In the path and query components, IRIs allow direct use of Unicode characters, which are converted to URIs by first applying Unicode Normalization Form C (NFC) if necessary, then encoding the resulting string in UTF-8, and applying percent-encoding (%HH) to any non-ASCII octets.[29] For instance, the path segment "café" is UTF-8 encoded as the bytes C3 A9 for "é", then percent-encoded as "%C3%A9" in the URI form.[29] This process ensures that IRIs remain human-readable in their native scripts while producing valid URIs for transmission over ASCII-based networks.[29]
Web browsers and user agents must support IRI-to-URI conversion for resolution, typically displaying IDNs in their native Unicode form when safe and converting to Punycode for DNS lookups.[32] Modern browsers like Chrome, Firefox, Safari, and Edge handle this by normalizing inputs per Unicode standards and applying IDNA mappings, though they require explicit protocol support for full IRI usage.[32] However, IDN support introduces risks such as homograph attacks, where visually similar characters from different scripts (e.g., Cyrillic "а" resembling Latin "a") enable phishing by spoofing legitimate domains like "apple.com" as "аpple.com".[33] To mitigate this, browsers implement policies like displaying Punycode for mixed-script or suspicious IDNs, using whitelists for trusted top-level domains, and alerting users to potential confusable characters.[32]
Protocol-Relative and Relative URLs
Relative URLs are Uniform Resource Locators that omit certain components, such as the scheme or authority, and are resolved relative to a base URL, typically the URL of the current document or resource.[34] This form allows for more concise referencing of resources within the same context, reducing redundancy in markup languages like HTML and CSS.[35] According to RFC 3986, relative URLs fall into three main categories based on their starting structure: relative-path references (e.g., sibling.html or ../parent/folder/), absolute-path references (e.g., /path/to/resource), and network-path references (e.g., //example.com/path).[34]
Network-path references, commonly known as protocol-relative URLs, begin with // followed by an authority (host and optional port) and path, inheriting the scheme from the base URL.[34] For instance, on a page loaded via https://example.com, the reference //cdn.example.net/script.js resolves to https://cdn.example.net/script.js.[36] This inheritance ensures the resource uses the same protocol as the base, which was historically useful for avoiding mixed-content warnings in environments transitioning between HTTP and HTTPS.
The resolution of both relative and protocol-relative URLs follows a standardized algorithm outlined in RFC 3986, which parses the base URL, applies the relative components, merges paths (handling dot-segments like . and .. to navigate hierarchies), and reconstructs the target URL.[37] For example, with a base URL of https://a.com/b/c/, the relative reference ../d?q#f resolves to https://a.com/b/d?q#f.[38] This process is implemented consistently in modern browsers via the WHATWG URL Standard, though older implementations occasionally varied in query and fragment handling.[39]
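The three reference categories can be resolved with Python's urllib.parse.urljoin, which implements the RFC 3986 resolution rules. The sketch below reuses the base URL and references from the examples above; the variable names are illustrative.

```python
from urllib.parse import urljoin

base = "https://a.com/b/c/"

# Relative-path reference: ".." climbs out of /b/c/ before "d" is appended.
parent = urljoin(base, "../d?q#f")

# Relative-path reference without dot-segments: appended to the base directory.
sibling = urljoin(base, "sibling.html")

# Absolute-path reference: replaces the whole path, keeps scheme and authority.
absolute_path = urljoin(base, "/path/to/resource")

# Network-path (protocol-relative) reference: keeps only the base scheme.
network_path = urljoin(base, "//cdn.example.net/script.js")
```

The first call returns https://a.com/b/d?q#f, matching the worked example in the text.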
In practice, relative URLs are widely used in HTML attributes like href for internal links (e.g., <a href="/docs/section" rel="nofollow">) and src for images or scripts, enabling site portability without hardcoding full paths. Similarly, in CSS, they reference assets such as background images (e.g., background-image: url("../images/logo.png");) to maintain modularity across different deployment environments.[35] Protocol-relative URLs found particular application in the 2010s for loading third-party resources like CDNs (e.g., //ajax.googleapis.com/ajax/libs/jquery/), allowing seamless protocol switching without mixed-content blocks.
However, protocol-relative URLs carry limitations, as they cannot cross scheme boundaries—if the base uses HTTPS but the target authority does not support it, the request may fail or trigger redirects, leading to performance overhead.[36] They also inherit potential insecurities from the base scheme, such as loading over HTTP on non-secure pages, which exposes resources to interception.[40] In mixed-content contexts, where an HTTPS page attempts to load HTTP subresources, browsers block active content like scripts, though protocol-relative avoids this by matching the scheme—but only if the target enforces HTTPS.[41]
Post-2010s, with the widespread adoption of HTTPS Everywhere initiatives, protocol-relative URLs have become discouraged as an anti-pattern, as they can enable man-in-the-middle attacks if the initial connection lacks encryption and miss HTTPS-specific optimizations like HTTP/2.[42] Standards bodies now recommend explicit https:// schemes for external resources to ensure end-to-end security and reliability.[43] Browser handling has standardized under the WHATWG URL API, minimizing variations, but legacy systems or proxies may still interpret relative paths differently, particularly with non-ASCII characters or complex queries.[39] Relative URLs in general remain essential for internal navigation but should avoid cross-origin or cross-scheme scenarios to prevent resolution errors.[44]
Usage and Implementation
Parsing and Resolution Mechanisms
The parsing of a URL involves a state-based algorithm that decomposes the input string into components such as scheme, authority, path, query, and fragment, while applying normalization rules to ensure consistency. According to the WHATWG URL Standard, the process begins in the "scheme start state," where the input is checked for an initial ASCII alpha character to enter the "scheme state." In this state, the scheme is built by collecting ASCII alphanumeric characters, plus signs (+), hyphens (-), or periods (.), each converted to lowercase, until a colon (:) is encountered, validating the scheme's format.[45] If no valid scheme is found and no base URL is provided (or the base has an opaque path), the parsing fails.[46]
Following scheme validation, the parser transitions to handle the authority component in the "authority state," collecting username and password (if present) until an at-sign (@), then parsing the host until a slash (/), question mark (?), hash (#), or end of input. Percent-encoding in the userinfo is applied using the userinfo percent-encode set to ensure safe transmission of special characters. The host is then parsed via a dedicated host parser, which supports IPv4, IPv6, and domain names, failing on invalid inputs like unbalanced brackets in IPv6 addresses. The path is processed in the "path state," where segments are split by slashes, with normalization applied: single-dot segments (.) are dropped, leaving a trailing slash when they end the path, and double-dot segments (..) shorten the path by removing the last segment. Backslashes (\) are treated as forward slashes (/) for special schemes like http or https.[45][47]
Query and fragment handling occur after the path: a question mark (?) initiates the query string, which is percent-encoded using the query percent-encode set, and a hash (#) starts the fragment, which is encoded with the fragment percent-encode set. Existing %HH sequences (where H is a hexadecimal digit) are kept as written rather than decoded. For example, the input https://example.com/?q=test%20value#section results in a query of "q=test%20value" and a fragment of "section"; the %20 becomes a space only when an application later decodes the query, yielding "q=test value". Non-ASCII characters are encoded as UTF-8 before being percent-encoded, and malformed percent sequences are reported as validation errors without aborting the parse, so the serialized URL is in a canonical form suitable for resource access.[45][48]
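The first two states can be approximated in a few lines of Python. This is a simplified sketch, not the WHATWG algorithm itself: it ignores base URLs, state overrides, and validation errors, and the function name parse_scheme is hypothetical.

```python
import string

# Characters accepted after the first letter of a scheme.
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def parse_scheme(url):
    """Return (scheme, remainder), or None when no valid scheme is present."""
    # "Scheme start state": the first character must be an ASCII letter.
    if not url or url[0] not in string.ascii_letters:
        return None
    # "Scheme state": collect allowed characters until the colon.
    for index, char in enumerate(url):
        if char == ":":
            return url[:index].lower(), url[index + 1:]
        if char not in SCHEME_CHARS:
            return None
    return None  # input ended without a colon


result = parse_scheme("HTTPS://example.com/")
```

A return value of None corresponds to the case where a real parser would fall back to the base URL or fail.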
URL resolution extends parsing by constructing an absolute URL from a relative reference and a base URL, following rules that preserve the base's scheme, host, and port while appending or modifying the relative components. The WHATWG standard specifies that if the input lacks a scheme, it copies the base's scheme and authority, then resolves the path relative to the base path: for instance, resolving ./foo against base http://example.com/bar/ yields http://example.com/bar/foo, because the single dot refers to the base's own directory and "foo" is appended to it. If the relative URL starts with //, it adopts the base scheme but uses the new authority; an input that carries its own scheme (e.g., ftp://...) overrides the base entirely. This mechanism ensures hierarchical navigation, with dot-segments resolved as part of the process.[39][49]
In programming implementations, high-level APIs facilitate parsing and resolution while adhering to these standards. The JavaScript URL API, part of the Web API, allows construction via the URL constructor: new URL("https://example.com/path?query=value#frag") parses the string into an object with properties like pathname ("/path"), search ("?query=value"), and hash ("#frag"), enabling read/write access. For resolution, new URL("./relative", "http://base.com/dir/") produces "http://base.com/dir/relative" after path normalization. Invalid inputs throw a TypeError.[50]
Similarly, Python's urllib.parse module provides urlparse for decomposition: urlparse("http://example.com:80/path?query#frag") returns a ParseResult named tuple with scheme='http', netloc='example.com:80', path='/path', query='query', and fragment='frag'. Its port attribute returns 80 here because the port is written explicitly; when the URL omits the port, the attribute is None, since urlparse does not fill in scheme defaults. Resolution uses urljoin("http://base.com/dir/", "./relative"), yielding "http://base.com/dir/relative" by combining and normalizing paths per RFC 3986. urlparse leaves percent-encoded sequences untouched (unquote decodes them on request), and a ValueError is raised for malformed input such as an out-of-range port or an unbalanced IPv6 bracket.[51][52]
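The Python calls discussed above can be run directly. The sketch below only uses functions from the standard urllib.parse module; the variable names are illustrative.

```python
from urllib.parse import unquote, urljoin, urlparse

parts = urlparse("http://example.com:80/path?query#frag")
# parts.scheme, parts.netloc, parts.path, parts.query and parts.fragment hold
# the raw components; parts.hostname and parts.port are derived from netloc.

# The port attribute is None when the URL does not spell the port out.
implicit_port = urlparse("http://example.com/path").port

# A leading "./" keeps the reference inside the base directory.
resolved = urljoin("http://base.com/dir/", "./relative")

# Components are returned still percent-encoded; decoding is a separate step.
raw_query = urlparse("https://example.com/?q=test%20value#section").query
decoded_query = unquote(raw_query)
```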
Edge cases in parsing and resolution require careful handling to maintain robustness. Invalid URLs, such as those with no scheme and no base URL or with malformed hosts (e.g., https://[invalid]), result in parsing failure per the WHATWG algorithm, often throwing exceptions in APIs like JavaScript's TypeError or Python's ValueError. Default ports are implied by the scheme (80 for HTTP and 443 for HTTPS) unless a port is explicitly specified, allowing omission in the string (e.g., http://example.com connects on port 80). IPv6 literals must be enclosed in square brackets, as in https://[::1]:8080/, with the host parser validating bracket matching and rejecting unpaired ones; Python's urllib.parse supports this since version 3.2, extracting the address correctly from netloc.[45][53][54]
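These edge cases can be exercised with a small helper built on urllib.parse.urlsplit. This is a sketch under stated assumptions: the helper name host_and_port and the DEFAULT_PORTS table are invented for the example, and returning None for malformed input is a design choice rather than standard behaviour.

```python
from urllib.parse import urlsplit

# Scheme defaults, applied only when the URL does not give a port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return (hostname, port) for a URL, or None when it is malformed."""
    try:
        parts = urlsplit(url)   # raises ValueError for an unbalanced "["
        port = parts.port       # raises ValueError for an invalid port
    except ValueError:
        return None
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port


loopback = host_and_port("https://[::1]:8080/")
```

The hostname attribute strips the square brackets from an IPv6 literal, so loopback holds the bare address "::1" together with port 8080.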
Security and Best Practices
URLs present several security risks when not handled properly, particularly in web applications where user input can influence navigation or content rendering. One common threat is open redirects, where attackers manipulate redirect parameters to send users to malicious sites, often facilitating phishing by mimicking legitimate domains. For instance, an unvalidated redirect URL like https://example.com/redirect?url=http://malicious-site.com can bypass filters if the application fails to verify the target domain against a whitelist. Cross-site scripting (XSS) attacks can also exploit URL components, such as unescaped query parameters or fragments; if a fragment like #<script>alert('xss')</script> is reflected into the page without sanitization, it may execute malicious JavaScript in the browser context, especially in DOM-based scenarios where client-side code processes the URL. Additionally, Internationalized Resource Identifiers (IRIs) and protocol-relative URLs can serve as vectors for attacks if not normalized, potentially enabling homograph spoofs or unintended scheme assumptions. Internationalized domain name (IDN) homograph attacks further compound these issues by using visually similar Unicode characters to impersonate trusted sites, tricking users into visiting fraudulent domains like xn--pple-43d.com (appearing as "apple.com").
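The homograph example can be inspected with Python's built-in idna codec and the unicodedata module. The sketch below uses a rough heuristic that is an assumption of this example: it treats the first word of each character's Unicode name as its script, which is far simpler than the confusable-detection rules real browsers apply. The function names are hypothetical.

```python
import unicodedata


def decode_host(host):
    """Convert an ASCII (Punycode) hostname to its Unicode display form."""
    return host.encode("ascii").decode("idna")


def label_scripts(label):
    """Approximate the scripts in a label from Unicode character names."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def is_mixed_script(host):
    """Flag hostnames in which a single label mixes several scripts."""
    return any(len(label_scripts(label)) > 1
               for label in decode_host(host).split("."))


display_form = decode_host("xn--pple-43d.com")  # first letter is Cyrillic U+0430
suspicious = is_mixed_script("xn--pple-43d.com")
```

A check of this kind would flag the spoofed name because its first label combines a Cyrillic letter with Latin ones, while an all-Latin name such as example.com passes.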
To mitigate these threats, robust URL validation is essential, starting with whitelisting allowed schemes such as http, https, and mailto to prevent execution of dangerous protocols like javascript: or data:. Percent-decoding should occur only after complete parsing to avoid double-decoding vulnerabilities, where attackers encode payloads twice (e.g., %253cscript%253e decoding to <script>) to evade filters; libraries adhering to RFC 3986 ensure safe handling by decoding in context. Enforcing HTTPS for all resources is a critical best practice, redirecting HTTP requests to secure equivalents and leveraging browser features like HTTP Strict Transport Security (HSTS) to prevent downgrade attacks. Modern browsers have increasingly adopted secure-by-default policies post-2020, with Chrome planning to enable "HTTPS-First Mode" by default for public sites starting in October 2026 (with Chrome versions released then). As of November 2025, it is enabled by default in Incognito mode since Chrome 127 (2024) and remains opt-in for regular browsing, automatically upgrading insecure connections where possible.
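A redirect check along these lines might look like the following sketch. The TRUSTED_HOSTS set and the function name is_safe_redirect are hypothetical, and Python's urlsplit does not parse every input the way a browser does, so a production check would need additional hardening.

```python
from urllib.parse import unquote, urlsplit

ALLOWED_REDIRECT_SCHEMES = {"http", "https"}
TRUSTED_HOSTS = {"example.com", "www.example.com"}  # hypothetical allowlist


def is_safe_redirect(target):
    """Accept only absolute http(s) URLs that point at an allowlisted host."""
    try:
        parts = urlsplit(target)
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_REDIRECT_SCHEMES:
        return False  # rejects javascript:, data:, and scheme-less input
    return parts.hostname in TRUSTED_HOSTS


# Decoding exactly once leaves a double-encoded payload inert; a second
# decoding pass is what would turn it into markup.
decoded_once = unquote("%253cscript%253e")
decoded_twice = unquote(decoded_once)
```

Comparing the parsed hostname, rather than searching the raw string for a trusted name, is what defeats inputs such as https://example.com@malicious-site.com/.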
Sanitization techniques further strengthen defenses by avoiding the deprecated "user:password" format in the userinfo subcomponent (e.g., https://user:password@example.com), which can expose credentials in logs or referrals. Per RFC 3986, this format is deprecated for security reasons, and modern implementations typically do not support or use userinfo for authentication. Canonicalization normalizes URLs to prevent bypasses, such as converting %u003c (a non-standard %u-style encoding of <) to its standard form and resolving equivalent representations like multiple slashes (///) to a single one, reducing ambiguity exploited in server-side request forgery (SSRF). The OWASP Application Security Verification Standard (ASVS) recommends these practices, emphasizing context-aware encoding for dynamic URL construction and regular audits for parser inconsistencies across components.
Modern Applications
URLs in APIs and Web Services
In RESTful APIs, URLs function as the primary means of identifying and accessing resources, serving as endpoints that encapsulate the API's structure and enable standardized interactions. According to the REST architectural style outlined by Roy Fielding, resources are named using uniform resource identifiers (URIs), such as URLs, to maintain a stateless, cacheable interface where HTTP methods like GET, POST, PUT, and DELETE operate on specific paths. For instance, a URL like https://api.example.com/users/{id} represents a unique user resource, with {id} as a path parameter that allows precise targeting without embedding state in the URI itself. This approach promotes scalability by decoupling clients from server implementations, relying on the URL's hierarchical structure to reflect resource relationships.[55]
Query parameters further enhance URL expressiveness in web services, allowing dynamic modification of requests for tasks like pagination, sorting, and filtering without altering the core endpoint. Common examples include ?page=2 to retrieve the second page of results in a paginated list or ?category=tech&sort=desc to filter and order items by technology category in descending order. The OpenAPI Specification standardizes the documentation of these parameters, defining them with attributes like type, default, and enum to specify valid values, ensuring interoperability across tools and clients. For arrays or objects in queries, serialization styles such as form or spaceDelimited handle complex data, as seen in filtering operations that pass structured criteria like ?filter[status]=active. This practice, rooted in HTTP conventions, optimizes data retrieval efficiency in large-scale services.[56]
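Client code often assembles such URLs from a path template and a dictionary of query parameters. The sketch below shows one way to do this with urllib.parse; API_ROOT, user_url, collection_url, and the "articles" resource are hypothetical names chosen for illustration.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.example.com"  # hypothetical service root


def user_url(user_id):
    """Fill the {id} path parameter, encoding it as a single path segment."""
    return f"{API_ROOT}/users/{quote(str(user_id), safe='')}"


def collection_url(resource, params=None):
    """Append pagination, sorting, or filter parameters as a query string."""
    url = f"{API_ROOT}/{resource}"
    query = urlencode(params or {})
    return f"{url}?{query}" if query else url


listing = collection_url("articles", {"category": "tech", "sort": "desc", "page": 2})
filtered = collection_url("articles", {"filter[status]": "active"})
```

Note that urlencode percent-encodes the square brackets in filter[status], which servers decode before interpreting the parameter name.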
In Web3 and decentralized architectures, URLs extend traditional schemes to support content-addressed and blockchain-integrated identifiers, facilitating peer-to-peer interactions. The InterPlanetary File System (IPFS) employs the ipfs:// scheme followed by a Content Identifier (CID), such as ipfs://QmPK1s3pNYLiq9ERiq3BDxKa4XosgWwFRQUydHUtz4YgpqB, to reference immutable files distributed across nodes, verified via cryptographic hashes like SHA-256. Complementing this, the Ethereum Name Service (ENS) maps human-readable names like vitalik.eth to Ethereum addresses or content hashes, enabling URL resolution for decentralized applications (dApps); for example, vitalik.eth can link to an IPFS-hosted site accessible via gateways like vitalik.eth.limo. These mechanisms integrate blockchain identifiers into URL patterns, allowing seamless navigation in ecosystems where central authority is absent.[57][58]
Microservices architectures leverage URL routing to direct traffic across distributed services, with load balancers distributing requests based on path patterns to ensure high availability and fault tolerance. In setups like those using Google Cloud Load Balancing, URL maps route requests—such as /orders to an order-processing service—while applying rules for host, path, and headers to balance load via methods like round-robin or weighted distribution. Post-2015, the evolution of serverless computing has amplified this through platforms like AWS API Gateway, launched in 2015, which dynamically routes URLs to Lambda functions, handling HTTP endpoints with features like throttling and authorization for event-driven, scalable APIs without infrastructure management. This shift has enabled microservices to operate in fully managed environments, where URL patterns trigger serverless executions across global edges.[59]
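Path-based routing of this kind can be illustrated with a longest-prefix lookup. This is a conceptual sketch only: the URL_MAP table, the service names, and the route function are invented for the example and do not reflect the configuration format of any particular load balancer.

```python
from urllib.parse import urlsplit

# Hypothetical URL map: path prefix -> backend service name.
URL_MAP = {
    "/orders": "order-service",
    "/orders/returns": "returns-service",
    "/users": "user-service",
}
DEFAULT_BACKEND = "default-service"


def route(url):
    """Pick the backend whose prefix is the longest match for the URL path."""
    path = urlsplit(url).path
    best = ""
    for prefix in URL_MAP:
        # Match whole segments only, so "/ordersx" does not match "/orders".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return URL_MAP.get(best, DEFAULT_BACKEND)


backend = route("https://shop.example.com/orders/42")
```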
