uritools — URI parsing, classification and composition

This module provides RFC 3986 compliant functions for parsing, classifying and composing URIs and URI references. It is intended to replace most of the Python Standard Library’s urllib.parse module.

>>> from uritools import uricompose, urijoin, urisplit, uriunsplit
>>> uricompose(scheme='foo', host='example.com', port=8042,
...            path='/over/there', query={'name': 'ferret'},
...            fragment='nose')
'foo://example.com:8042/over/there?name=ferret#nose'
>>> parts = urisplit(_)
>>> parts.scheme
'foo'
>>> parts.authority
'example.com:8042'
>>> parts.getport(default=80)
8042
>>> parts.getquerydict().get('name')
['ferret']
>>> parts.isuri()
True
>>> parts.isabsuri()
False
>>> urijoin(uriunsplit(parts), '/right/here?name=swallow#beak')
'foo://example.com:8042/right/here?name=swallow#beak'

For various reasons, urllib.parse and its Python 2 predecessor urlparse are not compliant with current Internet standards. As stated in Lib/urllib/parse.py:

RFC 3986 is considered the current standard and any future changes to urlparse module should conform with it. The urlparse module is currently not entirely compliant with this RFC due to defacto scenarios for parsing, and for backward compatibility purposes, some parsing quirks from older RFCs are retained.

This module aims to provide fully RFC 3986 compliant replacements for the most commonly used functions found in urllib.parse. It also includes functions for distinguishing between the different forms of URIs and URI references, and for conveniently creating URIs from their individual components.

See also

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

The current Internet standard (STD66) defining URI syntax, to which any changes to uritools should conform. If deviations are observed, the module’s implementation should be changed, even if this means breaking backward compatibility.

URI Classification

According to RFC 3986, a URI reference is either a URI or a relative reference. If the URI reference’s prefix does not match the syntax of a scheme followed by its colon separator, then the URI reference is a relative reference.

A relative reference that begins with two slash characters is termed a network-path reference. A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.

When a URI reference refers to a URI that is, aside from its fragment component, identical to the base URI, that reference is called a same-document reference. Examples of same-document references are relative references that are empty or include only the number sign (“#”) separator followed by a fragment identifier.

A URI without a fragment identifier is termed an absolute URI. A base URI, for example, must be an absolute URI. If the base URI is obtained from a URI reference, then that reference must be stripped of any fragment component prior to its use as a base URI.
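The distinctions above can be sketched with nothing but the standard library. The following is an illustration only, not the implementation used by uritools: it relies on the non-validating regular expression from RFC 3986, Appendix B, and the helper name classify() is hypothetical. Because the categories overlap (an empty reference is also a relative-path reference), the sketch reports the most specific form.

```python
import re

# Generic, non-validating URI-reference regex from RFC 3986, Appendix B.
# It does not check that the scheme starts with a letter, for example.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def classify(uriref):
    """Return the RFC 3986 form of a URI reference (hypothetical helper)."""
    match = URI_REFERENCE.match(uriref)
    scheme, authority, path = match.group(2), match.group(4), match.group(5)
    query, fragment = match.group(7), match.group(9)
    if scheme is not None:
        # A URI without a fragment identifier is an absolute URI.
        return "absolute URI" if fragment is None else "URI"
    if authority is not None:
        # Relative reference starting with two slashes.
        return "network-path reference"
    if path.startswith("/"):
        return "absolute-path reference"
    if not path and query is None:
        # Empty, or only "#" followed by a fragment identifier.
        return "same-document reference"
    return "relative-path reference"


for ref in ("foo://example.com/over/there#nose", "//example.com/x",
            "/right/here", "right/here", "#beak"):
    print(repr(ref), "->", classify(ref))
```

A stricter classifier would also validate the scheme syntax before treating a prefix as a scheme.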

URI Composition

URI Decomposition

URI Encoding

Structured Parse Results

The result objects from the uridefrag() and urisplit() functions are instances of subclasses of named tuple types created with collections.namedtuple(). In addition to the attributes described in the function documentation, these objects provide some convenience methods.
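The general pattern of a named tuple subclass with extra methods can be sketched as follows. This is a simplified illustration, not the actual uritools classes: the name SplitResultSketch is hypothetical, and its getport() only mimics the behaviour shown in the introductory example, without handling every authority form.

```python
from collections import namedtuple

_Fields = namedtuple("_Fields", "scheme authority path query fragment")


class SplitResultSketch(_Fields):
    """Hypothetical result type: tuple fields plus a convenience method."""

    __slots__ = ()  # keep instances as lightweight as plain tuples

    def getport(self, default=None):
        # Use the digits after the last colon of the authority, if present.
        if self.authority is None:
            return default
        _, sep, port = self.authority.rpartition(":")
        if sep and port.isdigit():
            return int(port)
        return default


parts = SplitResultSketch("foo", "example.com:8042", "/over/there",
                          "name=ferret", "nose")
print(parts.scheme)               # attribute access
print(parts[1])                   # index access, as with any tuple
print(parts.getport(default=80))  # convenience method
```

Since the result is still a tuple, it can be unpacked or indexed as usual, while the methods keep component-specific logic close to the data.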

Character Constants

uritools.GEN_DELIMS

A string containing all general delimiting characters specified in RFC 3986.

uritools.RESERVED

A string containing all reserved characters specified in RFC 3986.

uritools.SUB_DELIMS

A string containing all subcomponent delimiting characters specified in RFC 3986.

uritools.UNRESERVED

A string containing all unreserved characters specified in RFC 3986.
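For orientation, the character sets defined in RFC 3986, section 2, can be written out as plain strings. The sketch below uses hypothetical names with a _SKETCH suffix to make clear that these are not the module's own constants; the ordering of characters within the actual uritools strings is not assumed here.

```python
import string

# Character sets as defined in RFC 3986, sections 2.2 and 2.3.
GEN_DELIMS_SKETCH = ":/?#[]@"
SUB_DELIMS_SKETCH = "!$&'()*+,;="
RESERVED_SKETCH = GEN_DELIMS_SKETCH + SUB_DELIMS_SKETCH
UNRESERVED_SKETCH = string.ascii_letters + string.digits + "-._~"


def char_class(char):
    """Name the RFC 3986 character class of a single character."""
    if char in GEN_DELIMS_SKETCH:
        return "gen-delim"
    if char in SUB_DELIMS_SKETCH:
        return "sub-delim"
    if char in UNRESERVED_SKETCH:
        return "unreserved"
    return "other"  # e.g. a space, which cannot appear literally in a URI


for ch in "/&~ ":
    print(repr(ch), char_class(ch))
```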