urlfilter Module

URL filters.

class wpull.urlfilter.BackwardDomainFilter(accepted=None, rejected=None)

Bases: wpull.urlfilter.BaseURLFilter

Return whether the hostname matches a list of hostname suffixes.

classmethod match(domain_list, test_domain)
test(url_info, url_table_record)
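
The suffix matching described above can be illustrated with a small standalone sketch. It does not reproduce wpull's implementation: the function name `match_domain_suffix`, the case folding, and the rule that a suffix must align with a label boundary (so `badexample.com` does not match `example.com`) are all assumptions made for illustration.

```python
def match_domain_suffix(domain_list, test_domain):
    """Return True if test_domain equals, or is a subdomain of, any listed suffix.

    Hypothetical helper; not part of wpull.
    """
    test_domain = test_domain.lower()
    for suffix in domain_list:
        # Normalise: ignore case and any leading dot on the suffix.
        bare = suffix.lower().lstrip(".")
        if not bare:
            continue
        # Match the exact domain or a subdomain ending in ".<suffix>".
        if test_domain == bare or test_domain.endswith("." + bare):
            return True
    return False


print(match_domain_suffix(["example.com"], "www.example.com"))
print(match_domain_suffix(["example.com"], "badexample.com"))
```

A plain `str.endswith` check without the dot would be simpler, but it would also accept unrelated hosts that merely share trailing characters.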

class wpull.urlfilter.BackwardFilenameFilter(accepted=None, rejected=None)

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that match the filename suffixes.

classmethod match(suffix_list, test_filename)
test(url_info, url_table_record)

class wpull.urlfilter.BaseURLFilter

Bases: object

Base class for URL filters.

The Processor uses filters to determine whether a URL should be downloaded.

test(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord) → bool

Return whether the URL should be downloaded.

Parameters:
  • url_info – URL to be tested.
  • url_record – Fetch metadata about the URL.
Returns: True if the filter passed and the URL should be downloaded.
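
A minimal sketch of a custom filter that follows the test() contract is shown here. To stay runnable without wpull installed, it defines its own stand-ins: `FakeURLInfo` and the local `BaseURLFilter` are assumptions that only mimic the shape of the real classes, and `NoQueryStringFilter` is a hypothetical filter invented for illustration.

```python
from collections import namedtuple

# Stand-in for wpull.url.URLInfo (assumption: only the fields used here).
FakeURLInfo = namedtuple("FakeURLInfo", ["url", "hostname"])


class BaseURLFilter:
    """Local stand-in for the base class, so the sketch is self-contained."""

    def test(self, url_info, url_record) -> bool:
        raise NotImplementedError


class NoQueryStringFilter(BaseURLFilter):
    """Hypothetical filter: pass only URLs that carry no query string."""

    def test(self, url_info, url_record) -> bool:
        # True means the filter passed and the URL may be downloaded.
        return "?" not in url_info.url


url_filter = NoQueryStringFilter()
plain = FakeURLInfo("http://example.com/page", "example.com")
query = FakeURLInfo("http://example.com/page?id=1", "example.com")
print(url_filter.test(plain, None), url_filter.test(query, None))
```

With the real library, the subclass would presumably derive from wpull.urlfilter.BaseURLFilter and receive actual URLInfo and URLRecord objects.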

class wpull.urlfilter.DemuxURLFilter(url_filters: typing.Iterator)

Bases: wpull.urlfilter.BaseURLFilter

Combines multiple URL filters into one.

test(url_info, url_table_record)
test_info(url_info, url_table_record) → dict

Return information about which filters passed or failed.

Returns: A dict containing the keys:
  • verdict (bool): Whether all the tests passed.
  • passed (set): A set of URLFilters that passed.
  • failed (set): A set of URLFilters that failed.
  • map (dict): A mapping from URLFilter class name (str) to the verdict (bool).
Return type: dict

url_filters
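
The shape of the test_info() result can be sketched with a standalone combinator. This is an illustration under assumptions, not wpull's code: `AllOfFilter`, `AlwaysPass`, and `AlwaysFail` are hypothetical names, and the sketch assumes every filter is evaluated even after one has failed.

```python
class AlwaysPass:
    """Hypothetical filter that accepts every URL."""

    def test(self, url_info, url_record):
        return True


class AlwaysFail:
    """Hypothetical filter that rejects every URL."""

    def test(self, url_info, url_record):
        return False


class AllOfFilter:
    """Hypothetical combinator mirroring the documented result keys."""

    def __init__(self, url_filters):
        self.url_filters = list(url_filters)

    def test_info(self, url_info, url_record):
        passed, failed, verdict_map = set(), set(), {}
        for url_filter in self.url_filters:
            result = bool(url_filter.test(url_info, url_record))
            (passed if result else failed).add(url_filter)
            # Keyed by class name, as in the documented "map" entry.
            verdict_map[type(url_filter).__name__] = result
        return {
            "verdict": not failed,  # True only if no filter failed
            "passed": passed,
            "failed": failed,
            "map": verdict_map,
        }

    def test(self, url_info, url_record):
        return self.test_info(url_info, url_record)["verdict"]


combined = AllOfFilter([AlwaysPass(), AlwaysFail()])
info = combined.test_info("http://example.com/", None)
print(info["verdict"], info["map"])
```

Because the map is keyed by class name, two filters of the same class would overwrite each other's entry in this sketch.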

class wpull.urlfilter.DirectoryFilter(accepted=None, rejected=None)

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that match a directory path part.

test(url_info, url_table_record)

class wpull.urlfilter.FollowFTPFilter(follow=False)

Bases: wpull.urlfilter.BaseURLFilter

Follow links to FTP URLs.

test(url_info, url_table_record)

class wpull.urlfilter.HTTPSOnlyFilter

Bases: wpull.urlfilter.BaseURLFilter

Allow the URL only if it is HTTPS.

test(url_info, url_table_record)

class wpull.urlfilter.HostnameFilter(accepted=None, rejected=None)

Bases: wpull.urlfilter.BaseURLFilter

Return whether the hostname matches exactly in a list.

test(url_info, url_table_record)

class wpull.urlfilter.LevelFilter(max_depth, inline_max_depth=5)

Bases: wpull.urlfilter.BaseURLFilter

Allow URLs up to a level of recursion.

test(url_info, url_table_record)
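
A depth limit of this kind can be sketched as follows. The sketch rests on assumptions that the reference does not confirm: the record is taken to expose a `level` attribute, a falsy max_depth is treated as "no limit", and inline_max_depth is not modelled at all. `DepthLimitFilter` and `FakeRecord` are hypothetical names.

```python
from collections import namedtuple

# Stand-in for the URL record (assumption: it carries a recursion level).
FakeRecord = namedtuple("FakeRecord", ["level"])


class DepthLimitFilter:
    """Hypothetical filter: pass URLs whose level is within max_depth."""

    def __init__(self, max_depth):
        self.max_depth = max_depth

    def test(self, url_info, url_record):
        # Assumption: 0 or None disables the limit entirely.
        if not self.max_depth:
            return True
        return url_record.level <= self.max_depth


depth_filter = DepthLimitFilter(max_depth=2)
print(depth_filter.test(None, FakeRecord(level=2)))
print(depth_filter.test(None, FakeRecord(level=3)))
```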

class wpull.urlfilter.ParentFilter

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that ascend to parent paths.

test(url_info, url_table_record)

class wpull.urlfilter.RecursiveFilter(enabled=False, page_requisites=False)

Bases: wpull.urlfilter.BaseURLFilter

Return True if recursion is used.

test(url_info, url_table_record)

class wpull.urlfilter.RegexFilter(accepted=None, rejected=None)

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that match a regular expression.

test(url_info, url_table_record)
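
One plausible reading of the accepted/rejected pair is sketched here using Python's re module. The semantics are an assumption: a URL passes if it matches the accepted pattern (when one is given) and does not match the rejected pattern (when one is given). `AcceptRejectRegexFilter` is a hypothetical class that takes a plain URL string rather than wpull's URL objects.

```python
import re


class AcceptRejectRegexFilter:
    """Hypothetical accept/reject filter over plain URL strings."""

    def __init__(self, accepted=None, rejected=None):
        self.accepted = re.compile(accepted) if accepted else None
        self.rejected = re.compile(rejected) if rejected else None

    def test(self, url):
        # Fail when an accept pattern exists and the URL does not match it.
        if self.accepted and not self.accepted.search(url):
            return False
        # Fail when a reject pattern exists and the URL matches it.
        if self.rejected and self.rejected.search(url):
            return False
        return True


regex_filter = AcceptRejectRegexFilter(accepted=r"/blog/", rejected=r"\.(jpg|png)$")
print(regex_filter.test("http://example.com/blog/post"))
print(regex_filter.test("http://example.com/blog/pic.png"))
```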

class wpull.urlfilter.SchemeFilter(allowed=('http', 'https', 'ftp'))

Bases: wpull.urlfilter.BaseURLFilter

Allow the URL if its scheme is in the allowed list.

test(url_info, url_table_record)
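
The scheme check can be sketched with the standard library alone. `scheme_allowed` is a hypothetical helper working on plain URL strings, not wpull's method; only the default tuple of allowed schemes is taken from the signature above.

```python
from urllib.parse import urlsplit


def scheme_allowed(url, allowed=("http", "https", "ftp")):
    """Return True if the URL's scheme is one of the allowed schemes."""
    # urlsplit() returns the scheme in lower case.
    return urlsplit(url).scheme in allowed


print(scheme_allowed("https://example.com/"))
print(scheme_allowed("mailto:someone@example.com"))
```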

class wpull.urlfilter.SpanHostsFilter(hostnames, enabled=False, page_requisites=False, linked_pages=False)

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that go to other hostnames.

test(url_info, url_table_record)

class wpull.urlfilter.TriesFilter(max_tries)

Bases: wpull.urlfilter.BaseURLFilter

Allow URLs that have been attempted up to a limit of tries.

test(url_info, url_table_record)
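
A retry limit of this kind might look like the following sketch. It assumes the record exposes a `try_count` attribute, that a URL passes while its count is strictly below max_tries, and that a falsy max_tries means unlimited retries; none of these details are stated in this reference. `RetryLimitFilter` and `FakeRecord` are hypothetical names.

```python
from collections import namedtuple

# Stand-in for the URL record (assumption: it tracks attempts so far).
FakeRecord = namedtuple("FakeRecord", ["try_count"])


class RetryLimitFilter:
    """Hypothetical filter: pass URLs that still have attempts left."""

    def __init__(self, max_tries):
        self.max_tries = max_tries

    def test(self, url_info, url_record):
        # Assumption: 0 or None means retries are unlimited.
        if not self.max_tries:
            return True
        return url_record.try_count < self.max_tries


retry_filter = RetryLimitFilter(max_tries=3)
print(retry_filter.test(None, FakeRecord(try_count=2)))
print(retry_filter.test(None, FakeRecord(try_count=3)))
```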