urlfilter Module
URL filters.
-
class wpull.urlfilter.BackwardDomainFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Return whether the hostname matches any entry in a list of hostname suffixes.
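Suffix matching on hostnames can be illustrated without wpull itself. The following is a minimal sketch, not the library's implementation: the helper name `passes_suffix_filter` is hypothetical, and the way the `accepted` and `rejected` lists combine is an assumption.

```python
def passes_suffix_filter(hostname, accepted=None, rejected=None):
    """Hypothetical helper: suffix-based accept/reject check on a hostname."""
    hostname = hostname.lower()

    def matches(suffixes):
        # True if the hostname ends with any of the given suffixes.
        return any(hostname.endswith(s.lower()) for s in suffixes)

    # Assumption: when an accepted list is given, the hostname must match it.
    if accepted and not matches(accepted):
        return False
    # Assumption: when a rejected list is given, the hostname must not match it.
    if rejected and matches(rejected):
        return False
    return True


print(passes_suffix_filter("www.example.com", accepted=["example.com"]))
print(passes_suffix_filter("www.example.org", accepted=["example.com"]))
```

A plain `str.endswith` check also accepts `notexample.com` for the suffix `example.com`; whether the real filter matches on label boundaries is not stated here.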
-
class wpull.urlfilter.BackwardFilenameFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that match the filename suffixes.
-
class wpull.urlfilter.BaseURLFilter
Bases: object
Base class for URL filters.
The Processor uses filters to determine whether a URL should be downloaded.
-
class wpull.urlfilter.DemuxURLFilter(url_filters: typing.Iterator)
Bases: wpull.urlfilter.BaseURLFilter
Combines multiple URL filters into one.
-
test_info(url_info, url_table_record) → dict
Returns info about which filters passed or failed.
Returns: A dict containing the keys:
  verdict (bool): Whether all the tests passed.
  passed (set): A set of URLFilters that passed.
  failed (set): A set of URLFilters that failed.
  map (dict): A mapping from URLFilter class name (str) to the verdict (bool).
Return type: dict
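The shape of the returned dict can be sketched with stand-in filters. This is an illustration under stated assumptions, not wpull's code: the classes `AlwaysPass` and `HttpsOnly`, their `test(url)` method, and the function `collect_info` are all hypothetical, and a plain URL string replaces the `url_info` and `url_table_record` arguments.

```python
class AlwaysPass:
    """Hypothetical stand-in filter that accepts every URL."""
    def test(self, url):
        return True


class HttpsOnly:
    """Hypothetical stand-in filter that accepts only https:// URLs."""
    def test(self, url):
        return url.startswith("https://")


def collect_info(url_filters, url):
    """Run every filter and summarize the results in one dict."""
    passed, failed, verdict_map = set(), set(), {}
    for url_filter in url_filters:
        ok = bool(url_filter.test(url))
        (passed if ok else failed).add(url_filter)
        # Key the per-filter verdict by the filter's class name.
        verdict_map[type(url_filter).__name__] = ok
    return {
        "verdict": not failed,  # True only when no filter failed
        "passed": passed,
        "failed": failed,
        "map": verdict_map,
    }


info = collect_info([AlwaysPass(), HttpsOnly()], "http://example.com/")
print(info["verdict"], info["map"])
```

Because `map` is keyed by class name, two filters of the same class would share one entry in this sketch.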
-
url_filters
-
class wpull.urlfilter.DirectoryFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that match a directory path part.
-
class wpull.urlfilter.FollowFTPFilter(follow=False)
Bases: wpull.urlfilter.BaseURLFilter
Follow links to FTP URLs.
-
class wpull.urlfilter.HTTPSOnlyFilter
Bases: wpull.urlfilter.BaseURLFilter
Allow the URL only if its scheme is HTTPS.
-
class wpull.urlfilter.HostnameFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Return whether the hostname exactly matches an entry in a list.
-
class wpull.urlfilter.LevelFilter(max_depth, inline_max_depth=5)
Bases: wpull.urlfilter.BaseURLFilter
Allow URLs up to a level of recursion.
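A depth limit of this kind can be sketched as a simple comparison. This is a sketch under assumptions rather than the library's logic: the function `within_level` and its `level` and `inline_level` arguments are hypothetical, and judging inline resources only against `inline_max_depth` is an assumption.

```python
def within_level(level, max_depth, inline_level=0, inline_max_depth=5):
    """Hypothetical check: is a URL still within the allowed recursion depth?"""
    # Assumption: an inline resource (inline_level > 0) is limited only
    # by inline_max_depth, not by max_depth.
    if inline_level:
        return inline_level <= inline_max_depth
    # Otherwise compare the link depth against max_depth.
    return level <= max_depth


print(within_level(2, max_depth=3))
print(within_level(4, max_depth=3))
```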
-
class wpull.urlfilter.ParentFilter
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that ascend into parent paths.
-
class wpull.urlfilter.RecursiveFilter(enabled=False, page_requisites=False)
Bases: wpull.urlfilter.BaseURLFilter
Return True if recursion is used.
-
class wpull.urlfilter.RegexFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that match a regular expression.
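Accept and reject patterns can be illustrated with the standard `re` module. This is a minimal sketch, not wpull's implementation: the helper `passes_regex_filter` is hypothetical, and both the use of `re.search` (match anywhere in the URL) and the way the two patterns combine are assumptions.

```python
import re


def passes_regex_filter(url, accepted=None, rejected=None):
    """Hypothetical helper: accept/reject a URL by regular expression."""
    # Assumption: when an accepted pattern is given, the URL must match it.
    if accepted and not re.search(accepted, url):
        return False
    # Assumption: when a rejected pattern is given, the URL must not match it.
    if rejected and re.search(rejected, url):
        return False
    return True


print(passes_regex_filter("http://example.com/a.html", accepted=r"\.html$"))
print(passes_regex_filter("http://example.com/private/x", rejected=r"/private/"))
```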
-
class wpull.urlfilter.SchemeFilter(allowed=('http', 'https', 'ftp'))
Bases: wpull.urlfilter.BaseURLFilter
Allow the URL if its scheme is in the allowed list.
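A scheme check of this kind can be sketched with the standard library. The function `scheme_allowed` is hypothetical and stands in for the filter's test; only the default tuple is taken from the signature above.

```python
from urllib.parse import urlsplit


def scheme_allowed(url, allowed=("http", "https", "ftp")):
    """Hypothetical check: is the URL's scheme one of the allowed schemes?"""
    # urlsplit lowercases the scheme, so "HTTPS://..." compares as "https".
    return urlsplit(url).scheme in allowed


print(scheme_allowed("https://example.com/index.html"))
print(scheme_allowed("mailto:someone@example.com"))
```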
-
class wpull.urlfilter.SpanHostsFilter(hostnames, enabled=False, page_requisites=False, linked_pages=False)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that go to other hostnames.
-
class wpull.urlfilter.TriesFilter(max_tries)
Bases: wpull.urlfilter.BaseURLFilter
Allow URLs that have been attempted up to a limit of tries.