urlfilter Module
URL filters.
-
class wpull.urlfilter.BackwardDomainFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Return whether the hostname matches a list of hostname suffixes.
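Suffix matching on hostnames can be sketched as follows. This is a minimal, self-contained illustration, not the wpull implementation: the function names and the exact way `accepted` and `rejected` combine are assumptions made for the example.

```python
def hostname_matches_suffix(hostname, suffixes):
    """Return True if the hostname ends with any of the given suffixes."""
    hostname = hostname.lower()
    return any(hostname.endswith(suffix.lower()) for suffix in suffixes)


def backward_domain_verdict(hostname, accepted=None, rejected=None):
    """Hypothetical helper combining accepted and rejected suffix lists.

    Assumption: when `accepted` is given the hostname must match it, and
    when `rejected` is given the hostname must not match it.
    """
    if accepted and not hostname_matches_suffix(hostname, accepted):
        return False
    if rejected and hostname_matches_suffix(hostname, rejected):
        return False
    return True
```

Note that plain string suffix matching also accepts a hostname such as `notexample.com` for the suffix `example.com`. Whether the library guards against that case is not stated in this reference.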
-
class wpull.urlfilter.BackwardFilenameFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that match the filename suffixes.
-
class wpull.urlfilter.BaseURLFilter
Bases: object
Base class for URL filters.
The Processor uses filters to determine whether a URL should be downloaded.
-
class wpull.urlfilter.DemuxURLFilter(url_filters: typing.Iterator)
Bases: wpull.urlfilter.BaseURLFilter
Combines multiple URL filters into one.
-
test_info(url_info, url_table_record) → dict
Returns info about which filters passed or failed.
Returns: A dict containing the keys:
verdict (bool): Whether all the tests passed.
passed (set): A set of URLFilters that passed.
failed (set): A set of URLFilters that failed.
map (dict): A mapping from URLFilter class name (str) to the verdict (bool).
Return type: dict
-
url_filters
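The shape of the dict described for test_info can be illustrated with a small stand-alone sketch. The filter classes, their single-argument `test` method, and the `collect_filter_info` function below are hypothetical stand-ins written for this example. They do not use the wpull classes or the `url_info` and `url_table_record` arguments.

```python
class SketchFilter:
    """Hypothetical stand-in for a URL filter; not the wpull base class."""

    def test(self, url):
        raise NotImplementedError


class HTTPSOnly(SketchFilter):
    def test(self, url):
        return url.startswith('https://')


class NoQuery(SketchFilter):
    def test(self, url):
        return '?' not in url


def collect_filter_info(url_filters, url):
    """Run every filter and report which ones passed or failed."""
    passed = set()
    failed = set()
    verdict_map = {}

    for url_filter in url_filters:
        result = bool(url_filter.test(url))
        (passed if result else failed).add(url_filter)
        # Keyed by class name, mirroring the documented "map" entry.
        verdict_map[type(url_filter).__name__] = result

    return {
        'verdict': not failed,  # True only when no filter failed
        'passed': passed,
        'failed': failed,
        'map': verdict_map,
    }
```

Because the map is keyed by class name, two filters of the same class would share one entry in this sketch. The `passed` and `failed` sets hold the filter instances themselves.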
-
-
class wpull.urlfilter.DirectoryFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that match a directory path part.
-
class wpull.urlfilter.FollowFTPFilter(follow=False)
Bases: wpull.urlfilter.BaseURLFilter
Follow links to FTP URLs.
-
class wpull.urlfilter.HTTPSOnlyFilter
Bases: wpull.urlfilter.BaseURLFilter
Allow a URL only if it uses HTTPS.
-
class wpull.urlfilter.HostnameFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Return whether the hostname exactly matches an entry in a list.
-
class wpull.urlfilter.LevelFilter(max_depth, inline_max_depth=5)
Bases: wpull.urlfilter.BaseURLFilter
Allow URLs up to a level of recursion.
-
class wpull.urlfilter.ParentFilter
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that ascend into parent paths.
-
class wpull.urlfilter.RecursiveFilter(enabled=False, page_requisites=False)
Bases: wpull.urlfilter.BaseURLFilter
Return True if recursion is used.
-
class wpull.urlfilter.RegexFilter(accepted=None, rejected=None)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that match a regular expression.
-
class wpull.urlfilter.SchemeFilter(allowed=('http', 'https', 'ftp'))
Bases: wpull.urlfilter.BaseURLFilter
Allow a URL if its scheme is in the allowed list.
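A scheme check against an allowed tuple can be sketched with the standard library. The `scheme_allowed` function is a hypothetical helper for illustration. It reuses the default tuple from the signature above but is not the wpull implementation.

```python
from urllib.parse import urlsplit


def scheme_allowed(url, allowed=('http', 'https', 'ftp')):
    """Return True if the URL's scheme is one of the allowed schemes."""
    # urlsplit lowercases the scheme, so 'HTTP://...' compares as 'http'.
    return urlsplit(url).scheme in allowed
```

Passing a narrower tuple, such as `('http', 'https')`, would reject FTP URLs in this sketch.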
-
class wpull.urlfilter.SpanHostsFilter(hostnames, enabled=False, page_requisites=False, linked_pages=False)
Bases: wpull.urlfilter.BaseURLFilter
Filter URLs that go to other hostnames.
-
class wpull.urlfilter.TriesFilter(max_tries)
Bases: wpull.urlfilter.BaseURLFilter
Allow URLs that have been attempted up to a limit of tries.