document.base Module
Document bases.
-
class wpull.document.base.BaseDocumentDetector
Bases: object
Base class for classes that detect document types.
-
classmethod is_file(file)
Return whether the reader is likely able to read the file.
Parameters: file – A file object containing the document.
Returns: bool
-
classmethod is_request(request)
Return whether the request is likely supported.
Parameters: request (http.request.Request) – An HTTP request.
Returns: bool
-
classmethod is_response(response)
Return whether the response is likely able to be read.
Parameters: response (http.request.Response) – An HTTP response.
Returns: bool
-
classmethod is_supported(file=None, request=None, response=None, url_info=None)
Given the hints, return whether the document is supported.
Parameters:
- file – A file object containing the document.
- request (http.request.Request) – An HTTP request.
- response (http.request.Response) – An HTTP response.
- url_info (url.URLInfo) – A URLInfo.
Returns: If True, the reader should be able to read it.
Return type: bool
-
classmethod is_url(url_info)
Return whether the URL is likely to be supported.
Parameters: url_info (url.URLInfo) – A URLInfo.
Returns: bool
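The hint-based check described for is_supported() can be sketched roughly as below. This is a minimal illustration, not wpull's actual implementation: SketchDetector and PNGDetector are hypothetical names, the order in which hints are tried is an assumption, and a stand-in base class with only two of the hint methods is used so that the snippet runs without wpull installed. A plain string stands in for a URLInfo object.

```python
import io


class SketchDetector:
    """Stand-in mirroring part of the documented interface (hypothetical)."""

    @classmethod
    def is_file(cls, file):
        raise NotImplementedError

    @classmethod
    def is_url(cls, url_info):
        raise NotImplementedError

    @classmethod
    def is_supported(cls, file=None, url_info=None):
        # Try each supplied hint; a hint whose check is not implemented
        # by the subclass is skipped (an assumption of this sketch).
        for hint, check in ((file, cls.is_file), (url_info, cls.is_url)):
            if hint is None:
                continue
            try:
                if check(hint):
                    return True
            except NotImplementedError:
                pass
        return False


class PNGDetector(SketchDetector):
    """Hypothetical detector that recognises PNG data by its signature."""

    SIGNATURE = b'\x89PNG\r\n\x1a\n'

    @classmethod
    def is_file(cls, file):
        # Peek at the first bytes, then restore the file position.
        position = file.tell()
        header = file.read(len(cls.SIGNATURE))
        file.seek(position)
        return header == cls.SIGNATURE


png_like = io.BytesIO(PNGDetector.SIGNATURE + b'rest of file')
text_like = io.BytesIO(b'<html></html>')

supported_png = PNGDetector.is_supported(file=png_like)
supported_text = PNGDetector.is_supported(file=text_like)
# Only a URL hint is given and is_url() is not implemented, so nothing matches.
supported_url_only = PNGDetector.is_supported(url_info='http://example.com/a.png')
```

Restoring the file position after peeking keeps the file usable by whichever reader is chosen afterwards; whether wpull's own detectors do this is not stated in this reference.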
-
class wpull.document.base.BaseExtractiveReader
Bases: object
Base class for document readers that can only extract links.
-
class wpull.document.base.BaseHTMLReader
Bases: object
Base class for document readers for handling SGML-like documents.
-
iter_elements(file, encoding=None)
Return an iterator of elements found in the document.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is an element from document.htmlparse.element
Return type: iterator
-
-
class wpull.document.base.BaseTextStreamReader
Bases: object
Base class for document readers that filter link and non-link text.
-
iter_links(file, encoding=None, context=False)
Return the links.
This function is a convenience function for calling iter_text() and returning only the links.
-
iter_text(file, encoding=None)
Return the file text and links.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is a tuple:
- str: The text.
- bool (or truthy value): Whether the text is likely a link. A truthy value may be provided instead, containing additional context about the link.
Return type: iterator
The links returned are raw text and will require further processing.
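The relationship between iter_text() and iter_links() can be illustrated with a small stand-in class. This is a sketch under stated assumptions rather than wpull code: SketchTextStreamReader and its URL_PATTERN are hypothetical, the regular expression is a deliberately crude way to spot links, and the meaning given to the context flag here (yield the full tuple instead of only the text) is a guess, since this reference does not describe it.

```python
import io
import re


class SketchTextStreamReader:
    """Stand-in mirroring the documented iter_text()/iter_links() pair."""

    # Hypothetical, simplistic link pattern used only for this sketch.
    URL_PATTERN = re.compile(r'(https?://[^\s<>"]+)')

    def iter_text(self, file, encoding=None):
        # Yield (text, is_link) tuples covering the whole file, in order.
        data = file.read()
        if isinstance(data, bytes):
            data = data.decode(encoding or 'utf-8', errors='replace')
        # re.split with one capturing group alternates between
        # non-link text (even indexes) and link text (odd indexes).
        for index, part in enumerate(self.URL_PATTERN.split(data)):
            if part:
                yield part, index % 2 == 1

    def iter_links(self, file, encoding=None, context=False):
        # Convenience wrapper: keep only the items flagged as links.
        for text, is_link in self.iter_text(file, encoding=encoding):
            if is_link:
                yield (text, is_link) if context else text


reader = SketchTextStreamReader()
sample = b'See http://example.com/a and https://example.org/b now.'

items = list(reader.iter_text(io.BytesIO(sample)))
links = list(reader.iter_links(io.BytesIO(sample)))
```

In this sketch, links holds the two raw URL strings in document order. As the note above says, such raw text would still need further processing (for example, resolving against a base URL) before use.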
-
-
wpull.document.base.VeryFalse = <wpull.document.base.VeryFalseType object>
Document is not definitely supported.
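How a sentinel object of this kind might be used is sketched below. The sketch assumes that VeryFalse is a falsy singleton signalling that a detector has ruled a document out, which the one-line description above does not confirm. The VeryFalseType class here is re-created for illustration and is not copied from wpull; describe() is a hypothetical helper.

```python
class VeryFalseType:
    """Falsy sentinel type, re-created for this sketch (an assumption)."""

    def __bool__(self):
        return False

    def __repr__(self):
        return 'VeryFalse'


VeryFalse = VeryFalseType()


def describe(result):
    # Check identity first: VeryFalse is falsy, so a plain truth test
    # cannot tell it apart from an ordinary False.
    if result is VeryFalse:
        return 'ruled out'
    return 'likely supported' if result else 'not detected'


outcomes = [describe(True), describe(False), describe(VeryFalse)]
```

A separate sentinel lets callers that only test truthiness keep working, while callers that care about the stronger signal can compare with `is`.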