document.htmlparse.lxml_ Module
Parsing using lxml and libxml2.

class wpull.document.htmlparse.lxml_.HTMLParser
Bases: wpull.document.htmlparse.base.BaseParser
HTML document parser.
This reader uses lxml as the parser.

BUFFER_SIZE = 131072

classmethod detect_parser_type(file, encoding=None)
Get the suitable parser type for the document.
Returns: str

classmethod parse_doctype(file, encoding=None)
Get the doctype from the document.
Returns: str, None
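
The two class methods above are documented only by their return values. As a rough illustration of how a doctype could be mapped to one of the parser types accepted by parse_lxml (html, xhtml, xml), here is a minimal standard-library sketch. The function names, the regular expression, and the peek sizes are all hypothetical; it assumes a seekable binary file and is not wpull's actual implementation.

```python
import io
import re

# Hypothetical pattern for a doctype declaration; not taken from wpull.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE\s[^>]*>', re.IGNORECASE)


def sketch_parse_doctype(file, encoding=None, peek_size=4096):
    """Return the doctype declaration near the start of the file, or None."""
    position = file.tell()
    data = file.read(peek_size)
    file.seek(position)  # restore the position so a later parse starts at the same place
    text = data.decode(encoding or 'latin-1', errors='replace')
    match = _DOCTYPE_RE.search(text)
    return match.group(0) if match else None


def sketch_detect_parser_type(file, encoding=None):
    """Guess 'html', 'xhtml', or 'xml' from the doctype and XML declaration."""
    doctype = sketch_parse_doctype(file, encoding=encoding)
    position = file.tell()
    head = file.read(64).lstrip()
    file.seek(position)
    if doctype and 'XHTML' in doctype.upper():
        return 'xhtml'
    if doctype is None and head.startswith(b'<?xml'):
        return 'xml'
    return 'html'


page = io.BytesIO(b'<!DOCTYPE html><html><body></body></html>')
print(sketch_parse_doctype(page))
print(sketch_detect_parser_type(page))
```

Both helpers restore the file position after peeking, so the same file object can still be handed to a parser afterwards.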

parse_lxml(file, encoding=None, target_class=<class 'wpull.document.htmlparse.lxml_.HTMLParserTarget'>, parser_type='html')
Return an iterator of elements found in the document.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
- target_class – A class to be used for target parsing.
- parser_type (str) – The type of parser to use. Accepted values: html, xhtml, xml.
Returns: Each item is an element from document.htmlparse.element
Return type: iterator
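
The general pattern of reading a file in fixed-size chunks and yielding elements as they become available can be sketched with the standard library alone. The example below uses xml.etree.ElementTree.XMLPullParser as a stand-in for lxml, so it only handles well-formed markup. The function name iter_elements is hypothetical, and the events it yields are ElementTree elements, not objects from document.htmlparse.element. Treat it as an analogy under those assumptions, not as the body of parse_lxml.

```python
import io
import xml.etree.ElementTree as ET

BUFFER_SIZE = 131072  # same value as the BUFFER_SIZE attribute documented above


def iter_elements(file, buffer_size=BUFFER_SIZE):
    """Yield (event, element) pairs while reading the file chunk by chunk."""
    parser = ET.XMLPullParser(events=('start', 'end'))
    while True:
        chunk = file.read(buffer_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over whatever the parser has completed so far.
        yield from parser.read_events()
    parser.close()
    # Events still pending at close time can be read afterwards.
    yield from parser.read_events()


doc = io.BytesIO(
    b'<html><body><a href="http://example.com/">link</a></body></html>'
)
# A tiny buffer forces several feed() calls, as a large document would.
tags = [
    element.tag
    for event, element in iter_elements(doc, buffer_size=16)
    if event == 'start'
]
print(tags)
```

Because the function is a generator, the caller can stop iterating early without the rest of the file being read.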

parser_error

class wpull.document.htmlparse.lxml_.HTMLParserTarget(callback)
Bases: object
An HTML parser target.
Parameters: callback – A callback function. The function should accept one argument from document.htmlparse.element.
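
A parser target is an object whose methods are called by the parser as it encounters markup. The standard library's xml.etree.ElementTree.XMLParser accepts a target with start, end, data, and close methods, which allows a small sketch of a callback-driven target without lxml. The class name CallbackTarget and the tuple shapes passed to the callback are assumptions made for illustration; the real HTMLParserTarget passes objects from document.htmlparse.element instead.

```python
import xml.etree.ElementTree as ET


class CallbackTarget:
    """Parser target that forwards each parse event to a single callback."""

    def __init__(self, callback):
        self.callback = callback

    def start(self, tag, attrib):
        # Called when an opening tag is seen.
        self.callback(('start', tag, dict(attrib)))

    def end(self, tag):
        # Called when a closing tag is seen.
        self.callback(('end', tag))

    def data(self, data):
        # Called for text content; text may arrive in more than one piece.
        self.callback(('data', data))

    def close(self):
        # Called when parsing finishes; the return value becomes the
        # result of parser.close().
        return None


events = []
parser = ET.XMLParser(target=CallbackTarget(events.append))
parser.feed('<p class="x">hi</p>')
parser.close()
print(events)
```

Passing a callback instead of building a tree keeps memory use flat, since the target decides what, if anything, to retain.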