`document.htmlparse.lxml_` Module

Parsing using lxml and libxml2.

class wpull.document.htmlparse.lxml_.HTMLParser

HTML document parser.

This reader uses lxml as the parser.

classmethod detect_parser_type(file, encoding=None)

Get the suitable parser type for the document.

Returns:	str
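As an illustration of what choosing a parser type can involve, the following is a minimal sketch that peeks at the start of a file and guesses one of `html`, `xhtml`, or `xml`. The function name `guess_parser_type`, the `peek_size` parameter, and the heuristics are all hypothetical; this is not wpull's actual detection logic.

```python
import io
import re

# Hypothetical helper, not part of wpull.
XHTML_DOCTYPE = re.compile(r"<!doctype\s+html[^>]*xhtml")


def guess_parser_type(file, peek_size=4096):
    """Guess 'html', 'xhtml', or 'xml' from the first bytes of a file."""
    position = file.tell()
    head = file.read(peek_size)
    file.seek(position)  # restore the stream position for the real parser

    if isinstance(head, bytes):
        # latin-1 maps every byte to a character, so this never fails.
        head = head.decode("latin-1")
    head = head.lower()

    if XHTML_DOCTYPE.search(head):
        return "xhtml"
    if 'xmlns="http://www.w3.org/1999/xhtml"' in head:
        return "xhtml"
    if head.lstrip().startswith("<?xml") and "<html" not in head:
        return "xml"
    return "html"


sample = io.BytesIO(b'<?xml version="1.0"?><feed/>')
parser_type = guess_parser_type(sample)
```

Restoring the file position matters because the same file object is usually handed to the parser afterwards.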

classmethod parse_doctype(file, encoding=None)

Get the doctype from the document.

Returns:	str, None
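The `str, None` return convention can be sketched with a small regular-expression helper that returns the doctype declaration when one is present and `None` otherwise. The name `extract_doctype` is hypothetical, and the exact string that wpull returns (which comes from lxml) may be formatted differently.

```python
import re

# Hypothetical helper, not part of wpull. This simple pattern does not
# handle doctypes with an internal subset that contains '>'.
DOCTYPE_PATTERN = re.compile(r"<!DOCTYPE\s+[^>]*>", re.IGNORECASE)


def extract_doctype(text):
    """Return the first doctype declaration in text, or None."""
    match = DOCTYPE_PATTERN.search(text)
    return match.group(0) if match else None


found = extract_doctype("<!DOCTYPE html><html><body></body></html>")
missing = extract_doctype("<html><body></body></html>")
```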

parse_lxml(file, encoding=None, target_class=<class 'wpull.document.htmlparse.lxml_.HTMLParserTarget'>, parser_type='html')

Return an iterator of elements found in the document.

Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
- target_class – A class to be used for target parsing.
- parser_type (str) – The type of parser to use. Accepted values: `html`, `xhtml`, `xml`.
Returns:	Each item is an element from `document.htmlparse.element`
Return type:	iterator

class wpull.document.htmlparse.lxml_.HTMLParserTarget(callback)

Bases: object

An HTML parser target.

Parameters:	callback – A callback function. The function should accept one argument from `document.htmlparse.element`.
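A parser target is an object whose `start`, `end`, `data`, and `close` methods are called by the parser as it reads the document. The sketch below shows that pattern with a callback, using the standard library's `xml.etree.ElementTree.XMLParser`, whose target interface lxml follows. The class `CallbackTarget` and the tuple event format are hypothetical stand-ins; wpull's target passes objects from `document.htmlparse.element` instead.

```python
import xml.etree.ElementTree as ET


class CallbackTarget:
    """Hypothetical target that forwards each parse event to a callback."""

    def __init__(self, callback):
        self.callback = callback

    def start(self, tag, attrib):
        # Called when an opening tag is seen.
        self.callback(("start", tag, dict(attrib)))

    def end(self, tag):
        # Called when a closing tag is seen.
        self.callback(("end", tag, None))

    def data(self, text):
        # Called for character data; whitespace-only chunks are skipped.
        if text.strip():
            self.callback(("text", text, None))

    def close(self):
        # Called once when the parser is closed.
        return None


def collect_events(markup):
    """Parse markup and return the list of events the target reported."""
    events = []
    parser = ET.XMLParser(target=CallbackTarget(events.append))
    parser.feed(markup)
    parser.close()
    return events


events = collect_events('<a href="x">hi</a>')
```

Because the target receives events as they occur, no full document tree needs to be built in memory.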

wpull.document.htmlparse.lxml_.to_lxml_encoding(encoding)

Check if lxml supports the specified encoding.

Returns:	str, None
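The same "name if supported, otherwise `None`" pattern can be sketched against Python's own codec registry. This is only an analogue: the helper `to_known_encoding` is hypothetical, and it checks what Python supports through `codecs.lookup`, not what lxml or libxml2 supports.

```python
import codecs


def to_known_encoding(encoding):
    """Return Python's canonical codec name, or None if it is unknown."""
    if not encoding:
        return None
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return None


normalized = to_known_encoding("UTF8")
unknown = to_known_encoding("not-a-real-encoding")
```

Returning `None` rather than raising lets a caller fall back to the parser's own encoding detection.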
