document.htmlparse.lxml_ Module

Parsing using lxml and libxml2.

class wpull.document.htmlparse.lxml_.HTMLParser

Bases: wpull.document.htmlparse.base.BaseParser

HTML document parser.

This parser uses lxml as its parsing backend.

BUFFER_SIZE = 131072
classmethod detect_parser_type(file, encoding=None)

Get the suitable parser type for the document.

Returns: str
parse(file, encoding=None)
classmethod parse_doctype(file, encoding=None)

Get the doctype from the document.

Returns: str, None
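
The following is a minimal sketch of how doctype and parser-type detection could work; it is not wpull's implementation. It peeks at the head of the file, looks for a doctype declaration, and maps the result to one of the parser types accepted by parse_lxml. The names sniff_doctype, guess_parser_type, DOCTYPE_PATTERN, and peek_size are hypothetical, and the heuristics are assumptions made for illustration.

```python
import io
import re

# Hypothetical pattern: a doctype declaration without an internal subset.
DOCTYPE_PATTERN = re.compile(rb'<!DOCTYPE\s+[^>]*>', re.IGNORECASE)


def sniff_doctype(file, peek_size=4096):
    """Return the doctype declaration as text, or None if none is found."""
    position = file.tell()
    head = file.read(peek_size)
    file.seek(position)  # leave the file position untouched
    match = DOCTYPE_PATTERN.search(head)
    return match.group(0).decode('latin-1') if match else None


def guess_parser_type(file):
    """Return 'xhtml', 'xml', or 'html' (assumed heuristic)."""
    doctype = sniff_doctype(file)
    if doctype and 'XHTML' in doctype.upper():
        return 'xhtml'
    position = file.tell()
    head = file.read(64).lstrip()
    file.seek(position)
    if head.startswith(b'<?xml'):
        return 'xml'  # XML declaration without an XHTML doctype
    return 'html'


html_file = io.BytesIO(b'<!DOCTYPE html><html></html>')
xml_file = io.BytesIO(b'<?xml version="1.0"?><feed/>')
xhtml_file = io.BytesIO(
    b'<?xml version="1.0"?><!DOCTYPE html PUBLIC '
    b'"-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html/>'
)
```

Both helpers restore the file position after peeking, so the same file object can still be passed to a parser afterwards.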
parse_lxml(file, encoding=None, target_class=HTMLParserTarget, parser_type='html')

Return an iterator of elements found in the document.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
  • target_class – A class to be used for target parsing.
  • parser_type (str) – The type of parser to use. Accepted values: html, xhtml, xml.
Returns: Each item is an element from document.htmlparse.element.
Return type: iterator
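
As a rough illustration of the iterator contract, the sketch below reads a file in BUFFER_SIZE chunks, feeds each chunk to a target parser, and yields whatever the target has queued so far. It assumes the standard library's xml.etree.ElementTree.XMLParser as a stand-in for lxml (so the input must be well-formed XML), and it yields plain tuples rather than objects from document.htmlparse.element. TupleTarget and iter_events are hypothetical names; wpull's actual implementation may differ.

```python
import io
from collections import deque
from xml.etree.ElementTree import XMLParser

BUFFER_SIZE = 131072  # mirrors HTMLParser.BUFFER_SIZE


class TupleTarget:
    """Hypothetical target: reports events as tuples via ``callback``."""

    def __init__(self, callback):
        self.callback = callback

    def start(self, tag, attrib):
        self.callback(('start', tag, dict(attrib)))

    def data(self, data):
        self.callback(('data', data))

    def end(self, tag):
        self.callback(('end', tag))

    def close(self):
        return None


def iter_events(file, target_class=TupleTarget, buffer_size=BUFFER_SIZE):
    """Yield parse events lazily while reading ``file`` in chunks."""
    queue = deque()
    parser = XMLParser(target=target_class(queue.append))
    while True:
        chunk = file.read(buffer_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand over everything the parser produced for this chunk.
        while queue:
            yield queue.popleft()
    parser.close()
    while queue:
        yield queue.popleft()


document = io.BytesIO(b'<html><body><a href="/x">Hi</a></body></html>')
events = list(iter_events(document, buffer_size=8))
```

Because events are yielded after every chunk, a caller can start processing a large document before it has been read completely. Text may arrive in several data events when it straddles a chunk boundary.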

parser_error
class wpull.document.htmlparse.lxml_.HTMLParserTarget(callback)

Bases: object

An HTML parser target.

Parameters: callback – A callback function. It should accept one argument, an element from document.htmlparse.element.
close()
comment(text)
data(data)
end(tag)
start(tag, attrib)
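
The five methods listed above match the parser target interface, in which the parser calls the target as it encounters tags, text, and comments. The sketch below shows that shape with a callback, assuming the standard library's XMLParser in place of lxml and tuples in place of document.htmlparse.element objects. CallbackTarget is a hypothetical name, not the class documented here.

```python
from xml.etree.ElementTree import XMLParser


class CallbackTarget:
    """Hypothetical target: forwards every parse event to ``callback``."""

    def __init__(self, callback):
        self.callback = callback

    def start(self, tag, attrib):
        self.callback(('start', tag, dict(attrib)))

    def data(self, data):
        self.callback(('data', data))

    def comment(self, text):
        self.callback(('comment', text))

    def end(self, tag):
        self.callback(('end', tag))

    def close(self):
        # Nothing to finalize; the callback has already seen every event.
        return None


received = []
parser = XMLParser(target=CallbackTarget(received.append))
parser.feed('<p class="intro">Hello<!-- note --></p>')
parser.close()
```

Any callable that takes one argument can serve as the callback; here list.append simply collects the events in document order.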
wpull.document.htmlparse.lxml_.to_lxml_encoding(encoding)

Check if lxml supports the specified encoding.

Returns: str, None
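
The "str or None" contract can be sketched as a lookup that returns a usable encoding name when the encoding is recognized and None otherwise. The example below assumes Python's codec registry as a stand-in for the encodings lxml and libxml2 support, which may differ; to_supported_encoding is a hypothetical name.

```python
import codecs


def to_supported_encoding(encoding):
    """Return the codec's canonical name, or None if it is unknown."""
    if not encoding:
        return None
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        return None


known = to_supported_encoding('UTF8')
unknown = to_supported_encoding('no-such-encoding')
```

Returning None rather than raising lets a caller fall back to the parser's own encoding detection.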