`scraper.html` Module

HTML link extractor.

class wpull.scraper.html.ElementWalker(css_scraper=None, javascript_scraper=None)

Bases: object

Iterate elements looking for links.

Parameters:

css_scraper (`scraper.css.CSSScraper`) – Optional CSS scraper.
javascript_scraper (`scraper.javascript.JavaScriptScraper`) – Optional JavaScript scraper.

ATTR_HTML = 2: Flag for links that point to other documents.

ATTR_INLINE = 1: Flag for embedded objects (like images, stylesheets) in documents.

DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou'): Prefixes of attribute names that contain JavaScript, such as onkeydown, onclick, and onmouseover.

LINK_ATTRIBUTES = frozenset({'action', 'href', 'usemap', 'cite', 'background', 'dynsrc', 'lowsrc', 'data', 'profile', 'archive', 'codebase', 'longdesc', 'classid', 'src'}): HTML element attributes that may contain links.

OPEN_GRAPH_LINK_NAMES = ('og:url', 'twitter:player')

OPEN_GRAPH_MEDIA_NAMES = ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')

TAG_ATTRIBUTES = {'th': {'background': 1}, 'table': {'background': 1}, 'body': {'background': 1}, 'fig': {'src': 1}, 'bgsound': {'src': 1}, 'embed': {'href': 2, 'src': 3}, 'overlay': {'src': 3}, 'area': {'href': 2}, 'input': {'src': 1}, 'layer': {'src': 3}, 'a': {'href': 2}, 'iframe': {'src': 3}, 'frame': {'src': 3}, 'applet': {'code': 1}, 'img': {'href': 1, 'lowsrc': 1, 'src': 1}, 'td': {'background': 1}, 'form': {'action': 2}, 'script': {'src': 1}, 'object': {'data': 1}}: Mapping of element tag names to attributes containing links. Each value is ATTR_INLINE (1), ATTR_HTML (2), or both combined (3).

classmethod is_html_link(tag, attribute): Return whether the link is likely to be an external object.

classmethod is_link_inline(tag, attribute): Return whether the link is likely to be an inline object.
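
The flag values in TAG_ATTRIBUTES read as a bitmask: 1 is ATTR_INLINE, 2 is ATTR_HTML, and 3 sets both. The following is a minimal standalone sketch, not wpull's actual implementation, of how the two class methods could consult that table. The trimmed table, the helper `lookup_flags`, and the fallback for unlisted tags (treating href as a document link and anything else as inline) are assumptions made for illustration.

```python
ATTR_INLINE = 1  # embedded object (image, stylesheet, ...)
ATTR_HTML = 2    # link to another document

# Trimmed copy of the documented TAG_ATTRIBUTES table.
TAG_ATTRIBUTES = {
    'a': {'href': 2},
    'img': {'href': 1, 'lowsrc': 1, 'src': 1},
    'embed': {'href': 2, 'src': 3},
    'iframe': {'src': 3},
}


def lookup_flags(tag, attribute):
    """Return the flag bitmask for a tag/attribute pair, or None if unlisted."""
    return TAG_ATTRIBUTES.get(tag, {}).get(attribute)


def is_link_inline(tag, attribute):
    flags = lookup_flags(tag, attribute)
    if flags is not None:
        return bool(flags & ATTR_INLINE)
    # Assumed fallback for unlisted pairs: anything but href is inline.
    return attribute != 'href'


def is_html_link(tag, attribute):
    flags = lookup_flags(tag, attribute)
    if flags is not None:
        return bool(flags & ATTR_HTML)
    # Assumed fallback for unlisted pairs: href points to a document.
    return attribute == 'href'
```

With this table, an iframe src counts as both inline and a document link, while an img src is inline only.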

iter_links(elements)

Iterate the document root for links.

Returns:	An iterator of `LinkInfo`.
Return type:	iterable

iter_links_by_attrib(element): Iterate an element by looking at its attributes for links.

iter_links_by_js_attrib(attrib_name, attrib_value): Iterate links of a JavaScript pseudo-link attribute.
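
The entries of DYNAMIC_ATTRIBUTES look like truncated event handler names, which suggests they are matched as prefixes. A minimal sketch under that assumption follows; `is_dynamic_attribute` is a hypothetical helper, not part of wpull.

```python
DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')


def is_dynamic_attribute(attrib_name):
    """Return True if the attribute name starts with a known event prefix."""
    # str.startswith accepts a tuple of candidate prefixes.
    return attrib_name.lower().startswith(DYNAMIC_ATTRIBUTES)
```

Under this reading, onclick and onmouseover are treated as JavaScript attributes, while href and onload are not.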

classmethod iter_links_by_srcset_attrib(attrib_name, attrib_value)
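
A srcset value is a comma-separated list of image candidates, each a URL optionally followed by a width or density descriptor. The sketch below shows one simple way to pull out the URLs. `split_srcset` is a hypothetical helper, and the splitting rule is an assumption that does not handle URLs containing commas, such as data URIs.

```python
def split_srcset(attrib_value):
    """Yield the URL part of each candidate in a srcset attribute value."""
    for candidate in attrib_value.split(','):
        parts = candidate.split()
        if parts:
            # The first token is the URL; any descriptor ("2x", "480w") follows.
            yield parts[0]
```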

iter_links_element(element): Iterate an HTML element.

classmethod iter_links_element_text(element): Get the element text as a link.

iter_links_link_element(element)

Iterate a link for URLs.

This function handles stylesheets and icons in addition to standard scraping rules.

classmethod iter_links_meta_element(element)

Iterate the meta element for links.

This function handles refresh URLs.
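
A refresh meta element carries its target in the content attribute, for example `5; url=/next.html`. A minimal sketch of extracting that URL follows. The regular expression and the helper name `parse_refresh` are illustrative assumptions, not wpull's own parser.

```python
import re

# Optional delay, an optional separator, then "url=" followed by the target.
REFRESH_PATTERN = re.compile(
    r'''^\s*[\d.]*\s*[;,]?\s*url\s*=\s*['"]?([^'"]+?)['"]?\s*$''',
    re.IGNORECASE,
)


def parse_refresh(content):
    """Return the URL from a meta refresh content string, or None."""
    match = REFRESH_PATTERN.match(content)
    return match.group(1).strip() if match else None
```

A content string that holds only a delay, such as `5`, yields None because there is no URL to follow.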

classmethod iter_links_object_element(element)

Iterate object and embed elements.

This function also looks at codebase and archive attributes.

classmethod iter_links_open_graph_meta(element)
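
The two tuples OPEN_GRAPH_LINK_NAMES and OPEN_GRAPH_MEDIA_NAMES suggest that a meta property is classified by name: link names point to documents and media names point to embedded objects. A minimal sketch under that assumption follows; `classify_open_graph` is a hypothetical helper and the returned labels are illustrative.

```python
OPEN_GRAPH_LINK_NAMES = ('og:url', 'twitter:player')
OPEN_GRAPH_MEDIA_NAMES = (
    'og:image', 'og:audio', 'og:video',
    'twitter:image:src', 'twitter:image0', 'twitter:image1',
    'twitter:image2', 'twitter:image3', 'twitter:player:stream',
)


def classify_open_graph(name):
    """Return 'html', 'media', or None for a meta property or name value."""
    name = name.lower()
    if name in OPEN_GRAPH_LINK_NAMES:
        return 'html'
    if name in OPEN_GRAPH_MEDIA_NAMES:
        return 'media'
    return None
```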

classmethod iter_links_param_element(element): Iterate a param element.

iter_links_plain_element(element): Iterate any element for links using generic rules.

iter_links_script_element(element): Iterate a script element.

iter_links_style_element(element): Iterate a style element.

classmethod robots_cannot_follow(element): Return whether we cannot follow links due to robots.txt directives.

class wpull.scraper.html.HTMLScraper(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)

Bases: wpull.document.html.HTMLReader, wpull.scraper.base.BaseHTMLScraper

Scraper for HTML documents.

Parameters:

html_parser (`document.htmlparse.base.BaseParser`) – An HTML parser such as the lxml or html5lib one.
element_walker (`ElementWalker`) – HTML element walker.
followed_tags – A list of tags that should be scraped
ignored_tags – A list of tags that should not be scraped
robots – If True, discard any links if they cannot be followed
only_relative – If True, discard any links that are not relative, such as links that include a scheme
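
The tag filters can be pictured as a pass over the links the element walker yields. The sketch below is a standalone illustration, not wpull's code: `Link` and `filter_links` are hypothetical, and it assumes that followed_tags acts as an allow list when given and that ignored_tags is checked first.

```python
from collections import namedtuple

# Hypothetical stand-in for a scraped link: just the tag and the URL.
Link = namedtuple('Link', ['tag', 'link'])


def filter_links(links, followed_tags=None, ignored_tags=None):
    """Yield links whose tag passes the followed/ignored tag settings."""
    for item in links:
        if ignored_tags and item.tag in ignored_tags:
            continue
        # Assumption: when followed_tags is given, it acts as an allow list.
        if followed_tags and item.tag not in followed_tags:
            continue
        yield item


links = [Link('a', '/page'), Link('img', '/logo.png'), Link('form', '/post')]
kept = list(filter_links(links, followed_tags={'a', 'img'}, ignored_tags={'img'}))
```

Here only the `a` link survives: `img` is ignored explicitly and `form` is not in the followed set.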

scrape(request, response, link_type=None)

scrape_file(file, encoding=None, base_url=None)

Scrape a file for links.

See scrape() for the return value.

wpull.scraper.html.LinkInfo

Information about a link in an lxml document.

wpull.scraper.html.element: An instance of document.HTMLReadElement.

wpull.scraper.html.tag

str

The element tag name.

wpull.scraper.html.attrib

str, None

If str, the name of the attribute. Otherwise, the link was found in element.text.

wpull.scraper.html.link

str

The link found.

wpull.scraper.html.inline

bool

Whether the link is an embedded object (like images or stylesheets).

wpull.scraper.html.linked

bool

Whether the link is a link to another page.

wpull.scraper.html.base_link

str, None

The base URL.

wpull.scraper.html.value_type

str

Indicates how the link was found. Possible values are:

plain: The link was found plainly in an attribute value.
list: The link was found in a space-separated list.
css: The link was found in CSS text.
refresh: The link was found in a refresh meta string.
script: The link was found in JavaScript text.
srcset: The link was found in a srcset attribute.

wpull.scraper.html.link_type: A value from item.LinkInfo.

alias of LinkInfoType
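
Since the fields above are plain values, a record of this shape is easy to model and filter. The sketch below uses a stand-in namedtuple with the documented field names rather than importing wpull; the sample values are invented for illustration.

```python
from collections import namedtuple

# Stand-in with the documented field names; not wpull's own class.
LinkInfo = namedtuple('LinkInfo', [
    'element', 'tag', 'attrib', 'link', 'inline', 'linked',
    'base_link', 'value_type', 'link_type',
])

found = [
    LinkInfo(None, 'a', 'href', '/about', False, True, None, 'plain', None),
    LinkInfo(None, 'img', 'src', '/logo.png', True, False, None, 'plain', None),
    LinkInfo(None, 'img', 'srcset', '/logo@2x.png', True, False, None, 'srcset', None),
]

# Page requisites are the links flagged as embedded objects.
requisites = [info.link for info in found if info.inline]
# Candidates for recursion are the links that point to other pages.
pages = [info.link for info in found if info.linked]
```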
