scraper.html Module

HTML link extractor.

class wpull.scraper.html.ElementWalker(css_scraper=None, javascript_scraper=None)[source]

Bases: object

ATTR_HTML = 2

Flag for links that point to other documents.

ATTR_INLINE = 1

Flag for embedded objects (like images, stylesheets) in documents.

DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')

Attributes that contain JavaScript.
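The three values above look like truncated prefixes of event-handler attribute names (onkey..., oncli..., onmou...). The following is a minimal sketch under that assumption; `is_dynamic_attribute` is a hypothetical helper for illustration and is not part of wpull.

```python
# Values copied from the documentation above.
DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')


def is_dynamic_attribute(name):
    """Hypothetical helper: True if the attribute name starts with a listed prefix."""
    # str.startswith accepts a tuple of prefixes.
    return name.lower().startswith(DYNAMIC_ATTRIBUTES)


print(is_dynamic_attribute('onclick'))      # True
print(is_dynamic_attribute('onmouseover'))  # True
print(is_dynamic_attribute('href'))         # False
```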

HTML element attributes that may contain links.

Iterate elements looking for links.

Parameters:
  • css_scraper (scraper.css.CSSScraper) – Optional CSS scraper.
  • javascript_scraper (scraper.javascript.JavaScriptScraper) – Optional JavaScript scraper.
OPEN_GRAPH_MEDIA_NAMES = ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')
TAG_ATTRIBUTES = {'th': {'background': 1}, 'table': {'background': 1}, 'body': {'background': 1}, 'fig': {'src': 1}, 'bgsound': {'src': 1}, 'embed': {'href': 2, 'src': 3}, 'overlay': {'src': 3}, 'area': {'href': 2}, 'input': {'src': 1}, 'layer': {'src': 3}, 'a': {'href': 2}, 'iframe': {'src': 3}, 'frame': {'src': 3}, 'applet': {'code': 1}, 'img': {'href': 1, 'lowsrc': 1, 'src': 1}, 'td': {'background': 1}, 'form': {'action': 2}, 'script': {'src': 1}, 'object': {'data': 1}}

Mapping of element tag names to attributes containing links.
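The integer values in TAG_ATTRIBUTES appear to be bit flags built from ATTR_INLINE (1) and ATTR_HTML (2), so a value of 3 would mean both flags are set. The sketch below assumes that reading and uses only a small subset of the mapping; `describe` is a hypothetical helper, not a wpull function.

```python
ATTR_INLINE = 1  # embedded objects (images, stylesheets)
ATTR_HTML = 2    # links that point to other documents

# A small subset of the TAG_ATTRIBUTES mapping shown above.
TAG_ATTRIBUTES = {
    'a': {'href': 2},
    'img': {'href': 1, 'lowsrc': 1, 'src': 1},
    'iframe': {'src': 3},
}


def describe(tag, attrib):
    """Hypothetical helper: return (inline, linked) for a tag/attribute pair."""
    flags = TAG_ATTRIBUTES.get(tag, {}).get(attrib, 0)
    return bool(flags & ATTR_INLINE), bool(flags & ATTR_HTML)


print(describe('a', 'href'))      # (False, True)
print(describe('img', 'src'))     # (True, False)
print(describe('iframe', 'src'))  # (True, True)
```

Under this interpretation, an iframe source is both embedded in the page and a document in its own right, which would explain the combined value.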

Return whether the link is likely to be an external object.

Return whether the link is likely to be an inline object.

Iterate the document root for links.

Returns: An iterator of LinkInfo.
Return type: iterable

Iterate an element by looking at its attributes for links.

Iterate links of a JavaScript pseudo-link attribute.

Iterate an HTML element.

Get the element text as a link.

Iterate a link for URLs.

This function handles stylesheets and icons in addition to standard scraping rules.

Iterate the meta element for links.

This function handles refresh URLs.
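A refresh URL lives in the content attribute of a meta element, typically as a delay followed by `url=...`. The sketch below shows one way to pull the URL out of such a string; the pattern and the `extract_refresh_url` helper are assumptions for illustration and do not reproduce wpull's own parsing.

```python
import re

# Assumed shape: optional numeric delay, a separator, then "url=<target>".
REFRESH_PATTERN = re.compile(
    r'^\s*[\d.]*\s*[;,]\s*url\s*=\s*(.+?)\s*$', re.IGNORECASE
)


def extract_refresh_url(content):
    """Hypothetical helper: return the URL in a refresh string, or None."""
    match = REFRESH_PATTERN.match(content)
    if not match:
        return None
    # Drop optional surrounding quotes around the URL.
    return match.group(1).strip('\'"')


print(extract_refresh_url('5; url=http://example.com/next'))
print(extract_refresh_url('0;URL="/home"'))
print(extract_refresh_url('30'))
```

A content value that holds only a delay, such as `30`, reloads the same page and therefore yields no link.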

Iterate object and embed elements.

This function also looks at codebase and archive attributes.

Iterate a param element.

Iterate any element for links using generic rules.

Iterate a script element.

Iterate a style element.

classmethod robots_cannot_follow(element)[source]

Return whether we cannot follow links due to robots.txt directives.
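Because the method receives an element, one plausible reading is that it inspects a robots meta element for a nofollow directive. The sketch below illustrates that idea only; the `Element` stand-in and the `meta_robots_nofollow` helper are hypothetical and are not taken from wpull.

```python
from collections import namedtuple

# Stand-in for a parsed element; only the fields used in this sketch.
Element = namedtuple('Element', ['tag', 'attrib'])


def meta_robots_nofollow(element):
    """Hypothetical check: True for <meta name="robots" content="...nofollow...">."""
    if element.tag != 'meta':
        return False
    name = element.attrib.get('name', '').lower()
    content = element.attrib.get('content', '').lower()
    return name == 'robots' and 'nofollow' in content


blocked = Element('meta', {'name': 'robots', 'content': 'noindex, nofollow'})
allowed = Element('meta', {'name': 'robots', 'content': 'index, follow'})
print(meta_robots_nofollow(blocked))  # True
print(meta_robots_nofollow(allowed))  # False
```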

class wpull.scraper.html.HTMLScraper(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)[source]

Bases: wpull.document.html.HTMLReader, wpull.scraper.base.BaseHTMLScraper

Scraper for HTML documents.

Parameters:
  • html_parser (document.htmlparse.base.BaseParser) – An HTML parser such as the lxml or html5lib one.
  • element_walker (ElementWalker) – HTML element walker.
  • followed_tags – A list of tags that should be scraped.
  • ignored_tags – A list of tags that should not be scraped.
  • robots – If True, discard any links if they cannot be followed.
  • only_relative – If True, discard any links that are not absolute paths.
scrape(request, response, link_type=None)[source]
scrape_file(file, encoding=None, base_url=None)[source]

Scrape a file for links.

See scrape() for the return value.

wpull.scraper.html.LinkInfo

Information about a link in an lxml document.

wpull.scraper.html.element

An instance of document.HTMLReadElement.

wpull.scraper.html.tag

str

The element tag name.

wpull.scraper.html.attrib

str, None

If str, the name of the attribute. Otherwise, the link was found in element.text.

str

The link found.

wpull.scraper.html.inline

bool

Whether the link is an embedded object (like images or stylesheets).

wpull.scraper.html.linked

bool

Whether the link is a link to another page.

str, None

The base URL.

wpull.scraper.html.value_type

str

Indicates how the link was found. Possible values are:

  • plain: The link was found plainly in an attribute value.
  • list: The link was found in a space-separated list.
  • css: The link was found in CSS text.
  • refresh: The link was found in a refresh meta string.
  • script: The link was found in JavaScript text.
  • srcset: The link was found in a srcset attribute.
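To make the srcset case concrete: a srcset attribute holds comma-separated candidates, each a URL followed by an optional descriptor such as `2x` or `480w`. The sketch below is a simplified split that assumes URLs contain no commas or spaces; `srcset_urls` is a hypothetical helper and not wpull's parser.

```python
def srcset_urls(value):
    """Hypothetical helper: return the URL of each srcset candidate."""
    urls = []
    for candidate in value.split(','):
        parts = candidate.split()
        # Skip empty candidates, e.g. after a trailing comma.
        if parts:
            urls.append(parts[0])
    return urls


print(srcset_urls('small.png 1x, large.png 2x'))
# ['small.png', 'large.png']
```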

A value from item.LinkInfo.

alias of LinkInfoType