scraper.html Module
HTML link extractor.
-
class wpull.scraper.html.ElementWalker(css_scraper=None, javascript_scraper=None)
Bases: object
-
ATTR_HTML = 2
Flag for links that point to other documents.
-
ATTR_INLINE = 1
Flag for embedded objects (like images, stylesheets) in documents.
-
DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')
Prefixes of attribute names that contain JavaScript (for example, 'oncli' matches onclick).
-
LINK_ATTRIBUTES = frozenset({'action', 'href', 'usemap', 'cite', 'background', 'dynsrc', 'lowsrc', 'data', 'profile', 'archive', 'codebase', 'longdesc', 'classid', 'src'})
HTML element attributes that may contain links.
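As a rough illustration of how a set like LINK_ATTRIBUTES can drive link discovery, the sketch below uses Python's standard html.parser module rather than wpull's own parser. The class name AttributeLinkCollector and the sample markup are hypothetical and not part of wpull.

```python
from html.parser import HTMLParser

# Copied from the LINK_ATTRIBUTES value documented above.
LINK_ATTRIBUTES = frozenset({
    'action', 'href', 'usemap', 'cite', 'background', 'dynsrc', 'lowsrc',
    'data', 'profile', 'archive', 'codebase', 'longdesc', 'classid', 'src',
})


class AttributeLinkCollector(HTMLParser):
    """Hypothetical helper: collect (tag, attribute, value) link candidates."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip attributes that are empty or have no value at all.
            if name in LINK_ATTRIBUTES and value:
                self.found.append((tag, name, value))


collector = AttributeLinkCollector()
collector.feed('<a href="/about" title="About">About</a><img src="logo.png" alt="">')
collector.close()
```

Only the href and src values are collected here; title and alt are ignored because they are not in the set.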
-
OPEN_GRAPH_LINK_NAMES = ('og:url', 'twitter:player')
ElementWalker iterates elements looking for links.
Constructor parameters:
- css_scraper (scraper.css.CSSScraper) – Optional CSS scraper.
- javascript_scraper (scraper.javascript.JavaScriptScraper) – Optional JavaScript scraper.
-
OPEN_GRAPH_MEDIA_NAMES = ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')
-
TAG_ATTRIBUTES = {'th': {'background': 1}, 'table': {'background': 1}, 'body': {'background': 1}, 'fig': {'src': 1}, 'bgsound': {'src': 1}, 'embed': {'href': 2, 'src': 3}, 'overlay': {'src': 3}, 'area': {'href': 2}, 'input': {'src': 1}, 'layer': {'src': 3}, 'a': {'href': 2}, 'iframe': {'src': 3}, 'frame': {'src': 3}, 'applet': {'code': 1}, 'img': {'href': 1, 'lowsrc': 1, 'src': 1}, 'td': {'background': 1}, 'form': {'action': 2}, 'script': {'src': 1}, 'object': {'data': 1}}
Mapping of element tag names to attributes containing links.
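The values in this mapping look like bit flags built from ATTR_INLINE (1) and ATTR_HTML (2), with 3 meaning both. That reading is an assumption based on the constants listed here, not something the text states outright. A minimal sketch of decoding the values under that assumption, using a hypothetical helper named describe_flags:

```python
ATTR_INLINE = 1
ATTR_HTML = 2

# A small subset of the TAG_ATTRIBUTES mapping documented above.
TAG_ATTRIBUTES = {
    'a': {'href': 2},
    'img': {'href': 1, 'lowsrc': 1, 'src': 1},
    'iframe': {'src': 3},
}


def describe_flags(tag, attribute):
    """Hypothetical helper: return (is_inline, is_html) for a tag/attribute pair."""
    # Unknown tags or attributes fall back to 0, meaning neither flag is set.
    flags = TAG_ATTRIBUTES.get(tag, {}).get(attribute, 0)
    return bool(flags & ATTR_INLINE), bool(flags & ATTR_HTML)
```

Under this reading, an iframe src is treated both as an embedded object and as a link to another document, while an a href is only the latter.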
-
classmethod is_html_link(tag, attribute)
Return whether the link is likely to be an external object (a link to another document).
-
classmethod is_link_inline(tag, attribute)
Return whether the link is likely to be an inline object.
-
iter_links(elements)
Iterate the document root for links.
Returns: An iterator of LinkInfo.
Return type: iterable
-
iter_links_by_js_attrib(attrib_name, attrib_value)
Iterate links of a JavaScript pseudo-link attribute.
-
iter_links_link_element(element)
Iterate a link element for URLs.
This function handles stylesheets and icons in addition to standard scraping rules.
-
classmethod iter_links_meta_element(element)
Iterate the meta element for links.
This function handles refresh URLs.
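To make "refresh URLs" concrete: a meta refresh element carries a content string such as "5; url=/next". The sketch below shows one way to pull the URL out of such a string with a regular expression. The function name parse_refresh_url and the pattern are illustrative assumptions, not wpull's implementation.

```python
import re

# Matches content values such as "5; url=/next" or "0;URL='http://example.com/'".
REFRESH_PATTERN = re.compile(
    r'^\s*\d+(?:\.\d+)?\s*[;,]\s*url\s*=\s*(.+?)\s*$',
    re.IGNORECASE,
)


def parse_refresh_url(content):
    """Hypothetical helper: return the URL in a refresh string, or None."""
    match = REFRESH_PATTERN.match(content)
    if not match:
        return None
    url = match.group(1)
    # Strip one pair of matching surrounding quotes, if present.
    if len(url) >= 2 and url[0] == url[-1] and url[0] in '\'"':
        url = url[1:-1]
    return url or None
```

A content value with only a delay, such as "30", has no URL part, so the helper returns None.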
-
-
class wpull.scraper.html.HTMLScraper(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)
Bases: wpull.document.html.HTMLReader, wpull.scraper.base.BaseHTMLScraper
Scraper for HTML documents.
Parameters:
- html_parser (document.htmlparse.base.BaseParser) – An HTML parser such as the lxml or html5lib one.
- element_walker (ElementWalker) – HTML element walker.
- followed_tags – A list of tags that should be scraped.
- ignored_tags – A list of tags that should not be scraped.
- robots – If True, discard any links if they cannot be followed.
- only_relative – If True, discard any links that are not absolute paths.
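The followed_tags and ignored_tags parameters suggest a simple allow/deny filter on tag names. The sketch below shows one plausible way the two could interact. The helper is_tag_accepted is hypothetical, and the precedence (ignored tags win over followed tags) is an assumption rather than documented behavior.

```python
def is_tag_accepted(tag, followed_tags=None, ignored_tags=None):
    """Hypothetical helper: decide whether links on `tag` should be scraped."""
    tag = tag.lower()
    # Assumption: an ignored tag is rejected even if it is also followed.
    if ignored_tags is not None and tag in ignored_tags:
        return False
    # Assumption: when followed_tags is given, only those tags are scraped.
    if followed_tags is not None:
        return tag in followed_tags
    # With neither list given, every tag is scraped.
    return True
```

For example, is_tag_accepted('IMG', followed_tags={'a'}) is False because only a elements are followed in that call.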
-
wpull.scraper.html.LinkInfo
Information about a link in an lxml document.
-
wpull.scraper.html.element
An instance of document.HTMLReadElement.
-
wpull.scraper.html.tag
str
The element tag name.
-
wpull.scraper.html.attrib
str, None
If str, the name of the attribute. Otherwise, the link was found in element.text.
-
wpull.scraper.html.link
str
The link found.
-
wpull.scraper.html.inline
bool
Whether the link is an embedded object (like images or stylesheets).
-
wpull.scraper.html.linked
bool
Whether the link is a link to another page.
-
wpull.scraper.html.base_link
str, None
The base URL.
-
wpull.scraper.html.value_type
str
Indicates how the link was found. Possible values are:
- plain: The link was found plainly in an attribute value.
- list: The link was found in a space-separated list.
- css: The link was found in CSS text.
- refresh: The link was found in a refresh meta string.
- script: The link was found in JavaScript text.
- srcset: The link was found in a srcset attribute.
-
wpull.scraper.html.link_type
A value from item.LinkInfo.
alias of LinkInfoType
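The attributes listed above can be pictured as the fields of a named tuple. The sketch below is a stand-in built with collections.namedtuple, not wpull's own definition. The field order and the sample values are assumptions made for illustration.

```python
from collections import namedtuple

# Field names follow the attributes documented above; the order is assumed.
LinkInfo = namedtuple('LinkInfo', [
    'element', 'tag', 'attrib', 'link', 'inline', 'linked',
    'base_link', 'value_type', 'link_type',
])

example = LinkInfo(
    element=None,        # placeholder; a real value would be a parsed element
    tag='img',
    attrib='src',
    link='logo.png',
    inline=True,         # an image is an embedded object
    linked=False,        # it does not lead to another page
    base_link=None,
    value_type='plain',  # found directly in an attribute value
    link_type=None,      # placeholder; the real value type is not shown here
)
```

A consumer could then separate page requisites from navigational links by checking example.inline and example.linked.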