scraper.html Module
HTML link extractor.
-
class wpull.scraper.html.ElementWalker(css_scraper=None, javascript_scraper=None)
Bases: object
-
ATTR_HTML = 2
Flag for links that point to other documents.
-
ATTR_INLINE = 1
Flag for embedded objects (like images, stylesheets) in documents.
-
DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')
Attributes that contain JavaScript.
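The three entries look like attribute-name prefixes (covering names such as `onkeydown`, `onclick`, and `onmouseover`) rather than full attribute names. The sketch below assumes that reading; `is_dynamic_attribute` is a hypothetical helper and not part of wpull.

```python
# Values copied from the listing above.
DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')


def is_dynamic_attribute(name):
    """Return True if the attribute name starts with one of the prefixes.

    Assumption: the tuple entries are treated as name prefixes.
    """
    # str.startswith accepts a tuple and tests each prefix in turn.
    return name.lower().startswith(DYNAMIC_ATTRIBUTES)


names = ['onclick', 'onkeydown', 'onmouseover', 'href', 'onload']
dynamic = [name for name in names if is_dynamic_attribute(name)]
```

Under this reading, event attributes outside the three prefixes, such as `onload`, would not be treated as JavaScript-bearing.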
-
LINK_ATTRIBUTES = frozenset({'usemap', 'archive', 'lowsrc', 'classid', 'src', 'background', 'profile', 'dynsrc', 'href', 'cite', 'longdesc', 'codebase', 'data', 'action'})
HTML element attributes that may contain links.
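To illustrate how a set like this can be used, here is a minimal sketch that walks a document with the standard library `html.parser` and keeps only attributes whose names appear in the set. It does not use wpull's own parser classes; `LinkAttributeCollector` is a hypothetical name introduced for the example.

```python
from html.parser import HTMLParser

# Values copied from the listing above.
LINK_ATTRIBUTES = frozenset({
    'usemap', 'archive', 'lowsrc', 'classid', 'src', 'background', 'profile',
    'dynsrc', 'href', 'cite', 'longdesc', 'codebase', 'data', 'action',
})


class LinkAttributeCollector(HTMLParser):
    """Collect (tag, attribute, value) triples for link-bearing attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        for name, value in attrs:
            if name in LINK_ATTRIBUTES and value:
                self.found.append((tag, name, value))


collector = LinkAttributeCollector()
collector.feed('<a href="/next" title="Next">next</a><img src="logo.png" alt="">')
collector.close()
```

After feeding the sample markup, `collector.found` holds the `href` and `src` values, while `title` and `alt` are skipped.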
-
OPEN_GRAPH_LINK_NAMES = ('og:url', 'twitter:player')
Iterate elements looking for links.
Parameters:
- css_scraper (scraper.css.CSSScraper) – Optional CSS scraper.
- javascript_scraper (scraper.javascript.JavaScriptScraper) – Optional JavaScript scraper.
-
OPEN_GRAPH_MEDIA_NAMES = ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')
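The two Open Graph tuples carry no description of their own, but their names suggest a split between metadata that points to another page and metadata that points to embedded media. The sketch below shows one way such a split could be applied to a `meta` property name. How wpull actually consumes these tuples is an assumption here, and `classify_meta_name` is a hypothetical helper.

```python
# Values copied from the listing above.
OPEN_GRAPH_LINK_NAMES = ('og:url', 'twitter:player')
OPEN_GRAPH_MEDIA_NAMES = (
    'og:image', 'og:audio', 'og:video', 'twitter:image:src',
    'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3',
    'twitter:player:stream',
)


def classify_meta_name(name):
    """Return 'media', 'link', or None for a meta property name.

    Assumption: media names map to inline objects and link names map to
    links to other documents.
    """
    if name in OPEN_GRAPH_MEDIA_NAMES:
        return 'media'
    if name in OPEN_GRAPH_LINK_NAMES:
        return 'link'
    return None
```

For example, `classify_meta_name('og:image')` returns `'media'`, and a name outside both tuples, such as `'og:title'`, returns `None`.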
-
TAG_ATTRIBUTES = {'th': {'background': 1}, 'script': {'src': 1}, 'area': {'href': 2}, 'applet': {'code': 1}, 'frame': {'src': 3}, 'fig': {'src': 1}, 'form': {'action': 2}, 'object': {'data': 1}, 'bgsound': {'src': 1}, 'a': {'href': 2}, 'table': {'background': 1}, 'layer': {'src': 3}, 'iframe': {'src': 3}, 'input': {'src': 1}, 'body': {'background': 1}, 'overlay': {'src': 3}, 'td': {'background': 1}, 'img': {'lowsrc': 1, 'src': 1, 'href': 1}, 'embed': {'src': 3, 'href': 2}}
Mapping of element tag names to attributes containing links.
-
classmethod is_html_link(tag, attribute)
Return whether the link is likely to be an external object.
-
classmethod is_link_inline(tag, attribute)
Return whether the link is likely to be an inline object.
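The values in TAG_ATTRIBUTES (1, 2, and 3) match ATTR_INLINE, ATTR_HTML, and their bitwise OR, which suggests the two classmethods test bit flags. The following sketch assumes that interpretation. The helper names are hypothetical, and returning False for unknown tag/attribute pairs is a simplification that may differ from the real methods.

```python
ATTR_INLINE = 1
ATTR_HTML = 2

# A few entries copied from TAG_ATTRIBUTES in the listing above.
TAG_ATTRIBUTES = {
    'a': {'href': 2},
    'img': {'lowsrc': 1, 'src': 1, 'href': 1},
    'iframe': {'src': 3},
}


def flags_for(tag, attribute):
    """Look up the flag value, or 0 when the pair is not listed."""
    return TAG_ATTRIBUTES.get(tag, {}).get(attribute, 0)


def is_inline(tag, attribute):
    """True when the ATTR_INLINE bit is set for this tag/attribute pair."""
    return bool(flags_for(tag, attribute) & ATTR_INLINE)


def is_html(tag, attribute):
    """True when the ATTR_HTML bit is set for this tag/attribute pair."""
    return bool(flags_for(tag, attribute) & ATTR_HTML)
```

Under this reading, `iframe` `src` (value 3) counts as both an embedded object and a link to another document, while `a` `href` (value 2) is only the latter.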
-
iter_links(elements)
Iterate the document root for links.
Returns: An iterator of LinkInfo.
Return type: iterable
-
iter_links_by_js_attrib(attrib_name, attrib_value)
Iterate links of a JavaScript pseudo-link attribute.
-
iter_links_link_element(element)
Iterate a link element for URLs.
This function handles stylesheets and icons in addition to standard scraping rules.
-
classmethod iter_links_meta_element(element)
Iterate the meta element for links.
This function handles refresh URLs.
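A refresh URL lives in the `content` attribute of a `<meta http-equiv="refresh">` element, typically in the form `5; url=/next.html`. The sketch below shows one way to pull the URL out of such a string. The regular expression and the `parse_refresh_url` helper are illustrative assumptions, not wpull's implementation.

```python
import re
from urllib.parse import urljoin

# Delay (integer or decimal), a separator, then "url=<target>".
REFRESH_PATTERN = re.compile(
    r'^\s*\d+(?:\.\d+)?\s*[;,]\s*url\s*=\s*(.+?)\s*$', re.IGNORECASE
)


def parse_refresh_url(content, base_url=None):
    """Return the URL from a meta refresh string, or None if absent."""
    match = REFRESH_PATTERN.match(content)
    if not match:
        return None
    # Some pages wrap the target in quotes; strip them.
    url = match.group(1).strip('\'"')
    return urljoin(base_url, url) if base_url else url


resolved = parse_refresh_url('5; url=/next.html', 'http://example.com/a/')
```

A content value that holds only a delay, such as `'10'`, yields `None` because there is no target to follow.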
-
-
class wpull.scraper.html.HTMLScraper(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)
Bases: wpull.document.html.HTMLReader, wpull.scraper.base.BaseHTMLScraper
Scraper for HTML documents.
Parameters:
- html_parser (document.htmlparse.base.BaseParser) – An HTML parser such as the lxml or html5lib one.
- element_walker (ElementWalker) – HTML element walker.
- followed_tags – A list of tags that should be scraped.
- ignored_tags – A list of tags that should not be scraped.
- robots – If True, discard any links if they cannot be followed.
- only_relative – If True, discard any links that are not absolute paths.
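The interaction between `followed_tags` and `ignored_tags` is not spelled out above. One plausible reading is sketched below: an ignored tag is always skipped, and when a followed list is given, only tags on it are scraped. Both the precedence and the `tag_allowed` helper are assumptions made for illustration.

```python
def tag_allowed(tag, followed_tags=None, ignored_tags=None):
    """Decide whether links found on this tag should be kept.

    Assumption: ignored_tags takes precedence, and an empty or missing
    followed_tags means "follow every tag".
    """
    if ignored_tags and tag in ignored_tags:
        return False
    if followed_tags and tag not in followed_tags:
        return False
    return True


kept = [
    tag for tag in ('a', 'img', 'script', 'form')
    if tag_allowed(tag, followed_tags=['a', 'img', 'script'], ignored_tags=['script'])
]
```

With both lists unset, every tag passes the filter, which matches the `None` defaults in the constructor signature.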
-
wpull.scraper.html.LinkInfo
Information about a link in an lxml document.
-
wpull.scraper.html.element
An instance of document.HTMLReadElement.
-
wpull.scraper.html.tag
str
The element tag name.
-
wpull.scraper.html.attrib
str, None
If str, the name of the attribute. Otherwise, the link was found in element.text.
-
wpull.scraper.html.link
str
The link found.
-
wpull.scraper.html.inline
bool
Whether the link is an embedded object (like images or stylesheets).
-
wpull.scraper.html.linked
bool
Whether the link is a link to another page.
-
wpull.scraper.html.base_link
str, None
The base URL.
-
wpull.scraper.html.value_type
str
Indicates how the link was found. Possible values are:
- plain: The link was found plainly in an attribute value.
- list: The link was found in a space-separated list.
- css: The link was found in CSS text.
- refresh: The link was found in a refresh meta string.
- script: The link was found in JavaScript text.
- srcset: The link was found in a srcset attribute.
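As an illustration of the `srcset` case, a `srcset` attribute holds comma-separated candidates, each a URL optionally followed by a width or density descriptor. The sketch below extracts the URLs with a deliberately naive split. `iter_srcset_urls` is a hypothetical helper, and it does not handle URLs that themselves contain commas.

```python
def iter_srcset_urls(value):
    """Yield the URL part of each candidate in a srcset attribute value."""
    for candidate in value.split(','):
        parts = candidate.split()
        # Skip empty candidates produced by trailing or doubled commas.
        if parts:
            yield parts[0]


urls = list(iter_srcset_urls('small.jpg 480w, large.jpg 2x, '))
```

Each URL found this way would correspond to one link reported with a value type of `srcset`.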
-
wpull.scraper.html.link_type
A value from item.LinkInfo.
alias of LinkInfoType
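The "alias of" line suggests that LinkInfo is a named tuple. The sketch below mirrors the documented fields with `collections.namedtuple` to show how such a record might be built and read. The name `LinkInfoSketch` is hypothetical, and the field order simply follows the listing above, which may not match the real type.

```python
from collections import namedtuple

# Field names taken from the attribute listing above; order is assumed.
LinkInfoSketch = namedtuple('LinkInfoSketch', [
    'element', 'tag', 'attrib', 'link', 'inline', 'linked',
    'base_link', 'value_type', 'link_type',
])

info = LinkInfoSketch(
    element=None,        # would be the parsed element in real use
    tag='img',
    attrib='src',
    link='logo.png',
    inline=True,         # embedded object
    linked=False,        # not a link to another page
    base_link=None,
    value_type='plain',
    link_type=None,      # left unset in this sketch
)

# Named tuples are immutable; _replace returns a modified copy.
absolute = info._replace(link='http://example.com/logo.png')
```

Because a named tuple is immutable, adjusting a link (for example after resolving it against `base_link`) produces a new record rather than changing the original.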
-