scraper.util Module
Miscellaneous functions.

wpull.scraper.util.clean_link_soup(link)
Strip whitespace from a link in HTML soup.
Parameters: link (str) – A string containing the link with lots of whitespace.
The link is split into lines. For each line, leading and trailing whitespace is removed and tabs are removed throughout. The lines are concatenated and returned.
For example, passing the href value of:

    <a href=" http://example.com/
            blog/entry/
        how smaug stole all the bitcoins.html
    ">

will return http://example.com/blog/entry/how smaug stole all the bitcoins.html.
Returns: The cleaned link.
Return type: str
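
The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration of the documented behavior, not necessarily how wpull implements it; the name `clean_link_soup_sketch` and the sample `href` string are made up for this example.

```python
def clean_link_soup_sketch(link: str) -> str:
    """Hypothetical re-creation of the documented cleaning steps."""
    # Split into lines, strip each line's leading/trailing whitespace,
    # drop any remaining tabs, then concatenate the pieces.
    return ''.join(
        line.strip().replace('\t', '') for line in link.splitlines()
    )


# Sample href value spread over several lines (an assumption for illustration).
href = (
    " http://example.com/\n"
    "\n"
    "\t\tblog/entry/\n"
    "    how smaug stole all the bitcoins.html\n"
    " "
)
print(clean_link_soup_sketch(href))
# prints: http://example.com/blog/entry/how smaug stole all the bitcoins.html
```

Spaces inside a line are kept, so the result may still need percent-encoding before it is used as a URL.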

wpull.scraper.util.identify_link_type(filename)
Return the link type guessed from the filename extension.
Returns: A value from item.LinkType.
Return type: str
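
Guessing a link type from a filename extension might look like the sketch below. The extension table, the type strings, and the `None` result for unknown extensions are all assumptions made for illustration; they stand in for the values of item.LinkType and do not reflect the actual mapping wpull uses.

```python
import os.path

# Hypothetical extension table; the strings stand in for item.LinkType values.
_EXTENSION_TO_TYPE = {
    '.html': 'html',
    '.htm': 'html',
    '.css': 'css',
    '.js': 'javascript',
    '.png': 'media',
    '.jpg': 'media',
    '.mp4': 'media',
}


def identify_link_type_sketch(filename):
    """Return an assumed link type for `filename`, or None if unknown."""
    # splitext returns (root, extension); compare case-insensitively.
    extension = os.path.splitext(filename)[1].lower()
    return _EXTENSION_TO_TYPE.get(extension)


print(identify_link_type_sketch('style.CSS'))   # prints: css
print(identify_link_type_sketch('README'))      # prints: None
```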

wpull.scraper.util.is_likely_link(text)
Return whether the text is likely to be a link.
This function assumes that leading and trailing whitespace has already been removed.
Returns: bool