scraper.base Module
Base classes
-
class wpull.scraper.base.BaseExtractiveScraper
Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseExtractiveReader
-
class wpull.scraper.base.BaseHTMLScraper
Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseHTMLReader
-
class wpull.scraper.base.BaseScraper
Bases: object
Base class for scrapers.
-
scrape(request, response, link_type=None)
Extract the URLs from the document.
Parameters:
- request (http.request.Request) – The request.
- response (http.request.Response) – The response.
- link_type – A value from item.LinkType.
Returns: LinkContexts and document information.
If None, then the scraper does not support scraping the document.
Return type: ScrapeResult, None
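The contract of scrape() can be illustrated with a small, self-contained sketch. It does not import wpull: FakeRequest, FakeResponse and PlainTextLinkScraper are hypothetical stand-ins, and the returned dict is only a placeholder for a real ScrapeResult. The point shown is the documented behaviour of returning None when the document is not supported.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration; these are not wpull's classes.
FakeRequest = namedtuple('FakeRequest', ['url'])
FakeResponse = namedtuple('FakeResponse', ['content_type', 'body'])


class PlainTextLinkScraper:
    """Toy scraper that follows the scrape() contract described above."""

    def scrape(self, request, response, link_type=None):
        # Return None to signal that this document type is unsupported.
        if response.content_type != 'text/plain':
            return None

        # Treat every whitespace-separated token starting with "http"
        # as a link (an assumption made only for this sketch).
        links = [word for word in response.body.split()
                 if word.startswith('http')]
        return {'links': links, 'encoding': 'utf-8'}


scraper = PlainTextLinkScraper()
request = FakeRequest('http://example.com/a.txt')
supported = scraper.scrape(
    request, FakeResponse('text/plain', 'see http://example.com/b now'))
unsupported = scraper.scrape(request, FakeResponse('image/png', ''))
```

A caller would check the return value for None before using it, since None means "not my document type" rather than "no links found".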
-
-
class wpull.scraper.base.BaseTextStreamScraper
Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseTextStreamReader
Base class for scrapers that process both link and non-link text.
-
iter_processed_links(file, encoding=None, base_url=None, context=False)
Return the links.
This function is a convenience function for calling iter_processed_text() and returning only the links.
-
iter_processed_text(file, encoding=None, base_url=None)
Return the file text and processed absolute links.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
- base_url (str) – The URL at which the document is located.
Returns: Each item is a tuple:
- str: The text
- bool: Whether the text is a link
Return type: iterator
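The relationship between the two methods can be sketched as follows. This is a minimal illustration, not wpull's implementation: LINK_PATTERN is a hypothetical rule for spotting links, the functions are written at module level rather than as methods, and the context parameter of iter_processed_links() is omitted.

```python
import io
import re
from urllib.parse import urljoin

# Hypothetical pattern: any run of non-space characters ending in ".html".
LINK_PATTERN = re.compile(r'(\S+\.html)')


def iter_processed_text(file, encoding=None, base_url=None):
    """Yield (text, is_link) tuples; link text is made absolute."""
    data = file.read()
    if isinstance(data, bytes):
        data = data.decode(encoding or 'utf-8')
    # re.split with one capture group alternates non-link and link text,
    # so odd indexes hold the captured links.
    for index, text in enumerate(LINK_PATTERN.split(data)):
        if not text:
            continue
        is_link = index % 2 == 1
        if is_link and base_url:
            text = urljoin(base_url, text)
        yield text, is_link


def iter_processed_links(file, encoding=None, base_url=None):
    """Convenience wrapper that keeps only the link items."""
    for text, is_link in iter_processed_text(file, encoding, base_url):
        if is_link:
            yield text


document = io.BytesIO(b'See page2.html and /docs/index.html for more.')
links = list(iter_processed_links(document, base_url='http://example.com/a/'))
```

Keeping the non-link text in the stream lets a caller rebuild the document with rewritten links, while the wrapper serves callers that only want the URLs.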
-
-
class wpull.scraper.base.DemuxDocumentScraper(document_scrapers)
Bases: wpull.scraper.base.BaseScraper
Puts multiple Document Scrapers into one.
-
scrape(request, response, link_type=None)
Iterate the scrapers, returning the first of the results.
-
scrape_info(request, response, link_type=None)
Iterate the scrapers and return a dict of results.
Returns: A dict where the keys are the scraper instances and the values are the results. That is, a mapping from BaseDocumentScraper to ScrapeResult.
Return type: dict
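The difference between scrape() and scrape_info() can be sketched with stand-in classes. DemuxSketch and FixedScraper are hypothetical names for this illustration only. Skipping None results in scrape() and keeping them as values in scrape_info() are assumptions of the sketch, not confirmed details of wpull's behaviour.

```python
class DemuxSketch:
    """Illustrative stand-in, not wpull's DemuxDocumentScraper."""

    def __init__(self, document_scrapers):
        self._document_scrapers = list(document_scrapers)

    def scrape(self, request, response, link_type=None):
        # Return the first result that is not None (assumed rule).
        for scraper in self._document_scrapers:
            result = scraper.scrape(request, response, link_type)
            if result is not None:
                return result
        return None

    def scrape_info(self, request, response, link_type=None):
        # Map each scraper instance to whatever it returned.
        return {
            scraper: scraper.scrape(request, response, link_type)
            for scraper in self._document_scrapers
        }


class FixedScraper:
    """Hypothetical scraper that always returns a preset value."""

    def __init__(self, value):
        self._value = value

    def scrape(self, request, response, link_type=None):
        return self._value


unsupported = FixedScraper(None)
html = FixedScraper(['http://example.com/'])
demux = DemuxSketch([unsupported, html])
first = demux.scrape(None, None)
info = demux.scrape_info(None, None)
```

scrape() suits callers that want a single answer, while scrape_info() exposes which scraper produced which result.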
-
-
wpull.scraper.base.LinkContext
A named tuple describing a scraped link.
-
wpull.scraper.base.link
str
The link that was scraped.
-
wpull.scraper.base.inline
bool
Whether the link is an embedded object.
-
wpull.scraper.base.linked
bool
Whether the link links to another page.
-
wpull.scraper.base.link_type
A value from item.LinkType.
-
wpull.scraper.base.extra
Any extra info.
alias of LinkContextType
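As a rough picture of the shape of such a tuple, the sketch below builds a named tuple with the five fields listed above. LinkContextSketch is a hypothetical name, and the default values are assumptions made for the example, not wpull's actual defaults.

```python
from collections import namedtuple

# Field names follow the attributes documented above.
# The defaults (False, False, None, None) are assumptions for this sketch;
# they apply to the four rightmost fields.
LinkContextSketch = namedtuple(
    'LinkContextSketch',
    ['link', 'inline', 'linked', 'link_type', 'extra'],
    defaults=[False, False, None, None],
)

# An embedded object, such as an image on the page.
image = LinkContextSketch('http://example.com/logo.png', inline=True)
# A hyperlink that leads to another page.
anchor = LinkContextSketch('http://example.com/about', linked=True)
```

Separate inline and linked flags let one link be marked as either, both, or neither.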
-
-
class wpull.scraper.base.ScrapeResult(link_contexts, encoding)
Bases: dict
Links scraped from a document.
This class is subclassed from dict and contains convenience methods.
-
encoding
Character encoding of the document.
-
inline
Link Context of objects embedded in the document.
-
inline_links
URLs of objects embedded in the document.
-
link_contexts
Link Contexts.
-
linked
Link Context of objects linked from the document.
-
linked_links
URLs of objects linked from the document.
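One way to picture a dict subclass with these convenience attributes is the sketch below. ScrapeResultSketch and Ctx are hypothetical names. The dict keys and the use of lists as return values are assumptions chosen for readability, so the real class may store and return its data differently.

```python
from collections import namedtuple

# Reduced, hypothetical link context with only the fields used here.
Ctx = namedtuple('Ctx', ['link', 'inline', 'linked'])


class ScrapeResultSketch(dict):
    """Illustrative dict subclass, not wpull's ScrapeResult."""

    def __init__(self, link_contexts, encoding):
        super().__init__()
        self['link_contexts'] = list(link_contexts)
        self['encoding'] = encoding

    @property
    def link_contexts(self):
        return self['link_contexts']

    @property
    def encoding(self):
        return self['encoding']

    @property
    def inline(self):
        # Contexts of objects embedded in the document.
        return [ctx for ctx in self.link_contexts if ctx.inline]

    @property
    def linked(self):
        # Contexts of objects linked from the document.
        return [ctx for ctx in self.link_contexts if ctx.linked]

    @property
    def inline_links(self):
        return [ctx.link for ctx in self.inline]

    @property
    def linked_links(self):
        return [ctx.link for ctx in self.linked]


result = ScrapeResultSketch(
    [Ctx('http://example.com/a.png', True, False),
     Ctx('http://example.com/b', False, True)],
    'utf-8',
)
```

Because the object is still a dict, code that expects plain mappings keeps working, while the properties save callers from filtering link contexts by hand.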
-