scraper.base Module
Base classes
-
class wpull.scraper.base.BaseExtractiveScraper
Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseExtractiveReader
-
class wpull.scraper.base.BaseHTMLScraper
Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseHTMLReader
-
class wpull.scraper.base.BaseScraper
Bases: object
Base class for scrapers.
-
scrape(request, response, link_type=None)
Extract the URLs from the document.
Parameters:
- request (http.request.Request) – The request.
- response (http.request.Response) – The response.
- link_type – A value from item.LinkType.
Returns: LinkContexts and document information. If None, then the scraper does not support scraping the document.
Return type: ScrapeResult, None
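The contract described above (return a result, or None when the document is unsupported) can be sketched as follows. This is a minimal illustration, not wpull's implementation: the class names `SketchBaseScraper` and `PlainTextOnlyScraper` are hypothetical, the response is modelled as a plain dict, and the result is a plain dict rather than a real ScrapeResult.

```python
import abc


class SketchBaseScraper(abc.ABC):
    """Hypothetical stand-in for a scraper base class (not wpull's code)."""

    @abc.abstractmethod
    def scrape(self, request, response, link_type=None):
        """Return scraped results, or None if the document is unsupported."""


class PlainTextOnlyScraper(SketchBaseScraper):
    """Hypothetical scraper that only handles 'text/plain' responses."""

    def scrape(self, request, response, link_type=None):
        # Assumption: `response` is a dict with 'content_type' and 'body'.
        if response.get('content_type') != 'text/plain':
            # None signals "this scraper does not support the document".
            return None

        # Treat every whitespace-separated word that looks like a URL
        # as a link.
        links = [
            word for word in response['body'].split()
            if word.startswith(('http://', 'https://'))
        ]
        return {'links': links, 'encoding': response.get('encoding')}
```

A caller would check for None before using the result, which is what lets several scrapers be tried in turn against the same document.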
-
-
class wpull.scraper.base.BaseTextStreamScraper
Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseTextStreamReader
Base class for scrapers that process both link and non-link text.
-
iter_processed_links(file, encoding=None, base_url=None, context=False)
Return the links.
This function is a convenience function for calling iter_processed_text() and returning only the links.
-
iter_processed_text(file, encoding=None, base_url=None)
Return the file text and processed absolute links.
Parameters:
- file – A file object containing the document.
- encoding (str) – The encoding of the document.
- base_url (str) – The URL at which the document is located.
Returns: Each item is a tuple:
- str: The text
- bool: Whether the text is a link
Return type: iterator
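The relationship between the two methods can be illustrated with a small standalone sketch. It is not wpull's code: the function names carry a `sketch_` prefix, the input is a string rather than a file object, and the `href=` pattern is a hypothetical rule chosen only to have something to match. The sketch yields (text, is_link) tuples with links resolved against the base URL, and a wrapper keeps only the links.

```python
import re
from urllib.parse import urljoin

# Hypothetical rule: any run of non-space characters directly after
# "href=" counts as a link. Real scrapers use format-specific rules.
_LINK_PATTERN = re.compile(r'(?<=href=)(\S+)')


def sketch_iter_processed_text(text, base_url=None):
    """Yield (text, is_link) tuples; links are made absolute."""
    position = 0
    for match in _LINK_PATTERN.finditer(text):
        if match.start() > position:
            # Non-link text between the previous link and this one.
            yield text[position:match.start()], False
        link = match.group(1)
        yield (urljoin(base_url, link) if base_url else link), True
        position = match.end()
    if position < len(text):
        # Trailing non-link text after the last link.
        yield text[position:], False


def sketch_iter_processed_links(text, base_url=None):
    """Convenience wrapper that keeps only the link items."""
    for chunk, is_link in sketch_iter_processed_text(text, base_url):
        if is_link:
            yield chunk
```

Keeping the non-link text in the stream means the full document can be reassembled from the yielded items, while callers that only want URLs use the wrapper.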
-
-
class wpull.scraper.base.DemuxDocumentScraper(document_scrapers)
Bases: wpull.scraper.base.BaseScraper
Puts multiple Document Scrapers into one.
-
scrape(request, response, link_type=None)
Iterate the scrapers, returning the first of the results.
-
scrape_info(request, response, link_type=None)
Iterate the scrapers and return a dict of results.
Returns: A dict where the keys are the scraper instances and the values are the results. That is, a mapping from BaseDocumentScraper to ScrapeResult.
Return type: dict
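A minimal sketch of the demultiplexing idea follows. It rests on two assumptions that the text above does not confirm: that "the first of the results" means the first result that is not None, and that scrape_info() records every scraper's result, including None. The names `SketchDemuxScraper` and `FixedResultScraper` are hypothetical.

```python
class FixedResultScraper:
    """Hypothetical stub scraper that always returns a preset result."""

    def __init__(self, result):
        self._result = result

    def scrape(self, request, response, link_type=None):
        return self._result


class SketchDemuxScraper:
    """Hypothetical wrapper that tries several scrapers in order."""

    def __init__(self, document_scrapers):
        self._document_scrapers = list(document_scrapers)

    def scrape(self, request, response, link_type=None):
        # Assumption: the first scraper that supports the document wins.
        for scraper in self._document_scrapers:
            result = scraper.scrape(request, response, link_type)
            if result is not None:
                return result
        return None

    def scrape_info(self, request, response, link_type=None):
        # Map each scraper instance to whatever it returned.
        return {
            scraper: scraper.scrape(request, response, link_type)
            for scraper in self._document_scrapers
        }
```

Using the scraper instances as dict keys works here because plain Python objects are hashable by identity.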
-
-
wpull.scraper.base.LinkContext
A named tuple describing a scraped link.
-
wpull.scraper.base.link (str)
The link that was scraped.
-
wpull.scraper.base.inline (bool)
Whether the link is an embedded object.
-
wpull.scraper.base.linked (bool)
Whether the link links to another page.
-
wpull.scraper.base.link_type
A value from item.LinkType.
-
wpull.scraper.base.extra
Any extra info.
alias of LinkContextType
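The fields listed above map naturally onto a named tuple. The sketch below uses the hypothetical name `SketchLinkContext`, and its default values are an assumption made for illustration; the text above does not state which fields are optional.

```python
import collections

# Field names follow the attributes documented above. The defaults
# (applied to the four rightmost fields) are assumptions.
SketchLinkContext = collections.namedtuple(
    'SketchLinkContext',
    ['link', 'inline', 'linked', 'link_type', 'extra'],
    defaults=(False, False, None, None),
)

# An embedded image: fetched as part of rendering the page.
image = SketchLinkContext('http://example.com/logo.png', inline=True)

# A hyperlink to another page.
page = SketchLinkContext('http://example.com/about', linked=True)
```

Because it is a tuple, an instance is immutable and can be unpacked positionally or accessed by field name.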
-
class wpull.scraper.base.ScrapeResult(link_contexts, encoding)
Bases: dict
Links scraped from a document.
This class is subclassed from dict and contains convenience methods.
-
encoding
Character encoding of the document.
-
inline
Link Context of objects embedded in the document.
-
inline_links
URLs of objects embedded in the document.
-
link_contexts
Link Contexts.
-
linked
Link Context of objects linked from the document.
-
linked_links
URLs of objects linked from the document.
-
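One way a dict subclass with these convenience attributes could look is sketched below. The names `SketchScrapeResult` and `SketchLink` are hypothetical, and the choices to store values under string keys and to return lists are assumptions, not a description of wpull's internals.

```python
import collections

# Hypothetical, reduced link record used only by this sketch.
SketchLink = collections.namedtuple('SketchLink', ['link', 'inline', 'linked'])


class SketchScrapeResult(dict):
    """Hypothetical dict subclass exposing filtered views of its links."""

    def __init__(self, link_contexts, encoding):
        super().__init__()
        self['link_contexts'] = list(link_contexts)
        self['encoding'] = encoding

    @property
    def link_contexts(self):
        return self['link_contexts']

    @property
    def encoding(self):
        return self['encoding']

    @property
    def inline(self):
        # Contexts for objects embedded in the document.
        return [ctx for ctx in self.link_contexts if ctx.inline]

    @property
    def linked(self):
        # Contexts for objects linked from the document.
        return [ctx for ctx in self.link_contexts if ctx.linked]

    @property
    def inline_links(self):
        return [ctx.link for ctx in self.inline]

    @property
    def linked_links(self):
        return [ctx.link for ctx in self.linked]
```

Subclassing dict keeps the result usable anywhere a plain mapping is expected, while the properties spare callers from filtering the contexts themselves.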