scraper.base Module

Base classes

class wpull.scraper.base.BaseExtractiveScraper[source]

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseExtractiveReader

Return the links.

Returns: Each item is a str which represents a link.
Return type: iterator

class wpull.scraper.base.BaseHTMLScraper[source]

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseHTMLReader

class wpull.scraper.base.BaseScraper[source]

Bases: object

Base class for scrapers.

scrape(request, response, link_type=None)[source]

Extract the URLs from the document.

Parameters:
Returns:

LinkContexts and document information.

If None, then the scraper does not support scraping the document.

Return type:

ScrapeResult, None

class wpull.scraper.base.BaseTextStreamScraper[source]

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseTextStreamReader

Base class for scrapers that process both link and non-link text.

Return the links.

This function is a convenience function for calling iter_processed_text() and returning only the links.

iter_processed_text(file, encoding=None, base_url=None)[source]

Return the file text and processed absolute links.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
  • base_url (str) – The URL at which the document is located.
Returns:

Each item is a tuple:

  1. str: The text
  2. bool: Whether the text is a link

Return type:

iterator
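
The (text, is-link) tuple stream described above can be illustrated with a minimal, self-contained sketch. It does not use wpull: `iter_text_and_links`, `iter_links_only`, and `URL_PATTERN` are hypothetical names, and the link detection (a simple regular expression over plain text) is an assumption made only for illustration. It also omits the `encoding` and `base_url` parameters.

```python
import io
import re

# Hypothetical pattern: any run of non-space characters that starts with
# http:// or https:// counts as a link. Real scrapers use format-specific rules.
URL_PATTERN = re.compile(r'https?://\S+')


def iter_text_and_links(file):
    """Yield (text, is_link) tuples that cover the whole document in order."""
    text = file.read()
    position = 0
    for match in URL_PATTERN.finditer(text):
        if match.start() > position:
            # Non-link text between the previous link and this one.
            yield text[position:match.start()], False
        yield match.group(), True
        position = match.end()
    if position < len(text):
        # Trailing non-link text after the last link.
        yield text[position:], False


def iter_links_only(file):
    """Convenience wrapper: keep only the items flagged as links."""
    for text, is_link in iter_text_and_links(file):
        if is_link:
            yield text


document = io.StringIO('See http://example.com/a and http://example.com/b now')
items = list(iter_text_and_links(document))
document.seek(0)
links = list(iter_links_only(document))
```

Yielding the non-link text alongside the links lets a caller reconstruct the document while still being able to filter for links alone.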

Convenience function for scraping from a text string.

class wpull.scraper.base.DemuxDocumentScraper(document_scrapers)[source]

Bases: wpull.scraper.base.BaseScraper

Combines multiple document scrapers into one.

scrape(request, response, link_type=None)[source]

Iterate the scrapers, returning the first of the results.

scrape_info(request, response, link_type=None)[source]

Iterate the scrapers and return a dict of results.

Returns: A dict where the keys are the scraper instances and the values are the results. That is, a mapping from BaseDocumentScraper to ScrapeResult.
Return type: dict

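
The two demultiplexing behaviours described above can be sketched without wpull as follows. `SuffixScraper` and `DemuxSketch` are hypothetical classes, and the single `url` argument stands in for the real `request`, `response`, and `link_type` parameters. Whether unsupported (None) results appear in the dict is not stated in this reference; the sketch assumes they are left out.

```python
class SuffixScraper:
    """Hypothetical scraper that supports only URLs with a given suffix."""

    def __init__(self, suffix, links):
        self.suffix = suffix
        self.links = links

    def scrape(self, url):
        # Returning None signals that this scraper does not support the document.
        if not url.endswith(self.suffix):
            return None
        return list(self.links)


class DemuxSketch:
    """Illustrative stand-in that tries several scrapers in order."""

    def __init__(self, scrapers):
        self.scrapers = list(scrapers)

    def scrape(self, url):
        """Return the first non-None result, or None if no scraper applies."""
        for scraper in self.scrapers:
            result = scraper.scrape(url)
            if result is not None:
                return result
        return None

    def scrape_info(self, url):
        """Return a dict mapping each scraper that produced a result to it."""
        info = {}
        for scraper in self.scrapers:
            result = scraper.scrape(url)
            # Assumption: scrapers that return None are omitted from the dict.
            if result is not None:
                info[scraper] = result
        return info


html_scraper = SuffixScraper('.html', ['http://example.com/page'])
css_scraper = SuffixScraper('.css', ['http://example.com/font.woff'])
demux = DemuxSketch([html_scraper, css_scraper])

first_result = demux.scrape('http://example.com/style.css')
no_result = demux.scrape('http://example.com/notes.txt')
all_results = demux.scrape_info('http://example.com/index.html')
```

In this sketch `scrape` stops at the first scraper that supports the document, while `scrape_info` always consults every scraper.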
wpull.scraper.base.LinkContext

A named tuple describing a scraped link.

str

The link that was scraped.

wpull.scraper.base.inline

bool

Whether the link is an embedded object.

wpull.scraper.base.linked

bool

Whether the link links to another page.

A value from item.LinkType.

wpull.scraper.base.extra

Any extra info.

alias of LinkContextType

class wpull.scraper.base.ScrapeResult(link_contexts, encoding)[source]

Bases: dict

Links scraped from a document.

This class is subclassed from dict and contains convenience methods.

encoding

Character encoding of the document.

inline

Link Context of objects embedded in the document.

URLs of objects embedded in the document.

Link Contexts.

linked

Link Context of objects linked from the document.

URLs of objects linked from the document.
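
To illustrate the idea of a dict subclass with convenience accessors, here is a minimal sketch that does not use wpull. `SketchLinkContext`, `SketchScrapeResult`, the field name `url`, and the dictionary keys are all hypothetical; only the `encoding`, `inline`, and `linked` accessor names are taken from this reference, and the real internal layout may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for LinkContext; the field name `url` is an assumption.
SketchLinkContext = namedtuple('SketchLinkContext', ['url', 'inline', 'linked'])


class SketchScrapeResult(dict):
    """Dict subclass with convenience properties (illustrative only)."""

    def __init__(self, link_contexts, encoding):
        super().__init__()
        # Assumption: the data is stored under these two keys.
        self['link_contexts'] = list(link_contexts)
        self['encoding'] = encoding

    @property
    def encoding(self):
        """Character encoding of the document."""
        return self['encoding']

    @property
    def inline(self):
        """Link contexts of objects embedded in the document."""
        return [context for context in self['link_contexts'] if context.inline]

    @property
    def linked(self):
        """Link contexts of objects linked from the document."""
        return [context for context in self['link_contexts'] if context.linked]


result = SketchScrapeResult(
    [
        SketchLinkContext('http://example.com/logo.png', inline=True, linked=False),
        SketchLinkContext('http://example.com/about', inline=False, linked=True),
    ],
    encoding='utf-8',
)

inline_urls = [context.url for context in result.inline]
linked_urls = [context.url for context in result.linked]
```

Because the result is still a plain dict, it can be passed to code that expects a mapping, while the properties give callers readable access to the common views.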