`scraper.base` Module¶

Base classes

class wpull.scraper.base.BaseExtractiveScraper[source]¶

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseExtractiveReader

iter_processed_links(file, encoding=None, base_url=None)[source]¶

Return the links.

Returns:	Each item is a str which represents a link.
Return type:	iterator

class wpull.scraper.base.BaseHTMLScraper[source]¶

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseHTMLReader

class wpull.scraper.base.BaseScraper[source]¶

Bases: object

Base class for scrapers.

scrape(request, response, link_type=None)[source]¶

Extract the URLs from the document.

Parameters:

request (http.request.Request) – The request.
response (http.request.Response) – The response.
link_type – A value from item.LinkType.

Returns:

LinkContexts and document information.

If None, then the scraper does not support scraping the document.

Return type:

ScrapeResult, None
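
The `scrape()` contract above (return a result, or None when the document is not supported) can be sketched with stand-in classes. Nothing here imports wpull: `Response`, `PlainLinkScraper`, and the `content_type` and `body` attributes are hypothetical names chosen for illustration, and a plain dict stands in for `ScrapeResult`.

```python
import re
from typing import NamedTuple, Optional


class Response(NamedTuple):
    """Hypothetical stand-in for a response object (not wpull's class)."""
    content_type: str
    body: str


class PlainLinkScraper:
    """Hypothetical scraper that follows the scrape() contract."""

    def scrape(self, request, response, link_type=None) -> Optional[dict]:
        # Returning None signals that this scraper does not support
        # the document.
        if response.content_type != 'text/plain':
            return None
        links = re.findall(r'https?://\S+', response.body)
        # A real scraper would build a ScrapeResult; a dict stands in here.
        return {'links': links, 'encoding': 'utf-8'}


scraper = PlainLinkScraper()
supported = scraper.scrape(
    None, Response('text/plain', 'see http://example.com/a now'))
unsupported = scraper.scrape(None, Response('image/png', ''))
```

Callers are expected to check for None before using the result, since None means "not applicable" rather than "no links found".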

class wpull.scraper.base.BaseTextStreamScraper[source]¶

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseTextStreamReader

Base class for scrapers that process both link and non-link text.

iter_processed_links(file, encoding=None, base_url=None, context=False)[source]¶

Return the links.

This function is a convenience function for calling iter_processed_text() and returning only the links.

iter_processed_text(file, encoding=None, base_url=None)[source]¶

Return the file text and processed absolute links.

Parameters:

file – A file object containing the document.
encoding (str) – The encoding of the document.
base_url (str) – The URL at which the document is located.

Returns:

Each item is a tuple:

str: The text
bool: Whether the text is a link

Return type:

iterator
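
The relationship between `iter_processed_text()` and `iter_processed_links()` can be illustrated with a standalone sketch. It does not use wpull; the `url(...)` pattern and the module-level functions are assumptions made for the example, and a real subclass would supply its own link detection.

```python
import io
import re
from urllib.parse import urljoin

# Hypothetical link pattern: anything inside url(...), as in CSS.
LINK_PATTERN = re.compile(r'url\(([^)]+)\)')


def iter_processed_text(file, encoding=None, base_url=None):
    """Yield (text, is_link) tuples; links are made absolute."""
    data = file.read()
    text = data.decode(encoding or 'utf-8') if isinstance(data, bytes) else data
    position = 0
    for match in LINK_PATTERN.finditer(text):
        # Emit the non-link text that precedes the link.
        if match.start(1) > position:
            yield text[position:match.start(1)], False
        link = match.group(1)
        yield (urljoin(base_url, link) if base_url else link), True
        position = match.end(1)
    # Emit any trailing non-link text.
    if position < len(text):
        yield text[position:], False


def iter_processed_links(file, encoding=None, base_url=None):
    """Convenience wrapper that keeps only the items flagged as links."""
    for text, is_link in iter_processed_text(file, encoding, base_url):
        if is_link:
            yield text


document = io.BytesIO(b'a { background: url(img/bg.png) }')
links = list(iter_processed_links(
    document, base_url='http://example.com/css/site.css'))
pieces = list(iter_processed_text(io.StringIO('x url(a) y')))
```

Yielding the non-link text alongside the links lets a caller reassemble the document, for example to rewrite links in place, while the links-only wrapper serves callers that just need the URLs.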

scrape_links(text, context=False)[source]¶: Convenience function for scraping from a text string.

class wpull.scraper.base.DemuxDocumentScraper(document_scrapers)[source]¶

Bases: wpull.scraper.base.BaseScraper

Puts multiple Document Scrapers into one.

scrape(request, response, link_type=None)[source]¶: Iterate the scrapers, returning the first of the results.

scrape_info(request, response, link_type=None)[source]¶

Iterate the scrapers and return a dict of results.

Returns:	A dict where the keys are the scraper instances and the values are the results. That is, a mapping from `BaseDocumentScraper` to `ScrapeResult`.
Return type:	dict
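
The difference between `scrape()` and `scrape_info()` can be sketched as follows. This is not wpull's implementation: `DemuxSketch` and `FixedScraper` are hypothetical, and the sketch assumes that "the first of the results" means the first result that is not None.

```python
class FixedScraper:
    """Hypothetical scraper that always returns a preset result."""

    def __init__(self, result):
        self._result = result

    def scrape(self, request, response, link_type=None):
        return self._result


class DemuxSketch:
    """Hypothetical stand-in that combines several scrapers into one."""

    def __init__(self, document_scrapers):
        self._document_scrapers = list(document_scrapers)

    def scrape(self, request, response, link_type=None):
        # Try each scraper in order and stop at the first usable result
        # (assumed here to be the first one that is not None).
        for scraper in self._document_scrapers:
            result = scraper.scrape(request, response, link_type)
            if result is not None:
                return result
        return None

    def scrape_info(self, request, response, link_type=None):
        # Run every scraper and keep each result, keyed by scraper instance.
        return {
            scraper: scraper.scrape(request, response, link_type)
            for scraper in self._document_scrapers
        }


first = FixedScraper(None)
second = FixedScraper({'links': ['a']})
third = FixedScraper({'links': ['b']})
demux = DemuxSketch([first, second, third])

result = demux.scrape(None, None)
info = demux.scrape_info(None, None)
```

In this sketch `scrape()` short-circuits, so scrapers later in the list never run once a result is found, whereas `scrape_info()` always runs all of them.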

wpull.scraper.base.LinkContext¶

A named tuple describing a scraped link.

wpull.scraper.base.link¶

str

The link that was scraped.

wpull.scraper.base.inline¶

bool

Whether the link is an embedded object.

wpull.scraper.base.linked¶

bool

Whether the link links to another page.

wpull.scraper.base.link_type¶: A value from item.LinkType.

wpull.scraper.base.extra¶: Any extra info.

alias of LinkContextType

class wpull.scraper.base.ScrapeResult(link_contexts, encoding)[source]¶

Bases: dict

Links scraped from a document.

This class is subclassed from dict and contains convenience methods.

encoding¶: Character encoding of the document.

inline¶: Link Context of objects embedded in the document.

inline_links¶: URLs of objects embedded in the document.

link_contexts¶: Link Contexts.

linked¶: Link Context of objects linked from the document.

linked_links¶: URLs of objects linked from the document.
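
How the `LinkContext` fields relate to the `ScrapeResult` attributes listed above can be sketched with a standalone dict subclass. The field names come from this page; everything else is an assumption, including where the values are stored, the use of lists as return types, and the name `ScrapeResultSketch`.

```python
from collections import namedtuple

# Field names as documented for LinkContext; the tuple itself is a stand-in.
LinkContext = namedtuple(
    'LinkContext', ['link', 'inline', 'linked', 'link_type', 'extra'])


class ScrapeResultSketch(dict):
    """Hypothetical dict subclass with convenience accessors."""

    def __init__(self, link_contexts, encoding):
        super().__init__()
        # Assumption: values are stored under dict keys.
        self['link_contexts'] = list(link_contexts)
        self['encoding'] = encoding

    @property
    def link_contexts(self):
        return self['link_contexts']

    @property
    def encoding(self):
        return self['encoding']

    @property
    def inline(self):
        # Contexts for objects embedded in the document.
        return [c for c in self.link_contexts if c.inline]

    @property
    def linked(self):
        # Contexts for objects linked from the document.
        return [c for c in self.link_contexts if c.linked]

    @property
    def inline_links(self):
        return [c.link for c in self.inline]

    @property
    def linked_links(self):
        return [c.link for c in self.linked]


contexts = [
    LinkContext('http://example.com/logo.png', True, False, None, None),
    LinkContext('http://example.com/about', False, True, None, None),
]
scrape_result = ScrapeResultSketch(contexts, 'utf-8')
```

With this layout, `inline_links` and `linked_links` are simply the `link` fields of the filtered contexts, so a caller that needs only URLs does not have to unpack each tuple.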
