processor.web Module

Web processing.

exception wpull.processor.web.HookPreResponseBreak[source]

Bases: wpull.errors.ProtocolError

Hook pre-response break.

class wpull.processor.web.WebProcessor(web_client: wpull.protocol.http.web.WebClient, fetch_params: wpull.processor.web.WebProcessorFetchParamsType)[source]

Bases: wpull.processor.base.BaseProcessor, wpull.application.hook.HookableMixin

HTTP processor.

Parameters:
  • web_client – The web client.
  • fetch_params – The fetch parameters.

DOCUMENT_STATUS_CODES = (200, 204, 206, 304)

Default status codes that indicate a document was fetched successfully.

NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410)

Default status codes considered a permanent error.
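As a rough illustration of how the two default tuples partition responses, the sketch below copies their values from this page and classifies a status code against them. The helper `classify_status` and its return labels are hypothetical names chosen for this example; they are not part of wpull, and the actual processor may apply further rules.

```python
# Values copied from the class attributes documented above.
DOCUMENT_STATUS_CODES = (200, 204, 206, 304)
NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410)


def classify_status(status_code):
    """Hypothetical helper: label a status code using the default tuples."""
    if status_code in DOCUMENT_STATUS_CODES:
        return 'document'
    if status_code in NO_DOCUMENT_STATUS_CODES:
        return 'permanent_error'
    # Anything else (for example 500) is in neither default tuple.
    return 'other'


for code in (200, 404, 500):
    print(code, classify_status(code))
```

Codes that fall in neither tuple are only labelled 'other' here; how they are treated is up to the processor session.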

close()[source]

Close the web client.

fetch_params

The fetch parameters.

process(item_session: wpull.pipeline.session.ItemSession)[source]

web_client

The web client.

wpull.processor.web.WebProcessorFetchParams

WebProcessorFetchParams

Parameters:
  • post_data (str) – If provided, all requests will be POSTed with the given post_data. post_data must be in percent-encoded query format (“application/x-www-form-urlencoded”).
  • strong_redirects (bool) – If True, redirects are allowed to span hosts.

alias of WebProcessorFetchParamsType
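The post_data parameter expects a string already in percent-encoded query format. One way to produce such a string, shown here as a minimal sketch using only the Python standard library, is `urllib.parse.urlencode`. The field names and values are made up for illustration, and the example deliberately does not construct a WebProcessorFetchParams instance because its full field list is not shown on this page.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical form fields, for illustration only.
form_fields = {'q': 'hello world', 'lang': 'en'}

# Encode as application/x-www-form-urlencoded.
post_data = urlencode(form_fields)
print(post_data)

# Decoding again recovers the original pairs.
print(parse_qsl(post_data))
```

The resulting string is the kind of value that could be passed as post_data.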

class wpull.processor.web.WebProcessorSession(processor: wpull.processor.web.WebProcessor, item_session: wpull.pipeline.session.ItemSession)[source]

Bases: wpull.processor.base.BaseProcessorSession

Fetches an HTTP document.

This Processor Session handles document redirects within the same Session. HTTP errors such as 404 are considered permanent errors. HTTP errors such as 500 are considered transient errors and are handled in subsequent sessions by marking the item as “error”.

If a successful document has been downloaded, it will be scraped for URLs to be added to the URL table. This Processor Session is very simple; it cannot handle JavaScript or Flash plugins.
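The redirect handling described above can be pictured with a small self-contained sketch. It does not use wpull at all: the canned responses, the `REDIRECT_STATUS_CODES` tuple, and the function `follow_redirects` are all assumptions made for this example, and the way the real session applies strong_redirects may differ in detail.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical canned responses: URL -> (status code, Location header or None).
FAKE_RESPONSES = {
    'http://example.com/a': (301, '/b'),
    'http://example.com/b': (302, 'http://other.example/c'),
    'http://other.example/c': (200, None),
}

# Assumption for this sketch: which status codes count as redirects.
REDIRECT_STATUS_CODES = (301, 302, 303, 307, 308)


def follow_redirects(url, strong_redirects=False, max_redirects=5):
    """Follow redirects in a single pass and return (final_url, status)."""
    for _ in range(max_redirects + 1):
        status, location = FAKE_RESPONSES[url]
        if status not in REDIRECT_STATUS_CODES or location is None:
            return url, status
        next_url = urljoin(url, location)
        same_host = urlsplit(next_url).hostname == urlsplit(url).hostname
        if not same_host and not strong_redirects:
            # In this sketch, a cross-host redirect is not followed
            # unless strong_redirects is set.
            return url, status
        url = next_url
    raise RuntimeError('too many redirects')


print(follow_redirects('http://example.com/a'))
print(follow_redirects('http://example.com/a', strong_redirects=True))
```

With strong_redirects left off, the sketch stops at the last same-host URL; with it on, the chain is followed to the other host.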

close()[source]

Close any temp files.

process()[source]