pipeline.session Module

class wpull.pipeline.session.ItemSession(app_session: wpull.pipeline.app.AppSession, url_record: wpull.pipeline.item.URLRecord)

Bases: object

Item for a URL that needs to be processed.

add_child_url(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None, replace: bool=False)

Add links scraped from the document with automatic values.

Parameters:
  • url – A full URL. (It can’t be a relative path.)
  • inline – Whether the URL is an embedded object.
  • link_type – Expected link type.
  • post_data – URL encoded form data. The request will be made using POST. (Don’t use this to upload files.)
  • level – The child depth of this URL.
  • replace – Whether to replace the existing entry in the database table so that the URL is downloaded again.

This function provides values automatically for:

  • inline
  • level
  • parent: The referring page.
  • root

See also add_url().
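
Because url must be a full URL, a caller that scrapes relative links has to resolve them first. The following is a minimal sketch of that step, not wpull's own scraper code: FakeItemSession and queue_links are hypothetical names, and the stand-in only mirrors the add_child_url() signature documented above so the example runs without wpull installed.

```python
from urllib.parse import urljoin


class FakeItemSession:
    """Hypothetical stand-in for ItemSession; it only records calls."""

    def __init__(self):
        self.added = []

    def add_child_url(self, url, inline=False, link_type=None,
                      post_data=None, level=None, replace=False):
        # Mirrors the documented signature; stores what was queued.
        self.added.append((url, inline))


def queue_links(item_session, base_url, hrefs, inline=False):
    """Resolve each href against base_url and queue it as a child URL."""
    for href in hrefs:
        # add_child_url() expects a full URL, so resolve relative paths first.
        full_url = urljoin(base_url, href)
        item_session.add_child_url(full_url, inline=inline)


session = FakeItemSession()
queue_links(session, "http://example.com/dir/page.html",
            ["a.html", "/img/b.png"])
print(session.added)
```

With a real ItemSession, the same helper would leave level, parent, and root to be filled in automatically, as listed above.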

add_url(url: str, url_properites: typing.Union=None, url_data: typing.Union=None)

child_url_record(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None)

Return a child URLRecord.

This function is useful for testing filters before adding to table.
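
One way to read "testing filters before adding to table" is sketched below: build the child record, pass it to a filter, and queue the URL only if the filter accepts it. This is an illustration under assumptions, not wpull's filter machinery. FakeItemSession, add_if_accepted, and the accept callback are hypothetical, and the record's url attribute is assumed.

```python
from types import SimpleNamespace


class FakeItemSession:
    """Hypothetical stand-in for ItemSession; not part of wpull."""

    def __init__(self):
        self.added = []

    def child_url_record(self, url, inline=False, link_type=None,
                         post_data=None, level=None):
        # A real session returns a URLRecord; the attributes here are assumed.
        return SimpleNamespace(url=url, inline=inline, level=level)

    def add_child_url(self, url, **kwargs):
        self.added.append(url)


def add_if_accepted(item_session, url, accept):
    """Queue url only if accept(record) is true for its child record."""
    record = item_session.child_url_record(url)
    if accept(record):
        item_session.add_child_url(url)
        return True
    return False


def https_only(record):
    # Example filter: accept only HTTPS URLs.
    return record.url.startswith("https://")


session = FakeItemSession()
first = add_if_accepted(session, "https://example.com/", https_only)
second = add_if_accepted(session, "ftp://example.com/file", https_only)
print(first, second, session.added)
```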

finish()

is_processed

Return whether the item has been processed.

is_virtual

request

response

set_status(status: wpull.pipeline.item.Status, increment_try_count: bool=True, filename: str=None)

Mark the item with the given status.

Parameters:
  • status – A value from Status.
  • increment_try_count – If True, increment the try_count value.

skip()

Mark the item as processed without download.
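
The sketch below shows how a caller might choose between skip() and set_status() once it knows the outcome for an item. It is illustrative only: the Status enum is a local stand-in whose member names (done, error) are assumptions, and FakeItemSession and settle are hypothetical names that mirror the signatures documented above.

```python
import enum


class Status(enum.Enum):
    """Local stand-in for wpull.pipeline.item.Status; members are assumed."""
    done = "done"
    error = "error"


class FakeItemSession:
    """Hypothetical stand-in for ItemSession; it only records calls."""

    def __init__(self):
        self.calls = []

    def set_status(self, status, increment_try_count=True, filename=None):
        self.calls.append(("set_status", status, increment_try_count))

    def skip(self):
        self.calls.append(("skip",))


def settle(item_session, wanted, succeeded):
    """Record the outcome of one item on the session."""
    if not wanted:
        # The item is marked as processed without any download.
        item_session.skip()
    elif succeeded:
        item_session.set_status(Status.done)
    else:
        item_session.set_status(Status.error)


session = FakeItemSession()
settle(session, wanted=False, succeeded=False)
settle(session, wanted=True, succeeded=True)
settle(session, wanted=True, succeeded=False)
print(session.calls)
```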

update_record_value(**kwargs)

class wpull.pipeline.session.URLItemSource(app_session: wpull.pipeline.app.AppSession)

Bases: wpull.pipeline.pipeline.ItemSource

get_item() → typing.Union