pipeline.item Module

URL items.

class wpull.pipeline.item.LinkType

Bases: enum.Enum

The type of contents that a link is expected to have.

css = None

Stylesheet file. Recursion on links is usually safe.

directory = None

FTP directory.

file = None

FTP file.

html = None

HTML document.

javascript = None

JavaScript file. Recursing on links found in this file is possible.

media = None

Image or video file. Recursion on this type will not be useful.

sitemap = None

A Sitemap.xml file.
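As a rough illustration of how the recursion notes above might be applied, the sketch below defines a stand-in enum with the same member names. It does not import wpull; the member values and the `worth_recursing` helper are hypothetical, and the policy of skipping only `media` is an assumption drawn from the descriptions above.

```python
import enum


class LinkType(enum.Enum):
    """Stand-in mirroring the documented member names; values are made up."""
    css = 'css'
    directory = 'directory'
    file = 'file'
    html = 'html'
    javascript = 'javascript'
    media = 'media'
    sitemap = 'sitemap'


def worth_recursing(link_type):
    """Hypothetical policy: skip only the type documented as not useful."""
    return link_type is not LinkType.media


# Collect the names of the types this policy would still follow links from.
candidates = [t.name for t in LinkType if worth_recursing(t)]
print(candidates)
```

A real crawler would likely apply further filters; this only shows the enum being used as a switch.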

class wpull.pipeline.item.Status

Bases: enum.Enum

URL status.

done = None

The item has been processed successfully.

error = None

The item encountered an error during processing.

in_progress = None

The item is currently being processed.

skipped = None

The item was excluded from processing by one or more rejection filters.

todo = None

The item has not yet been processed.
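To show how these states might be summarized for a batch of URLs, here is a minimal sketch using a stand-in enum with the documented member names. The member values, the sample list, and the idea of counting `todo` plus `in_progress` as unfinished work are assumptions made for illustration, not wpull behavior.

```python
import collections
import enum


class Status(enum.Enum):
    """Stand-in with the documented member names; values are made up."""
    done = 'done'
    error = 'error'
    in_progress = 'in_progress'
    skipped = 'skipped'
    todo = 'todo'


# Hypothetical statuses for five URLs in a crawl.
statuses = [Status.done, Status.todo, Status.skipped, Status.done, Status.error]

# Tally how many URLs are in each state.
tally = collections.Counter(statuses)

# Assumption: items still waiting or being processed count as unfinished.
unfinished = tally[Status.todo] + tally[Status.in_progress]

print(tally[Status.done], unfinished)
```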

class wpull.pipeline.item.URLData

Bases: wpull.pipeline.item.URLDatabaseMixin

Data associated with fetching the URL.

post_data

str

If given, the URL should be fetched as a POST request containing post_data.
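The effect of `post_data` can be sketched with the standard library rather than wpull itself. The `build_request` helper, the example URL, and the UTF-8 encoding are assumptions for illustration; nothing here reflects how wpull constructs its requests internally, and no network traffic is sent.

```python
import urllib.request


def build_request(url, post_data=None):
    """Return a POST request when post_data is given, otherwise a GET."""
    if post_data is not None:
        # Request bodies must be bytes; UTF-8 is an assumption here.
        body = post_data.encode('utf-8')
        return urllib.request.Request(url, data=body, method='POST')
    return urllib.request.Request(url, method='GET')


# Only builds the request object; it is never opened.
request = build_request('http://example.com/search', post_data='q=test')
print(request.get_method(), request.data)
```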
database_attributes = ('post_data',)

class wpull.pipeline.item.URLDatabaseMixin

Bases: object

database_items()
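The reference gives no description for `database_items()`. One plausible reading, given that each subclass declares a `database_attributes` tuple, is that the method yields name and value pairs for those attributes. The sketch below is written under that assumption and is not taken from the wpull source; `SketchMixin` and `SketchData` are hypothetical names.

```python
class SketchMixin:
    """Hypothetical stand-in for URLDatabaseMixin."""
    database_attributes = ()

    def database_items(self):
        # Assumed behavior: yield (name, value) for each listed attribute.
        for name in self.database_attributes:
            yield name, getattr(self, name, None)


class SketchData(SketchMixin):
    """Hypothetical stand-in for URLData."""
    database_attributes = ('post_data',)

    def __init__(self, post_data=None):
        self.post_data = post_data


# Collect the yielded pairs into a dict for easy inspection.
items = dict(SketchData(post_data='q=test').database_items())
print(items)
```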

class wpull.pipeline.item.URLProperties

Bases: wpull.pipeline.item.URLDatabaseMixin

URL properties that determine whether a URL is fetched.

parent_url

str

The parent or referral URL that linked to this URL.

root_url

str

The earliest ancestor URL of this URL. This URL is typically the URL supplied at the start of the program.

status

Status

Processing status of this URL.

try_count

int

The number of attempts on this URL.

level

int

The recursive depth of this URL. A level of 0 indicates the URL was initially supplied to the program (the top URL). Level 1 means the URL was linked from the top URL.

inline_level

int

Whether this URL was an embedded object (such as an image or a stylesheet) of the parent URL.

The value represents the recursive depth of the object. For example, an iframe is depth 1 and the images in the iframe are depth 2.

link_type

LinkType

Describes the expected document type.
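The relationship between `level` and `inline_level` described above can be sketched as follows. The `Props` class and `child_props` helper are hypothetical, and the derivation rules (every child is one level deeper; only embedded objects carry an inline depth) are assumptions chosen to reproduce the iframe example, not confirmed wpull logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Props:
    """Hypothetical subset of URLProperties."""
    level: int = 0
    inline_level: Optional[int] = None


def child_props(parent, inline=False):
    """Derive a child's depths from its parent (assumed rules)."""
    # Assumption: every child is one level deeper than its parent.
    level = parent.level + 1
    # Assumption: embedded objects extend the parent's inline depth,
    # while ordinary links are not inline at all.
    inline_level = (parent.inline_level or 0) + 1 if inline else None
    return Props(level=level, inline_level=inline_level)


top = Props()                              # URL supplied to the program
iframe = child_props(top, inline=True)     # embedded in the top page
image = child_props(iframe, inline=True)   # embedded in the iframe
page = child_props(top)                    # ordinary link from the top page

print(iframe, image, page)
```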

database_attributes = ('parent_url', 'root_url', 'status', 'try_count', 'level', 'inline_level', 'link_type', 'priority')

parent_url_info

Return URL Info for the parent URL.

root_url_info

Return URL Info for the root URL.

class wpull.pipeline.item.URLRecord

Bases: wpull.pipeline.item.URLProperties, wpull.pipeline.item.URLData, wpull.pipeline.item.URLResult

An entry in the URL table describing a URL to be downloaded.

url

str

The URL.

url_info

Return URL Info for this URL.

class wpull.pipeline.item.URLResult

Bases: wpull.pipeline.item.URLDatabaseMixin

Data associated with the fetched URL.

status_code

int

The HTTP or FTP status code.

filename

str

The path to where the file was saved.

database_attributes = ('status_code', 'filename')