pipeline.item Module
URL items.
-
class wpull.pipeline.item.LinkType
Bases: enum.Enum
The type of contents that a link is expected to have.
-
css = None
Stylesheet file. Recursion on links is usually safe.
-
directory = None
FTP directory.
-
file = None
FTP file.
-
html = None
HTML document.
-
javascript = None
JavaScript file. Possible to recurse links on this file.
-
media = None
Image or video file. Recursion on this type will not be useful.
-
sitemap = None
A Sitemap.xml file.
-
-
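The member descriptions above hint at which link types are worth scanning for further links. A minimal sketch of such a policy follows. It uses a stand-in enum rather than importing wpull, the member values are assumed, and `should_scan_for_links` with its `allow_javascript` flag is a hypothetical helper, not part of the documented API. Treating every type other than `media` and `javascript` as scannable is also an assumption of this sketch.

```python
import enum


class LinkType(enum.Enum):
    """Stand-in mirroring the documented members; the values are assumed."""
    css = 'css'
    directory = 'directory'
    file = 'file'
    html = 'html'
    javascript = 'javascript'
    media = 'media'
    sitemap = 'sitemap'


def should_scan_for_links(link_type, allow_javascript=False):
    """Hypothetical policy: decide whether to look for links in a document."""
    if link_type is LinkType.media:
        # Images and video: recursion is documented as not useful.
        return False
    if link_type is LinkType.javascript:
        # Recursion is documented as possible, so make it opt-in here.
        return allow_javascript
    # Assumption: all remaining types are scanned by default.
    return True
```

Making JavaScript opt-in is one possible reading of "possible to recurse"; a real crawler may choose differently.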
class wpull.pipeline.item.Status
Bases: enum.Enum
URL status.
-
done = None
The item has been processed successfully.
-
error = None
The item encountered an error during processing.
-
in_progress = None
The item is currently being processed.
-
skipped = None
The item was excluded from processing by one or more rejection filters.
-
todo = None
The item has not yet been processed.
-
-
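As an illustration of how these status values might be consumed, the sketch below tallies a collection of statuses and flags the ones that still need work. The enum is a stand-in with assumed values, and `summarize` and `is_pending` are hypothetical helpers that do not exist in wpull.

```python
import collections
import enum


class Status(enum.Enum):
    """Stand-in mirroring the documented members; the values are assumed."""
    todo = 'todo'
    in_progress = 'in_progress'
    done = 'done'
    error = 'error'
    skipped = 'skipped'


def summarize(statuses):
    """Return a dict mapping every status name to its number of occurrences."""
    counts = collections.Counter(statuses)
    return {status.name: counts[status] for status in Status}


def is_pending(status):
    """Hypothetical check: True if the item is not yet or still being processed."""
    return status in (Status.todo, Status.in_progress)
```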
class wpull.pipeline.item.URLData
Bases: wpull.pipeline.item.URLDatabaseMixin
Data associated with fetching the URL.
- post_data (str): If given, the URL should be fetched as a POST request containing post_data.
-
database_attributes = ('post_data',)
-
class wpull.pipeline.item.URLProperties
Bases: wpull.pipeline.item.URLDatabaseMixin
URL properties that determine whether a URL is fetched.
-
parent_url (str)
The parent or referral URL that linked to this URL.
-
root_url (str)
The earliest ancestor URL of this URL. This is typically the URL supplied at the start of the program.
-
status (Status)
Processing status of this URL.
-
try_count (int)
The number of attempts made on this URL.
-
level (int)
The recursive depth of this URL. A level of 0 indicates the URL was initially supplied to the program (the top URL). Level 1 means the URL was linked from the top URL.
-
inline_level (int)
Whether this URL was an embedded object (such as an image or a stylesheet) of the parent URL.
The value represents the recursive depth of the object. For example, an iframe is depth 1 and the images in the iframe are depth 2.
-
link_type (LinkType)
Describes the expected document type.
-
database_attributes = ('parent_url', 'root_url', 'status', 'try_count', 'level', 'inline_level', 'link_type', 'priority')
-
parent_url_info
Return URL Info for the parent URL.
-
root_url_info
Return URL Info for the root URL.
-
-
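The relationship between `level`, `inline_level`, `parent_url`, and `root_url` can be sketched with a small stand-in class. Everything here is illustrative: `PropsSketch` and `make_child` are hypothetical names, using `0` for "not an embedded object" is an assumption, and so is incrementing `level` for embedded objects as well as for ordinary links.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropsSketch:
    """Hypothetical stand-in holding a few of the documented properties."""
    url: str
    parent_url: Optional[str] = None
    root_url: Optional[str] = None
    level: int = 0          # 0 means the URL was supplied at program start
    inline_level: int = 0   # assumption: 0 means not an embedded object


def make_child(parent, url, inline=False):
    """Derive the properties of a URL discovered in the parent document."""
    return PropsSketch(
        url=url,
        parent_url=parent.url,
        # The root is inherited; a top URL acts as its own root.
        root_url=parent.root_url or parent.url,
        level=parent.level + 1,
        # Embedded objects nest: an iframe is 1, an image inside it is 2.
        inline_level=parent.inline_level + 1 if inline else 0,
    )


top = PropsSketch(url='http://example.com/')
iframe = make_child(top, 'http://example.com/frame.html', inline=True)
image = make_child(iframe, 'http://example.com/logo.png', inline=True)
page = make_child(top, 'http://example.com/about.html')
```

In this sketch `iframe.inline_level` is 1 and `image.inline_level` is 2, matching the iframe example in the attribute description, while `page` is an ordinary link at level 1.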
class wpull.pipeline.item.URLRecord
Bases: wpull.pipeline.item.URLProperties, wpull.pipeline.item.URLData, wpull.pipeline.item.URLResult
An entry in the URL table describing a URL to be downloaded.
-
url (str)
The URL.
-
url_info
Return URL Info for this URL.
-
-
class wpull.pipeline.item.URLResult
Bases: wpull.pipeline.item.URLDatabaseMixin
Data associated with the fetched URL.
- status_code (int): The HTTP or FTP status code.
- filename (str): The path to where the file was saved.
-
database_attributes = ('status_code', 'filename')
-
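Each class above lists its persisted fields in a `database_attributes` tuple. The following is a rough sketch of how a mixin might use such a tuple to produce name/value pairs for storage. The class names and the `database_items` method with its `include_none` parameter are hypothetical stand-ins written for this example; the page above does not document the interface of `URLDatabaseMixin`.

```python
class DatabaseMixinSketch:
    """Hypothetical mixin that exposes the attributes named by a tuple."""
    database_attributes = ()

    def database_items(self, include_none=False):
        """Yield (name, value) pairs for each listed attribute.

        Attributes that are None are skipped unless include_none is True.
        """
        for name in self.database_attributes:
            value = getattr(self, name, None)
            if include_none or value is not None:
                yield name, value


class ResultSketch(DatabaseMixinSketch):
    """Stand-in shaped like URLResult, with the same attribute tuple."""
    database_attributes = ('status_code', 'filename')

    def __init__(self, status_code=None, filename=None):
        self.status_code = status_code
        self.filename = filename


result = ResultSketch(status_code=200)
row = dict(result.database_items())
full_row = dict(result.database_items(include_none=True))
```

Declaring the field names once on the class keeps the storage code generic: the same loop works for any subclass that defines its own tuple.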