pipeline.item Module
URL items.
-
class
wpull.pipeline.item.
LinkType
[source]¶ Bases:
enum.Enum
The type of contents that a link is expected to have.
-
css
= None¶ Stylesheet file. Recursion on links is usually safe.
-
directory
= None¶ FTP directory.
-
file
= None¶ FTP file.
-
html
= None¶ HTML document.
-
javascript
= None¶ JavaScript file. It is possible to recurse into links found in this file.
-
media
= None¶ Image or video file. Recursion on this type will not be useful.
-
sitemap
= None¶ A Sitemap.xml file.
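The recursion notes on the members above can be sketched as a small policy function. This is a minimal illustration, not wpull's own logic: the enum below is a stand-in that mirrors the member names, its values are made up, and both the `RECURSABLE` set and `should_scan_for_links` are hypothetical names. Treating `html`, `sitemap`, and `directory` as worth scanning is an assumption; the descriptions above only comment on `css`, `javascript`, and `media`.

```python
from enum import Enum


class LinkType(Enum):
    """Stand-in mirroring the documented member names; values are illustrative."""
    css = 'css'
    directory = 'directory'
    file = 'file'
    html = 'html'
    javascript = 'javascript'
    media = 'media'
    sitemap = 'sitemap'


# Hypothetical policy: types whose contents may be worth scanning for links.
RECURSABLE = {
    LinkType.html,
    LinkType.css,
    LinkType.javascript,
    LinkType.sitemap,
    LinkType.directory,
}


def should_scan_for_links(link_type):
    """Return True if a document of this expected type may contain links."""
    # An unknown type (None) is scanned because its contents cannot be ruled out.
    return link_type is None or link_type in RECURSABLE
```

With this policy, `should_scan_for_links(LinkType.media)` is false while `should_scan_for_links(LinkType.css)` is true.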
-
-
class
wpull.pipeline.item.
Status
[source]¶ Bases:
enum.Enum
URL status.
-
done
= None¶ The item has been processed successfully.
-
error
= None¶ The item encountered an error during processing.
-
in_progress
= None¶ The item is currently being processed.
-
skipped
= None¶ The item was excluded from processing by one or more rejection filters.
-
todo
= None¶ The item has not yet been processed.
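As a rough illustration of how these statuses might be used, the sketch below tallies a list of statuses and counts the items that still need work. The enum is a stand-in mirroring the documented member names with made-up values, and `summarize` is a hypothetical helper. Counting `todo` and `in_progress` as unfinished is an interpretation of the descriptions above.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    """Stand-in mirroring the documented member names; values are illustrative."""
    todo = 'todo'
    in_progress = 'in_progress'
    done = 'done'
    error = 'error'
    skipped = 'skipped'


def summarize(statuses):
    """Return a per-status tally and the number of unfinished items."""
    counts = Counter(statuses)
    # Items that are queued or currently being fetched are not finished yet.
    pending = counts[Status.todo] + counts[Status.in_progress]
    return counts, pending
```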
-
-
class
wpull.pipeline.item.
URLData
[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixin
Data associated with fetching the URL.
- post_data (str): If given, the URL should be fetched as a POST request containing post_data.
-
database_attributes
= ('post_data',)¶
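The effect of `post_data` on the request method can be sketched with the standard library. This example uses `urllib.request` rather than wpull's own request classes, and `build_request` is a hypothetical helper; UTF-8 encoding of the body is also an assumption.

```python
from urllib.request import Request


def build_request(url, post_data=None):
    """Build a GET request, or a POST request when post_data is given."""
    if post_data is None:
        return Request(url)
    # Assumption: the POST body is the post_data string encoded as UTF-8.
    return Request(url, data=post_data.encode('utf-8'), method='POST')
```

No network access happens here; the function only constructs the request object.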
-
class
wpull.pipeline.item.
URLProperties
[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixin
URL properties that determine whether a URL is fetched.
-
parent_url
¶ str
The parent or referral URL that linked to this URL.
-
root_url
¶ str
The earliest ancestor URL of this URL. This URL is typically the URL supplied at the start of the program.
-
status
¶ Status
Processing status of this URL.
-
try_count
¶ int
The number of attempts on this URL.
-
level
¶ int
The recursive depth of this URL. A level of 0 indicates the URL was initially supplied to the program (the top URL). Level 1 means the URL was linked from the top URL.
-
inline_level
¶ int
Whether this URL was an embedded object (such as an image or a stylesheet) of the parent URL.
The value represents the recursive depth of the object. For example, an iframe is depth 1 and the images in the iframe are depth 2.
-
link_type
¶ LinkType
Describes the expected document type.
-
database_attributes
= ('parent_url', 'root_url', 'status', 'try_count', 'level', 'inline_level', 'link_type', 'priority')¶
-
parent_url_info
¶ Return URL Info for the parent URL.
-
root_url_info
¶ Return URL Info for the root URL.
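The relationship between `level`, `inline_level`, `parent_url`, and `root_url` can be sketched as follows. This is a minimal illustration under assumptions, not the library's implementation: `Props` is a hypothetical stand-in that holds only a subset of the fields plus the URL itself, and `child_of` is a hypothetical helper. It assumes that `inline_level` is `None` for ordinary links and increases by one for each nested embedded object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Props:
    """Hypothetical stand-in holding a subset of the fields described above."""
    url: str
    parent_url: Optional[str] = None
    root_url: Optional[str] = None
    level: int = 0
    inline_level: Optional[int] = None


def child_of(parent, url, inline=False):
    """Derive the properties of a URL discovered in the parent document."""
    # Assumption: embedded objects count their own nesting depth, and
    # ordinary (non-embedded) links carry no inline level.
    inline_level = (parent.inline_level or 0) + 1 if inline else None
    return Props(
        url=url,
        parent_url=parent.url,
        # The top URL has no root of its own, so it becomes the root.
        root_url=parent.root_url or parent.url,
        level=parent.level + 1,
        inline_level=inline_level,
    )


top = Props(url='http://example.com/')
frame = child_of(top, 'http://example.com/frame.html', inline=True)
image = child_of(frame, 'http://example.com/a.png', inline=True)
```

Here `top` has level 0, `frame` has level 1 and inline level 1, and `image` has level 2 and inline level 2, matching the iframe example above.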
-
-
class
wpull.pipeline.item.
URLRecord
[source]¶ Bases:
wpull.pipeline.item.URLProperties, wpull.pipeline.item.URLData, wpull.pipeline.item.URLResult
An entry in the URL table describing a URL to be downloaded.
-
url
¶ str
The URL.
-
url_info
¶ Return URL Info for this URL.
-
-
class
wpull.pipeline.item.
URLResult
[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixin
Data associated with the fetched URL.
- status_code (int): The HTTP or FTP status code.
- filename (str): The path to where the file was saved.
-
database_attributes
= ('status_code', 'filename')¶
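This page does not describe how `database_attributes` is consumed. One plausible reading is that the tuple names the attributes to be persisted, which the sketch below illustrates. `DatabaseMixin`, `Result`, and the `database_items` method are hypothetical stand-ins written for this example, and skipping unset (`None`) values is an assumption.

```python
class DatabaseMixin:
    """Hypothetical stand-in for a mixin driven by database_attributes."""
    database_attributes = ()

    def database_items(self):
        """Yield (name, value) pairs for the attributes that are set."""
        for name in self.database_attributes:
            value = getattr(self, name, None)
            # Assumption: unset attributes are left out of the stored row.
            if value is not None:
                yield name, value


class Result(DatabaseMixin):
    """Hypothetical stand-in carrying the two documented result fields."""
    database_attributes = ('status_code', 'filename')

    def __init__(self, status_code=None, filename=None):
        self.status_code = status_code
        self.filename = filename
```

For instance, `dict(Result(status_code=200).database_items())` contains only the `status_code` entry because `filename` was never set.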
-