warc.recorder Module

class wpull.warc.recorder.BaseWARCRecorderSession(recorder, temp_dir=None, url_table=None)
Bases: object
Base WARC recorder session.

class wpull.warc.recorder.FTPWARCRecorderSession(*args, **kwargs)
Bases: wpull.warc.recorder.BaseWARCRecorderSession
FTP WARC recorder session.

class wpull.warc.recorder.HTTPWARCRecorderSession(*args, **kwargs)
Bases: wpull.warc.recorder.BaseWARCRecorderSession
HTTP WARC recorder session.

class wpull.warc.recorder.WARCRecorder(filename, params=None)
Bases: object
Record to WARC file.
Parameters:
- filename (str) – The filename (without the extension).
- params (WARCRecorderParams) – Parameters.

CDX_DELIMINATOR = ' '
Default CDX delimiter.

DEFAULT_SOFTWARE_STRING = 'Wpull/2.0.1 Python/3.4.3'
Default software string.

classmethod parse_mimetype(value)
Return the MIME type from a Content-Type string.
Returns: A string in the form type/subtype or None.
Return type: str, None
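
The behaviour described for parse_mimetype can be illustrated with a small standalone sketch. This is not the wpull implementation: the function name parse_mimetype_sketch and the regular expression are assumptions made for illustration, and the real method may accept or reject different inputs.

```python
import re

# Hypothetical pattern (an assumption, not taken from wpull): optional
# leading whitespace, then a "type/subtype" pair. Matching stops at the
# first character outside the token set, such as ";" before parameters.
_MIME_TYPE_RE = re.compile(r"\s*([A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+)")


def parse_mimetype_sketch(value):
    """Return "type/subtype" from a Content-Type string, or None."""
    match = _MIME_TYPE_RE.match(value)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(parse_mimetype_sketch("text/html; charset=UTF-8"))
    print(parse_mimetype_sketch("not a content type"))
```

Returning None rather than raising mirrors the documented return type (str, None), so callers can treat a missing or malformed header as "unknown" without exception handling.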

wpull.warc.recorder.WARCRecorderParams
WARCRecorder parameters.
Parameters:
- compress (bool) – If True, files will be compressed with gzip.
- extra_fields (list) – A list of key-value pairs containing extra metadata fields.
- temp_dir (str) – Directory to use for temporary files.
- log (bool) – Include the program logging messages in the WARC file.
- appending (bool) – If True, the file is not overwritten upon opening.
- digests (bool) – If True, the SHA1 hash digests will be written.
- cdx (bool) – If True, a CDX file will be written.
- max_size (int) – If provided, output files are named like name-00000.ext and the log file will be in name-meta.ext.
- move_to (str) – If provided, completed WARC files and CDX files will be moved to the given directory.
- url_table (database.URLTable) – If given, then revisit records will be written.
- software_string (str) – The value for the software field in the Warcinfo record.
alias of WARCRecorderParamsType
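
As a rough illustration of how such a parameter bundle might be built and how the max_size naming scheme reads, the sketch below uses a plain namedtuple stand-in. Everything here is hypothetical: ParamsSketch, sequence_filename and meta_filename are not wpull names, and the all-None defaults are placeholders rather than the library's actual defaults. Only the field names and the name-00000.ext / name-meta.ext patterns come from the listing above.

```python
from collections import namedtuple

# Field names as documented above.
_FIELDS = (
    "compress", "extra_fields", "temp_dir", "log", "appending", "digests",
    "cdx", "max_size", "move_to", "url_table", "software_string",
)

# Stand-in type; the None defaults are placeholders (an assumption),
# not the defaults used by wpull. Requires Python 3.7+ for `defaults`.
ParamsSketch = namedtuple("ParamsSketch", _FIELDS,
                          defaults=(None,) * len(_FIELDS))


def sequence_filename(name, index, ext):
    """Format a numbered output name following the name-00000.ext pattern."""
    return "{}-{:05d}.{}".format(name, index, ext)


def meta_filename(name, ext):
    """Format the log file name following the name-meta.ext pattern."""
    return "{}-meta.{}".format(name, ext)


if __name__ == "__main__":
    params = ParamsSketch(compress=True, cdx=True, max_size=1024 ** 3)
    print(params.max_size)
    print(sequence_filename("crawl", 0, "warc"))
    print(meta_filename("crawl", "warc"))
```

Because the stand-in is an immutable tuple, a variant configuration is created with `params._replace(compress=False)` rather than by assignment.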