warc.recorder Module
class wpull.warc.recorder.BaseWARCRecorderSession(recorder, temp_dir=None, url_table=None)
Bases: object
Base WARC recorder session.
class wpull.warc.recorder.FTPWARCRecorderSession(*args, **kwargs)
Bases: wpull.warc.recorder.BaseWARCRecorderSession
FTP WARC recorder session.
class wpull.warc.recorder.HTTPWARCRecorderSession(*args, **kwargs)
Bases: wpull.warc.recorder.BaseWARCRecorderSession
HTTP WARC recorder session.
class wpull.warc.recorder.WARCRecorder(filename, params=None)
Bases: object
Record to WARC file.
Parameters:
- filename (str) – The filename (without the extension).
- params (WARCRecorderParams) – Parameters.
CDX_DELIMINATOR = ' '
Default CDX delimiter.
DEFAULT_SOFTWARE_STRING = 'Wpull/2.0.1 Python/3.4.3'
Default software string.
classmethod parse_mimetype(value)
Return the MIME type from a Content-Type string.
Returns: A string in the form type/subtype or None.
Return type: str, None
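
To illustrate the documented behavior, here is a minimal sketch of extracting a type/subtype string from a Content-Type value. It is not the wpull implementation: the function name parse_mimetype_sketch and the regular expression are assumptions made for this example, and the real method may treat edge cases (letter case, malformed input) differently.

```python
import re

# Hypothetical stand-in for illustration only; not wpull's actual code.
# Token characters are a simplified subset of those allowed in media types.
_MIMETYPE_PATTERN = re.compile(
    r'\s*([A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+)'
)


def parse_mimetype_sketch(value):
    """Return 'type/subtype' from a Content-Type string, or None."""
    if not value:
        return None

    # Match only the leading media type; parameters such as
    # "; charset=UTF-8" are ignored.
    match = _MIMETYPE_PATTERN.match(value)

    if match:
        return match.group(1)

    return None
```

Under these assumptions, a value such as 'text/html; charset=UTF-8' yields 'text/html', while a value with no type/subtype pair yields None.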
wpull.warc.recorder.WARCRecorderParams
WARCRecorder parameters.
Parameters:
- compress (bool) – If True, files will be compressed with gzip.
- extra_fields (list) – A list of key-value pairs containing extra metadata fields.
- temp_dir (str) – Directory to use for temporary files.
- log (bool) – Include the program logging messages in the WARC file.
- appending (bool) – If True, the file is not overwritten upon opening.
- digests (bool) – If True, the SHA1 hash digests will be written.
- cdx (bool) – If True, a CDX file will be written.
- max_size (int) – If provided, output files are named like name-00000.ext and the log file will be in name-meta.ext.
- move_to (str) – If provided, completed WARC files and CDX files will be moved to the given directory.
- url_table (database.URLTable) – If given, then revisit records will be written.
- software_string (str) – The value for the software field in the Warcinfo record.
Alias of WARCRecorderParamsType.
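
The naming pattern described for max_size can be sketched as follows. This is only an illustration of the name-00000.ext and name-meta.ext patterns quoted above: the helper names, the 'warc.gz' default extension, and the assumption that sequence numbers start at zero and increase by one are not taken from the wpull source.

```python
# Hypothetical helpers for illustration only; not part of wpull.

def sequence_filenames(name, count, extension='warc.gz'):
    """Return `count` names following the name-00000.ext pattern.

    Assumes a five-digit, zero-padded counter that starts at 0.
    """
    return [
        '{0}-{1:05d}.{2}'.format(name, index, extension)
        for index in range(count)
    ]


def meta_filename(name, extension='warc.gz'):
    """Return the log file name following the name-meta.ext pattern."""
    return '{0}-meta.{1}'.format(name, extension)
```

With a base name of 'crawl', the first two sequence names would be crawl-00000.warc.gz and crawl-00001.warc.gz, and the log file would be crawl-meta.warc.gz, assuming the extension chosen here.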