warc.recorder Module

class wpull.warc.recorder.BaseWARCRecorderSession(recorder, temp_dir=None, url_table=None)

Bases: object

Base WARC recorder session.

close()

class wpull.warc.recorder.FTPWARCRecorderSession(*args, **kwargs)

Bases: wpull.warc.recorder.BaseWARCRecorderSession

FTP WARC Recorder Session.

begin_control(request: wpull.protocol.ftp.request.Request, connection_reused: bool=False)
begin_transfer(response: wpull.protocol.ftp.request.Response)
close(error=None)
control_receive_data(data)
control_send_data(data)
end_control(response: wpull.protocol.ftp.request.Response, connection_closed=False)
end_transfer(response: wpull.protocol.ftp.request.Response)
transfer_receive_data(data: bytes)

class wpull.warc.recorder.HTTPWARCRecorderSession(*args, **kwargs)

Bases: wpull.warc.recorder.BaseWARCRecorderSession

HTTP WARC Recorder Session.

begin_request(request: wpull.protocol.http.request.Request)
begin_response(response: wpull.protocol.http.request.Response)
close()
end_request(request: wpull.protocol.http.request.Request)
end_response(response: wpull.protocol.http.request.Response)
request_data(data: bytes)
response_data(data: bytes)

class wpull.warc.recorder.WARCRecorder(filename, params=None)

Bases: object

Record to WARC file.

Parameters:
  • filename (str) – The filename (without the extension).
  • params (WARCRecorderParams) – Parameters.
CDX_DELIMINATOR = ' '

Default CDX delimiter.

DEFAULT_SOFTWARE_STRING = 'Wpull/2.0.1 Python/3.4.3'

Default software string.

close()

Close the WARC file and clean up any logging handlers.

flush_session()
listen_to_ftp_client(client: wpull.protocol.ftp.client.Client)
listen_to_http_client(client: wpull.protocol.http.client.Client)
new_ftp_recorder_session() → 'FTPWARCRecorderSession'
new_http_recorder_session() → 'HTTPWARCRecorderSession'
classmethod parse_mimetype(value)

Return the MIME type from a Content-Type string.

Returns: A string in the form type/subtype, or None.
Return type: str, None
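
As an illustration of the behaviour described above, here is a minimal standalone sketch that pulls a type/subtype token out of a Content-Type string. It is not wpull's implementation: the function name `sketch_parse_mimetype` and the regular expression are assumptions chosen for the example.

```python
import re

# Hypothetical pattern: a type token, a slash, and a subtype token,
# optionally preceded by whitespace. Parameters such as "; charset=..."
# are ignored because matching stops at the first non-token character.
_MIMETYPE_RE = re.compile(
    r'\s*([A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+)'
)


def sketch_parse_mimetype(value):
    """Return 'type/subtype' from a Content-Type string, or None."""
    match = _MIMETYPE_RE.match(value or '')
    if match:
        return match.group(1)
    return None


mimetype = sketch_parse_mimetype('text/html; charset=UTF-8')
missing = sketch_parse_mimetype('not a content type')
```

Here `mimetype` holds the type/subtype part without the charset parameter, and `missing` is None because the input contains no type/subtype pair.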
set_length_and_maybe_checksums(record, payload_offset=None)

Set the content length and possibly the checksums.
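
The following sketch shows one plausible way to derive a content length and SHA-1 digests for a record block. It is an assumption-laden illustration, not the method's actual code: the function name, the return shape, and the reading of `payload_offset` as the byte index where the payload starts within the block are all hypothetical. The "sha1:" plus Base32 label follows the common WARC convention.

```python
import base64
import hashlib


def sketch_length_and_digests(block, payload_offset=None):
    """Return (content_length, block_digest, payload_digest_or_None)."""
    def labelled_sha1(data):
        # WARC digests are conventionally written as "sha1:" + Base32.
        raw = hashlib.sha1(data).digest()
        return 'sha1:' + base64.b32encode(raw).decode('ascii')

    payload_digest = None
    if payload_offset is not None:
        # Assumption: the payload is everything from the offset onward.
        payload_digest = labelled_sha1(block[payload_offset:])
    return len(block), labelled_sha1(block), payload_digest


# Example block: an HTTP response whose body begins after the blank line.
block = b'HTTP/1.1 200 OK\r\n\r\nhello'
offset = block.index(b'\r\n\r\n') + 4
length, block_digest, payload_digest = sketch_length_and_digests(block, offset)
```

When no offset is given, the sketch returns only the length and the block digest, which mirrors the "possibly" in the description above.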

write_record(record)

Append the record to the WARC file.

wpull.warc.recorder.WARCRecorderParams

WARCRecorder parameters.

Parameters:
  • compress (bool) – If True, files will be compressed with gzip.
  • extra_fields (list) – A list of key-value pairs containing extra metadata fields.
  • temp_dir (str) – Directory to use for temporary files.
  • log (bool) – If True, the program's log messages are included in the WARC file.
  • appending (bool) – If True, the file is not overwritten upon opening.
  • digests (bool) – If True, the SHA-1 hash digests will be written.
  • cdx (bool) – If True, a CDX file will be written.
  • max_size (int) – If provided, output files are named like name-00000.ext and the log file is written to name-meta.ext.
  • move_to (str) – If provided, completed WARC files and CDX files will be moved to the given directory.
  • url_table (database.URLTable) – If given, revisit records will be written.
  • software_string (str) – The value for the software field in the Warcinfo record.

alias of WARCRecorderParamsType
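
To make the parameter list concrete, the sketch below models the documented fields with a stand-in named tuple and shows the file naming described for max_size. Everything here is hypothetical: `SketchParams`, its default values, `sketch_output_names`, and the literal `ext` extension are assumptions for illustration, and the single-file case (name.ext) is inferred from the filename parameter of WARCRecorder rather than stated by the source.

```python
from typing import NamedTuple, Optional


class SketchParams(NamedTuple):
    """Hypothetical stand-in mirroring the documented field names."""
    compress: bool = False
    extra_fields: Optional[list] = None
    temp_dir: Optional[str] = None
    log: bool = True
    appending: bool = False
    digests: bool = True
    cdx: bool = False
    max_size: Optional[int] = None
    move_to: Optional[str] = None
    url_table: Optional[object] = None
    software_string: Optional[str] = None


def sketch_output_names(prefix, params, count=2, extension='ext'):
    """List the names implied by the max_size description (assumed)."""
    if params.max_size is None:
        # Assumption: a single output file named after the prefix.
        return ['{}.{}'.format(prefix, extension)]
    # Sequenced files are numbered with five zero-padded digits.
    names = ['{}-{:05d}.{}'.format(prefix, index, extension)
             for index in range(count)]
    # The log goes into a separate "meta" file.
    names.append('{}-meta.{}'.format(prefix, extension))
    return names


params = SketchParams(compress=True, cdx=True, max_size=1000)
names = sketch_output_names('crawl', params)
```

A named tuple is immutable, so a variant is made with `params._replace(cdx=False)` rather than by assigning to a field.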