warc.format Module

WARC format.

For the WARC file specification, see http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.

For the CDX specifications, see https://archive.org/web/researcher/cdx_file_format.php and https://github.com/internetarchive/CDX-Writer.

class wpull.warc.format.WARCRecord

Bases: object

A record in a WARC file.

fields

An instance of namevalue.NameValueRecord.

block_file

A file object. May be None.

CONTENT_TYPE = 'Content-Type'
NAME_OVERRIDES = frozenset({'WARC-Segment-Number', 'WARC-Record-ID', 'WARC-Date', 'WARC-Type', 'Content-Type', 'WARC-Target-URI', 'WARC-Segment-Total-Length', 'WARC-Concurrent-To', 'WARC-Refers-To', 'WARC-Payload-Digest', 'WARC-Truncated', 'WARC-Identified-Payload-Type', 'WARC-Segment-Origin-ID', 'WARC-IP-Address', 'WARC-Warcinfo-ID', 'Content-Length', 'WARC-Filename', 'WARC-Block-Digest', 'WARC-Profile'})

Overrides for field name case normalization. They are needed because hanzo’s warc-tools do not adequately conform to the specifications.
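A minimal sketch of how such an override set could be applied, assuming a default rule that capitalizes each hyphen-separated part of a name. The helper `normalize_name` and the lookup table `_OVERRIDE_MAP` are hypothetical names for illustration and are not part of wpull; the set below is only a subset of the real constant.

```python
# Subset of the documented NAME_OVERRIDES, shortened for illustration.
NAME_OVERRIDES = frozenset({
    'WARC-Record-ID', 'WARC-Target-URI', 'WARC-IP-Address',
    'WARC-Type', 'Content-Type',
})

# Hypothetical lookup table: lowercased name -> canonical spelling.
_OVERRIDE_MAP = {name.lower(): name for name in NAME_OVERRIDES}


def normalize_name(name):
    """Return a canonical field name (hypothetical helper, not wpull API)."""
    key = name.strip().lower()
    if key in _OVERRIDE_MAP:
        # An override exists, so use the exact spelling from the set.
        return _OVERRIDE_MAP[key]
    # Assumed default rule: capitalize each hyphen-separated part.
    return '-'.join(part.capitalize() for part in key.split('-'))


print(normalize_name('warc-record-id'))   # WARC-Record-ID
print(normalize_name('x-archive-note'))   # X-Archive-Note
```

Without the override, the assumed default rule would turn `warc-record-id` into `Warc-Record-Id`, which is the kind of spelling a strict reader may not recognize.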

REQUEST = 'request'
RESPONSE = 'response'
REVISIT = 'revisit'
SAME_PAYLOAD_DIGEST_URI = 'http://netpreserve.org/warc/1.0/revisit/identical-payload-digest'
TYPE_REQUEST = 'application/http;msgtype=request'
TYPE_RESPONSE = 'application/http;msgtype=response'
VERSION = 'WARC/1.0'
WARCINFO = 'warcinfo'
WARC_DATE = 'WARC-Date'
WARC_FIELDS = 'application/warc-fields'
WARC_RECORD_ID = 'WARC-Record-ID'
WARC_TYPE = 'WARC-Type'
compute_checksum(payload_offset: typing.Union=None)

Compute and add the checksum data to the record fields.

This function also sets the content length.
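The idea of computing a digest and the content length in one pass over the block can be sketched with the standard library alone. This is not wpull's implementation: the function name `compute_digest_fields` is hypothetical, the choice of SHA-1 with a Base32-encoded `sha1:` label is an assumption, and the handling of `payload_offset` is left out.

```python
import base64
import hashlib
import io


def compute_digest_fields(block_file, chunk_size=4096):
    """Hash a file object in chunks (hypothetical helper, not wpull API).

    Returns a dict with an assumed 'sha1:<base32>' block digest and the
    number of bytes read, as a string.
    """
    start = block_file.tell()
    hasher = hashlib.sha1()
    length = 0

    while True:
        data = block_file.read(chunk_size)
        if not data:
            break
        hasher.update(data)
        length += len(data)

    # Restore the offset so the caller can still read the block.
    block_file.seek(start)

    digest = base64.b32encode(hasher.digest()).decode('ascii')
    return {
        'WARC-Block-Digest': 'sha1:' + digest,
        'Content-Length': str(length),
    }


fields = compute_digest_fields(io.BytesIO(b'hello world'))
print(fields['Content-Length'])  # 11
```

Reading in fixed-size chunks keeps memory use flat for large record blocks, and a single pass is enough to obtain both values.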

get_http_header() → wpull.protocol.http.request.Response

Return the HTTP header.

It only attempts to read the first 4 KiB of the payload.

Returns: An instance of http.request.Response or None.
Return type: Response, None
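A bounded header read of this kind might look like the following sketch. It does not use wpull's `Response` class; the helper `read_status_line` is hypothetical, and returning None when no complete header is found within the limit is an assumption about the behaviour rather than something the documentation states.

```python
import io


def read_status_line(block_file, limit=4096):
    """Parse an HTTP status line from the first `limit` bytes.

    Hypothetical helper, not wpull API. Returns a tuple of
    (version, status_code, reason) or None.
    """
    start = block_file.tell()
    try:
        data = block_file.read(limit)
    finally:
        # Leave the file offset where it was found.
        block_file.seek(start)

    header, separator, _ = data.partition(b'\r\n\r\n')
    if not separator:
        # Assumption: an incomplete header within the limit yields None.
        return None

    status_line = header.split(b'\r\n', 1)[0].decode('latin-1')
    parts = status_line.split(' ', 2)
    if len(parts) < 2 or not parts[0].startswith('HTTP/') \
            or not parts[1].isdigit():
        return None

    reason = parts[2] if len(parts) == 3 else ''
    return parts[0], int(parts[1]), reason


payload = io.BytesIO(
    b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>'
)
print(read_status_line(payload))  # ('HTTP/1.1', 200, 'OK')
```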
set_common_fields(warc_type: str, content_type: str)

Set the required fields for the record.
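As a rough illustration, the fields named by the WARC_TYPE, CONTENT_TYPE, WARC_DATE and WARC_RECORD_ID constants could be filled in as below. Which fields the method actually sets is an assumption here, and `make_common_fields` is a hypothetical stand-alone helper that returns a plain dict instead of updating a record.

```python
import datetime
import uuid


def make_common_fields(warc_type, content_type):
    """Build a dict of commonly required WARC fields.

    Hypothetical helper, not wpull API. The date is UTC in the
    YYYY-MM-DDThh:mm:ssZ form, and the record ID is a random UUID URN
    wrapped in angle brackets.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        'WARC-Type': warc_type,
        'Content-Type': content_type,
        'WARC-Date': now.strftime('%Y-%m-%dT%H:%M:%SZ'),
        'WARC-Record-ID': '<{}>'.format(uuid.uuid4().urn),
    }


fields = make_common_fields('response', 'application/http;msgtype=response')
for name, value in fields.items():
    print('{}: {}'.format(name, value))
```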

set_content_length()

Find and set the content length.

See also

compute_checksum().

wpull.warc.format.read_cdx(file, encoding='utf8')

Iterate CDX file.

Parameters:
  • file (file object) – The CDX file to read, given as a file object.
  • encoding (str) – The encoding of the file.
Returns: Each item is a dict that maps from field key to value.
Return type: iterator
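The mapping from field key to value can be sketched as follows, assuming the CDX layout in which the first character of the header line is the delimiter and the header then reads `CDX` followed by one-letter field keys. The generator `iter_cdx` is a hypothetical stand-in that takes an already decoded text file, so it has no `encoding` parameter.

```python
import io


def iter_cdx(text_file):
    """Yield one dict per CDX line (hypothetical helper, not wpull API)."""
    header = text_file.readline().rstrip('\r\n')
    if not header:
        raise ValueError('CDX header not found.')

    # Assumed layout: the first character of the header is the delimiter.
    separator = header[0]
    keys = header[1:].split(separator)
    if keys[0] != 'CDX':
        raise ValueError('CDX header not found.')
    field_keys = keys[1:]

    for line in text_file:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        yield dict(zip(field_keys, line.split(separator)))


sample = io.StringIO(
    ' CDX a b s\n'
    'http://example.com/ 20140101000000 200\n'
)
for row in iter_cdx(sample):
    print(row)
# {'a': 'http://example.com/', 'b': '20140101000000', 's': '200'}
```

Because it is a generator, rows are produced lazily and a large index file does not need to be loaded into memory at once.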