warc.format Module
WARC format.
For the WARC file specification, see http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
For the CDX specifications, see https://archive.org/web/researcher/cdx_file_format.php and https://github.com/internetarchive/CDX-Writer.
- class wpull.warc.format.WARCRecord
  Bases: object
  A record in a WARC file.
- fields
  An instance of namevalue.NameValueRecord.
- block_file
  A file object. May be None.
- CONTENT_TYPE = 'Content-Type'
- NAME_OVERRIDES = frozenset({'WARC-Filename', 'WARC-Type', 'WARC-Warcinfo-ID', 'WARC-Profile', 'WARC-Record-ID', 'Content-Type', 'WARC-Concurrent-To', 'WARC-Segment-Number', 'WARC-Refers-To', 'WARC-Payload-Digest', 'WARC-IP-Address', 'WARC-Identified-Payload-Type', 'WARC-Date', 'Content-Length', 'WARC-Target-URI', 'WARC-Block-Digest', 'WARC-Truncated', 'WARC-Segment-Total-Length', 'WARC-Segment-Origin-ID'})
  Field names whose exact capitalization overrides the default case normalization. The overrides exist because hanzo’s warc-tools do not adequately conform to the specifications.
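The effect of a case-normalization override can be sketched as follows. This is a minimal illustration, not wpull's implementation: the function name `normalize_field_name` is hypothetical, and it assumes the default normalization is plain title-casing.

```python
def normalize_field_name(name, overrides=frozenset()):
    """Title-case a field name unless an override spells it differently.

    Hypothetical helper for illustration; not part of wpull.
    """
    # Map the title-cased form of each override back to its exact spelling.
    override_map = {item.title(): item for item in overrides}
    titled = name.title()
    return override_map.get(titled, titled)


# A small subset of NAME_OVERRIDES, used only for this example.
overrides = frozenset({'WARC-Record-ID', 'WARC-IP-Address', 'Content-Type'})

print(normalize_field_name('warc-record-id', overrides))  # WARC-Record-ID
print(normalize_field_name('warc-record-id'))             # Warc-Record-Id
print(normalize_field_name('x-custom-field', overrides))  # X-Custom-Field
```

Without the override set, title-casing alone would emit `Warc-Record-Id`, which a case-sensitive reader may not recognize.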
- REQUEST = 'request'
- RESPONSE = 'response'
- REVISIT = 'revisit'
- SAME_PAYLOAD_DIGEST_URI = 'http://netpreserve.org/warc/1.0/revisit/identical-payload-digest'
- TYPE_REQUEST = 'application/http;msgtype=request'
- TYPE_RESPONSE = 'application/http;msgtype=response'
- VERSION = 'WARC/1.0'
- WARCINFO = 'warcinfo'
- WARC_DATE = 'WARC-Date'
- WARC_FIELDS = 'application/warc-fields'
- WARC_RECORD_ID = 'WARC-Record-ID'
- WARC_TYPE = 'WARC-Type'
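To show where these constants end up, here is a rough sketch of how one WARC record is laid out on disk according to the WARC specification: a version line, the named fields, a blank line, the block, and two trailing CRLF pairs. The function `serialize_record` is hypothetical and is not wpull's serializer. A conforming record would also need WARC-Record-ID and WARC-Date, which are omitted here for brevity.

```python
VERSION = 'WARC/1.0'


def serialize_record(fields, block=b''):
    """Return the bytes of one WARC record (hypothetical helper)."""
    lines = [VERSION]
    lines.extend('{}: {}'.format(name, value) for name, value in fields)
    # The header ends with an empty line before the block.
    header = '\r\n'.join(lines).encode('utf-8') + b'\r\n\r\n'
    # Each record is terminated by two CRLF pairs after the block.
    return header + block + b'\r\n\r\n'


block = b'software: example\r\n'
record = serialize_record(
    [
        ('WARC-Type', 'warcinfo'),
        ('Content-Type', 'application/warc-fields'),
        ('Content-Length', str(len(block))),
    ],
    block,
)
print(record.decode('utf-8'))
```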
- compute_checksum(payload_offset: typing.Union=None)
  Compute and add the checksum data to the record fields.
  This function also sets the content length.
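A minimal sketch of what such a checksum step might produce is shown below. It assumes SHA-1 digests written in Base32 with a `sha1:` prefix, which is a common convention in WARC files. The page above does not state which algorithm is used, so treat that as an assumption. The helper `checksum_fields` is hypothetical and works on an in-memory block rather than a file object.

```python
import base64
import hashlib


def checksum_fields(block, payload_offset=None):
    """Return digest and length fields for a record block (illustrative only)."""

    def digest(data):
        # Assumed label format: "sha1:" followed by the Base32 digest.
        raw = hashlib.sha1(data).digest()
        return 'sha1:' + base64.b32encode(raw).decode('ascii')

    fields = {
        'WARC-Block-Digest': digest(block),
        'Content-Length': str(len(block)),
    }
    if payload_offset is not None:
        # The payload is the part of the block after the HTTP header.
        fields['WARC-Payload-Digest'] = digest(block[payload_offset:])
    return fields


block = b'HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello'
offset = block.index(b'\r\n\r\n') + 4
fields = checksum_fields(block, payload_offset=offset)
for name, value in fields.items():
    print('{}: {}'.format(name, value))
```

In this sketch, leaving `payload_offset` as None produces only the block digest and the content length.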
- get_http_header() → wpull.protocol.http.request.Response
  Return the HTTP header.
  It only attempts to read the first 4 KiB of the payload.
  Returns: An instance of http.request.Response, or None.
  Return type: Response, None
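The idea of peeking at only the first 4 KiB can be sketched with the standard library alone. The helper `peek_status_line` below is hypothetical: it returns a plain tuple instead of a Response object, and restoring the file position afterwards is a choice made for this sketch, not documented behavior.

```python
import io

PEEK_SIZE = 4096  # 4 KiB, the limit mentioned above


def peek_status_line(block_file):
    """Return (version, status_code, reason), or None if no HTTP status line is found."""
    if block_file is None:
        return None
    original_position = block_file.tell()
    try:
        data = block_file.read(PEEK_SIZE)
    finally:
        # Leave the file where it was so later readers see the whole block.
        block_file.seek(original_position)
    first_line = data.split(b'\r\n', 1)[0]
    parts = first_line.decode('latin-1').split(' ', 2)
    if len(parts) < 2 or not parts[0].startswith('HTTP/') or not parts[1].isdecimal():
        return None
    reason = parts[2] if len(parts) == 3 else ''
    return parts[0], int(parts[1]), reason


response_block = io.BytesIO(b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>')
print(peek_status_line(response_block))  # ('HTTP/1.1', 200, 'OK')
print(peek_status_line(io.BytesIO(b'not http')))  # None
```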
- set_common_fields(warc_type: str, content_type: str)
  Set the required fields for the record.
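As a rough illustration of what "required fields" could mean, the sketch below builds a record type, a content type, a UTC timestamp, and a UUID-based record ID. The exact set of fields this method populates is not stated above, so the selection here is an assumption, and `common_fields` is a hypothetical stand-in that returns a plain dict.

```python
import datetime
import uuid


def common_fields(warc_type, content_type):
    """Build the fields most WARC records carry (illustrative only)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        'WARC-Type': warc_type,
        'Content-Type': content_type,
        # WARC dates are UTC timestamps in ISO 8601 form.
        'WARC-Date': now.strftime('%Y-%m-%dT%H:%M:%SZ'),
        # Record IDs are URIs wrapped in angle brackets.
        'WARC-Record-ID': '<{}>'.format(uuid.uuid4().urn),
    }


fields = common_fields('response', 'application/http;msgtype=response')
for name, value in fields.items():
    print('{}: {}'.format(name, value))
```

The two arguments correspond to constants such as RESPONSE and TYPE_RESPONSE listed earlier on this page.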