API Reference¶

Document model

Model serialization and binary references

class warcat.model.binary.BytesSerializable[source]¶

Metaclass that indicates this object can be serialized to bytes

iter_bytes()[source]¶: Return an iterable of bytes

class warcat.model.binary.StrSerializable[source]¶

Metaclass that indicates this object can be serialized to str

iter_str()[source]¶: Return an iterable of str

class warcat.model.binary.BinaryFileRef[source]¶

Reference to a file containing the content block data.

file_offset¶: When reading, the file is seeked to file_offset.

length¶: The length of the data

filename¶: The filename of the referenced data. It must be a valid file.

file_obj¶: The file object to be read from. It is important that this file object is not shared or race conditions will occur. File objects are not closed automatically.

Note

Either filename or file_obj must be set.

get_file(safe=True, spool_size=10485760)[source]¶

Return a file object with the data.

Parameters:

safe –

If True, return a new file object that is a copy of the data. You will be responsible for closing the file.

Otherwise, it will be the original file object that is seeked to the correct offset. Be sure to not read beyond its length and seek back to the original position if necessary.

iter_file(buffer_size=4096)[source]¶: Return an iterable of bytes of the source data

set_file(file, offset=0, length=None)[source]¶

Set the reference to the file or filename with the data.

This is a convenience method that replaces setting the attributes individually.
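The safe mode of get_file() described above can be pictured with a small standard-library sketch. It is not Warcat's implementation: the helper name copy_region and the use of tempfile.SpooledTemporaryFile are assumptions chosen to illustrate copying length bytes from file_offset into a new file object that the caller must close.

```python
import io
import tempfile


def copy_region(file_obj, offset, length, spool_size=10485760, buffer_size=4096):
    """Copy `length` bytes starting at `offset` into a new spooled file.

    Hypothetical helper, not part of Warcat. The returned file is a copy,
    so the caller is responsible for closing it.
    """
    original_pos = file_obj.tell()
    file_obj.seek(offset)
    # Data stays in memory until it grows past spool_size, then spills to disk.
    dest = tempfile.SpooledTemporaryFile(max_size=spool_size)
    remaining = length
    while remaining > 0:
        chunk = file_obj.read(min(buffer_size, remaining))
        if not chunk:
            break  # source ended before `length` bytes were available
        dest.write(chunk)
        remaining -= len(chunk)
    # Leave the shared source file where we found it.
    file_obj.seek(original_pos)
    dest.seek(0)
    return dest


source = io.BytesIO(b"HEADERpayload-bytesTRAILER")
with copy_region(source, offset=6, length=13) as region:
    data = region.read()
```

Because the copy is independent of the source, reading it cannot run past the referenced length, which is the hazard the unsafe mode leaves to the caller.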

Content blocks and payload blocks

class warcat.model.block.ContentBlock[source]¶

iter_bytes()¶: Return an iterable of bytes

classmethod load(file_obj, length, content_type)[source]¶: Load and return BinaryBlock or BlockWithPayload

class warcat.model.block.BinaryBlock[source]¶

A content block that is octet data

get_file(safe=True, spool_size=10485760)¶

Return a file object with the data.

Parameters:

safe –

If True, return a new file object that is a copy of the data. You will be responsible for closing the file.

Otherwise, it will be the original file object that is seeked to the correct offset. Be sure to not read beyond its length and seek back to the original position if necessary.

iter_bytes()[source]¶

iter_file(buffer_size=4096)¶: Return an iterable of bytes of the source data

classmethod load(file_obj, length)[source]¶: Return a BinaryBlock using given file object

set_file(file, offset=0, length=None)¶

Set the reference to the file or filename with the data.

This is a convenience method that replaces setting the attributes individually.

class warcat.model.block.BlockWithPayload(fields=None, payload=None)[source]¶

A content block (fields/data) within a Record.

fields¶: Fields

payload¶: Payload

binary_block¶: If this block was loaded from a file, this attribute will be a BinaryBlock of the original file. Otherwise, this attribute is None.

iter_bytes()[source]¶

length¶: Return the new computed length

classmethod load(file_obj, length, field_cls)[source]¶

Return a BlockWithPayload

Parameters:

file_obj – The file object.
length – How much to read from the file.
field_cls – The class or subclass of `Fields`.

class warcat.model.block.Payload[source]¶

Data within a content block that has fields

get_file(safe=True, spool_size=10485760)¶

Return a file object with the data.

Parameters:

safe –

If True, return a new file object that is a copy of the data. You will be responsible for closing the file.

Otherwise, it will be the original file object that is seeked to the correct offset. Be sure to not read beyond its length and seek back to the original position if necessary.

iter_bytes()[source]¶

iter_file(buffer_size=4096)¶: Return an iterable of bytes of the source data

set_file(file, offset=0, length=None)¶

Set the reference to the file or filename with the data.

This is a convenience method that replaces setting the attributes individually.

Constants and things

warcat.model.common.FIELD_DELIM_BYTES = b'\r\n\r\n'¶: Bytes CR LF CR LF

warcat.model.common.NEWLINE = '\r\n'¶: String CR LF

warcat.model.common.NEWLINE_BYTES = b'\r\n'¶: Bytes CR LF
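As a rough illustration of how these delimiters are typically used, the sketch below splits a raw block into its field lines and body. The constants are redefined locally with the values listed above so the snippet runs without Warcat installed; the sample bytes are made up.

```python
# Local copies of the values documented above, so no import is required.
FIELD_DELIM_BYTES = b"\r\n\r\n"
NEWLINE_BYTES = b"\r\n"

# Made-up sample data: a version line, two named fields, then the body.
raw = b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 5\r\n\r\nhello"

# Everything before the first CR LF CR LF is the field section.
head, delim, body = raw.partition(FIELD_DELIM_BYTES)
lines = head.split(NEWLINE_BYTES)
```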

Named fields

class warcat.model.field.Fields(field_list=None)[source]¶

Name and value pseudo-map list

Behaves like a dict or mutable mapping. Mutable mapping operations remove any duplicates in the field list.

add(name, value)[source]¶: Append a name-value field to the list

clear()[source]¶

count(name)[source]¶: Count the number of times this name occurs in the list

get(name, default=None)[source]¶

get_list(name)[source]¶: Return a list of values

index(name)[source]¶: Return the index of the first occurrence of the given name

iter_bytes()[source]¶

iter_str()[source]¶

classmethod join_multilines(value, lines)[source]¶: Scan for multiline value which is prefixed with a space or tab

keys()[source]¶

list()[source]¶: Return the underlying list

classmethod parse(s, newline='\r\n')[source]¶: Parse a named field string and return a Fields

values()[source]¶
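The pseudo-map behaviour can be sketched with plain tuples. This is an illustration only, not the Fields implementation: the helper names parse_fields, get_list and set_value are hypothetical, joining folded lines with a single space is an assumption, and so is the position at which a replaced field ends up.

```python
def parse_fields(s, newline="\r\n"):
    """Parse 'Name: value' lines into a list of (name, value) tuples."""
    fields = []
    for line in s.split(newline):
        if not line:
            continue
        if line[0] in " \t" and fields:
            # A line starting with a space or tab continues the previous value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, _, value = line.partition(":")
        fields.append((name.strip(), value.strip()))
    return fields


def get_list(fields, name):
    """Return every value stored under `name`, in order."""
    return [value for field_name, value in fields if field_name == name]


def set_value(fields, name, value):
    """Mapping-style assignment: drop all duplicates, keep a single field."""
    kept = [(n, v) for n, v in fields if n != name]
    kept.append((name, value))
    return kept


text = (
    "WARC-Type: response\r\n"
    "X-Note: first part\r\n"
    "\tsecond part\r\n"
    "X-Note: again\r\n"
    "Content-Length: 42\r\n"
)
fields = parse_fields(text)
notes = get_list(fields, "X-Note")
replaced = set_value(fields, "X-Note", "only")
```

The list form keeps repeated names, while the mapping-style assignment collapses them, which mirrors the distinction between add() and dict-like access described above.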

class warcat.model.field.HTTPHeader(field_list=None, status=None)[source]¶

Fields extended with a HTTP status attribute.

status¶: A str containing the HTTP status code and message.

add(name, value)¶: Append a name-value field to the list

clear()¶

count(name)¶: Count the number of times this name occurs in the list

get(name, default=None)¶

get_list(name)¶: Return a list of values

index(name)¶: Return the index of the first occurrence of the given name

iter_bytes()¶

iter_str()[source]¶

join_multilines(value, lines)¶: Scan for multiline value which is prefixed with a space or tab

keys()¶

list()¶: Return the underlying list

classmethod parse(s, newline='\r\n')[source]¶

status_code¶

values()¶

warcat.model.field.HTTPHeaders¶

Deprecated since version 2.1.1.

Name uses wrong inflection. Use HTTPHeader instead.

alias of HTTPHeader

class warcat.model.field.Header(version='1.0', fields=None)[source]¶

A header of a WARC Record.

version¶: A str containing the version

fields¶: The Fields object.

VERSION = '1.0'¶

iter_bytes()[source]¶

iter_str()[source]¶

classmethod parse(b)[source]¶: Parse from bytes and return Header

A WARC record

class warcat.model.record.Record(header=None, content_block=None)[source]¶

A WARC Record within a WARC file.

header¶: Header

content_block¶: A BinaryBlock or BlockWithPayload

file_offset¶: If this record was loaded from a file, this attribute contains an int describing the location of the record in the file.

content_length¶

date¶

iter_bytes()[source]¶

classmethod load(file_obj, preserve_block=False, check_block_length=True)[source]¶

Parse and return a Record

Parameters:

file_obj – A file-like object.
preserve_block – If True, content blocks are not parsed for fields and payloads. Enabling this feature ensures preservation of content length and hash digests.
check_block_length – If True, the length of each block is checked against the length of the version serialized by Warcat. This can be useful for checking whether Warcat will output blocks with correct whitespace.

record_id¶

warc_type¶

WARC model starting point

class warcat.model.warc.WARC[source]¶

A Web ARChive file model.

Typically, large streaming operations should use the open() and read_record() methods.

iter_bytes()[source]¶

load(filename)[source]¶

Open and load the contents of the given filename.

The loaded records are stored in the records attribute.

classmethod open(filename, force_gzip=False)[source]¶

Return a logical file object.

Parameters:

filename – The path of the file. gzip compression is detected using the file extension.
force_gzip – Always use gzip compression.

read_file_object(file_object)[source]¶: Read records until the file object is exhausted

classmethod read_record(file_object, preserve_block=False, check_block_length=True)[source]¶

Return a record and whether there are more records to read.

See also

Record

Returns:	A tuple. The first item is the `Record`. The second item is a boolean indicating whether there are more records to be read.
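The (record, has_more) return value lends itself to a simple streaming loop. The sketch below is written against that contract only: iter_records is a hypothetical helper, and fake_read_record is a stand-in that treats each line as one record so the snippet runs without a WARC file. In real use, WARC.read_record would be passed in its place.

```python
import io


def iter_records(read_record, file_object):
    """Yield records from a reader that returns (record, has_more) tuples.

    Assumes the file holds at least one record.
    """
    has_more = True
    while has_more:
        record, has_more = read_record(file_object)
        yield record


def fake_read_record(file_object):
    """Stand-in reader for the demo: one CR LF terminated line per record."""
    line = file_object.readline()
    # Peek one byte to learn whether anything follows, then step back.
    position = file_object.tell()
    has_more = bool(file_object.read(1))
    file_object.seek(position)
    return line.rstrip(b"\r\n"), has_more


stream = io.BytesIO(b"one\r\ntwo\r\nthree\r\n")
records = list(iter_records(fake_read_record, stream))
```

Processing one record at a time this way avoids holding the whole archive in memory, which is the reason the class description recommends open() and read_record() for large files.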

Archive process tools

class warcat.tool.BaseIterateTool(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶

Base class for iterating through records

action(record)[source]¶

postprocess()[source]¶

preprocess()[source]¶

process()[source]¶

class warcat.tool.ConcatTool(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶

action(record)[source]¶

postprocess()¶

preprocess()[source]¶

process()¶

class warcat.tool.ExtractTool(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶

action(record)[source]¶

postprocess()¶

preprocess()¶

process()¶

class warcat.tool.ListTool(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶

action(record)[source]¶

postprocess()¶

preprocess()¶

process()¶

class warcat.tool.SplitTool(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶

action(record)[source]¶

postprocess()¶

preprocess()¶

process()¶

exception warcat.tool.VerifyProblem(message, iso_section=None, major=True)[source]¶

args¶

iso_section¶

major¶

message¶

with_traceback()¶: Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class warcat.tool.VerifyTool(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶

MANDATORY_FIELDS = ['WARC-Record-ID', 'Content-Length', 'WARC-Date', 'WARC-Type']¶

action(record)[source]¶

check_transfer_encoding(record)[source]¶

postprocess()¶

preprocess()[source]¶

process()¶

verify_block_digest(record)[source]¶

verify_concurrent_to(record)[source]¶

verify_content_type(record)[source]¶

verify_filename(record)[source]¶

verify_id_no_whitespace(record)[source]¶

verify_id_uniqueness(record)[source]¶

verify_mandatory_fields(record)[source]¶

verify_payload_digest(record)[source]¶

verify_profile(record)[source]¶

verify_refers_to(record)[source]¶

verify_segment_origin_id(record)[source]¶

verify_segment_total_length(record)[source]¶

verify_target_uri(record)[source]¶

verify_warcinfo_id(record)[source]¶

Version info

warcat.version.short_version = '2.2'¶: Short version in the form of N.N

Verification helpers

warcat.verify.parse_digest_field(s)[source]¶: Return the algorithm name and digest bytes

warcat.verify.verify_block_digest(record)[source]¶: Return True if the content block hash digest is valid

warcat.verify.verify_payload_digest(record)[source]¶: Return True if the payload hash digest is valid
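A labelled digest check of this kind can be sketched with the standard library. The helper names below are hypothetical and the code is not Warcat's implementation. It assumes the field value has the form algorithm:value with a Base32-encoded digest, which is common in WARC files but is an assumption here.

```python
import base64
import hashlib


def parse_labelled_digest(s):
    """Split 'algorithm:value' into (algorithm name, raw digest bytes).

    Assumes the value is Base32 encoded.
    """
    algorithm, _, encoded = s.partition(":")
    return algorithm.strip().lower(), base64.b32decode(encoded.strip())


def digest_matches(data, field_value):
    """Return True if hashing `data` reproduces the digest in the field."""
    algorithm, expected = parse_labelled_digest(field_value)
    return hashlib.new(algorithm, data).digest() == expected


payload = b"hello world"
# Build a field value for the demo instead of hard-coding a digest string.
field_value = "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

ok = digest_matches(payload, field_value)
bad = digest_matches(b"tampered", field_value)
```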

Utility functions

class warcat.util.DiskBufferedReader(raw, disk_buffer_size=104857600, spool_size=10485760)[source]¶

Buffers the file to disk in large parts at a time

close()¶

Flush and close the IO object.

This method has no effect if the file is already closed.

closed¶

detach()¶

Disconnect this buffer from its underlying raw stream and return it.

After the raw stream has been detached, the buffer is in an unusable state.

fileno()[source]¶

flush()¶

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

isatty()[source]¶

mode¶

name¶

peek(n=0)[source]¶

raw¶

read(n=None)[source]¶

read1()¶

Read and return up to n bytes, with at most one read() call to the underlying raw stream. A short result does not imply that EOF is imminent.

Returns an empty bytes object on EOF.

readable()[source]¶

readinto()¶

readinto1()¶

readline()¶

Read and return a line from the stream.

If size is specified, at most size bytes will be read.

The line terminator is always b'\n' for binary files; for text files, the newline argument to open() can be used to select the line terminator(s) recognized.

readlines()¶

Return a list of lines from the stream.

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek(pos, whence=0)[source]¶

seekable()[source]¶

tell()[source]¶

truncate()¶

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.

writable()[source]¶

write()¶

Write the given buffer to the IO stream.

Returns the number of bytes written, which is always the length of b in bytes.

Raises BlockingIOError if the buffer is full and the underlying raw stream cannot accept more data at the moment.

writelines()¶

class warcat.util.FileCache(size=4)[source]¶

A cache containing references to file objects.

File objects are closed when expired. Class is thread safe and will only return file objects belonging to its own thread.

get(filename)[source]¶

put(filename, file_obj)[source]¶
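One way to obtain the behaviour described for FileCache (bounded size, expired files closed, entries visible only to the thread that stored them) is a per-thread ordered dictionary. The class below is a hypothetical sketch under that assumption, not the Warcat implementation, and the least-recently-used eviction order is also an assumption.

```python
import collections
import io
import threading


class SmallFileCache:
    """Per-thread cache that closes file objects when they are evicted."""

    def __init__(self, size=4):
        self._size = size
        self._local = threading.local()

    def _entries(self):
        # Each thread lazily gets its own ordered mapping.
        if not hasattr(self._local, "entries"):
            self._local.entries = collections.OrderedDict()
        return self._local.entries

    def get(self, filename):
        entries = self._entries()
        file_obj = entries.get(filename)
        if file_obj is not None:
            entries.move_to_end(filename)  # mark as recently used
        return file_obj

    def put(self, filename, file_obj):
        entries = self._entries()
        entries[filename] = file_obj
        entries.move_to_end(filename)
        while len(entries) > self._size:
            _, expired = entries.popitem(last=False)
            expired.close()


cache = SmallFileCache(size=2)
file_a, file_b, file_c = io.BytesIO(b"a"), io.BytesIO(b"b"), io.BytesIO(b"c")
cache.put("a.warc", file_a)
cache.put("b.warc", file_b)
cache.put("c.warc", file_c)  # evicts and closes the oldest entry
```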

class warcat.util.HTTPSocketShim[source]¶

close()¶: Disable all I/O operations.

closed¶: True if the file is closed.

detach()¶

Disconnect this buffer from its underlying raw stream and return it.

After the raw stream has been detached, the buffer is in an unusable state.

fileno()¶

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

flush()¶: Does nothing.

getbuffer()¶: Get a read-write view over the contents of the BytesIO object.

getvalue()¶: Retrieve the entire contents of the BytesIO object.

isatty()¶

Always returns False.

BytesIO objects are not connected to a TTY-like device.

makefile(*args, **kwargs)[source]¶

read()¶

Read at most size bytes, returned as a bytes object.

If the size argument is negative, read until EOF is reached. Return an empty bytes object at EOF.

read1()¶

Read at most size bytes, returned as a bytes object.

If the size argument is negative or omitted, read until EOF is reached. Return an empty bytes object at EOF.

readable()¶: Returns True if the IO object can be read.

readinto()¶

Read bytes into buffer.

Returns number of bytes read (0 for EOF), or None if the object is set not to block and has no data to read.

readinto1()¶

readline()¶

Next line from the file, as a bytes object.

Retain newline. A non-negative size argument limits the maximum number of bytes to return (an incomplete line may be returned then). Return an empty bytes object at EOF.

readlines()¶

List of bytes objects, each a line from the file.

Call readline() repeatedly and return a list of the lines so read. The optional size argument, if given, is an approximate bound on the total number of bytes in the lines returned.

seek()¶

Change stream position.

Seek to byte offset pos relative to the position indicated by whence:

0 – Start of stream (the default). pos should be >= 0.
1 – Current position. pos may be negative.
2 – End of stream. pos is usually negative.

Returns the new absolute position.

seekable()¶: Returns True if the IO object can be seeked.

tell()¶: Current file position, an integer.

truncate()¶

Truncate the file to at most size bytes.

Size defaults to the current file position, as returned by tell(). The current file position is unchanged. Returns the new size.

writable()¶: Returns True if the IO object can be written.

write()¶

Write bytes to file.

Return the number of bytes written.

writelines()¶

Write lines to the file.

Note that newlines are not added. lines can be any iterable object producing bytes-like objects. This is equivalent to calling write() for each element.

warcat.util.append_index_filename(path)[source]¶

Adds _index_xxxxxx to the path.

It uses the basename (the filename) of the path to generate the hex hash digest suffix.

warcat.util.copyfile_obj(source, dest, bufsize=4096, max_length=None, write_attr_name='write')[source]¶: Like shutil.copyfileobj() but with limit on how much to copy
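A length-limited copy in the spirit of copyfile_obj() might look like the following. The function copy_limited is a hypothetical stand-in, not Warcat's code. It omits the write_attr_name option, and returning the number of bytes copied is an assumption made for the demo.

```python
import io


def copy_limited(source, dest, bufsize=4096, max_length=None):
    """Copy from source to dest, stopping after max_length bytes if given."""
    copied = 0
    while True:
        if max_length is None:
            to_read = bufsize
        else:
            to_read = min(bufsize, max_length - copied)
            if to_read <= 0:
                break  # limit reached
        chunk = source.read(to_read)
        if not chunk:
            break  # source exhausted
        dest.write(chunk)
        copied += len(chunk)
    return copied


src = io.BytesIO(b"0123456789")
dst = io.BytesIO()
count = copy_limited(src, dst, bufsize=3, max_length=4)
```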

warcat.util.file_cache = <warcat.util.FileCache object>¶: The FileCache instance

warcat.util.find_file_pattern(file_obj, pattern, bufsize=512, limit=4096, inclusive=False)[source]¶: Find the offset from current position of pattern
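Searching a stream in fixed-size reads has one subtlety: the pattern may straddle two reads. The sketch below shows one way to handle that by carrying over a short tail between reads. The function find_pattern is hypothetical, and several details are assumptions rather than documented Warcat behaviour: inclusive is taken to mean that the returned offset includes the pattern's length, None is returned when nothing is found within limit bytes, and the file position is restored afterwards.

```python
import io


def find_pattern(file_obj, pattern, bufsize=512, limit=4096, inclusive=False):
    """Return the offset of pattern relative to the current position."""
    start = file_obj.tell()
    window = b""
    consumed = 0  # total bytes read from the file so far
    try:
        while consumed < limit:
            chunk = file_obj.read(min(bufsize, limit - consumed))
            if not chunk:
                break
            # Offset (relative to start) of the first byte held in window.
            window_start = consumed - len(window)
            window += chunk
            consumed += len(chunk)
            index = window.find(pattern)
            if index != -1:
                offset = window_start + index
                return offset + len(pattern) if inclusive else offset
            # Keep just enough tail to catch a pattern split across reads.
            window = window[-(len(pattern) - 1):] if len(pattern) > 1 else b""
        return None
    finally:
        file_obj.seek(start)


# The delimiter is placed so that it straddles the first 512-byte read.
data = io.BytesIO(b"A" * 510 + b"\r\n\r\n" + b"body")
offset = find_pattern(data, b"\r\n\r\n")
end_offset = find_pattern(data, b"\r\n\r\n", inclusive=True)
missing = find_pattern(data, b"not-there")
```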

warcat.util.parse_http_date(s)[source]¶

warcat.util.parse_http_response(file_obj)[source]¶: Parse and return http.client.HTTPResponse

warcat.util.printable_str_to_str(s)[source]¶

warcat.util.rename_filename_dirs(dest_filename)[source]¶

Renames files if they conflict with a directory in given path.

If a file has the same name as the directory, the file is renamed using append_index_filename().

warcat.util.sanitize_str(s)[source]¶: Replaces unsavory characters in the string with an underscore

warcat.util.split_url_to_filename(s)[source]¶: Attempt to split a URL to a filename on disk

warcat.util.strip_warc_extension(s)[source]¶: Removes .warc or .warc.gz from filename

warcat.util.truncate_filename_parts(path_parts, length=160)[source]¶

Truncate and suffix filename path parts if they exceed the given length.

If the filename part is too long, the part is truncated and an underscore plus a 6 letter hex (_xxxxxx) suffix is appended.
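The truncation scheme can be illustrated as follows. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: the suffix is derived from a SHA-1 hash of the original part, and the truncated part (suffix included) is cut to exactly length characters.

```python
import hashlib


def truncate_part(part, length=160):
    """Shorten one path part, appending '_' plus 6 hex digits if truncated."""
    if len(part) <= length:
        return part
    # Assumption: the hex digits come from a SHA-1 hash of the full part.
    suffix = "_" + hashlib.sha1(part.encode("utf-8")).hexdigest()[:6]
    return part[:length - len(suffix)] + suffix


def truncate_parts(path_parts, length=160):
    """Apply truncate_part to every part of a path."""
    return [truncate_part(part, length) for part in path_parts]


parts = truncate_parts(["example.com", "a" * 300, "index.html"])
```

Deriving the suffix from the untruncated name keeps two long names that share a prefix from colliding after truncation.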