API Reference
Document model
Model serialization and binary references
-
class
warcat.model.binary.
BytesSerializable
[source]¶ Metaclass that indicates this object can be serialized to bytes
-
class
warcat.model.binary.
StrSerializable
[source]¶ Metaclass that indicates this object can be serialized to str
-
class
warcat.model.binary.
BinaryFileRef
[source]¶ Reference to a file containing the content block data.
-
file_offset
¶ When reading, the file is seeked to file_offset.
-
length
¶ The length of the data
-
filename
¶ The filename of the referenced data. It must be a valid file.
-
file_obj
¶ The file object to be read from. It is important that this file object is not shared or race conditions will occur. File objects are not closed automatically.
-
get_file
(safe=True, spool_size=10485760)[source]¶ Return a file object with the data.
Parameters: safe – If True, return a new file object that is a copy of the data. You will be responsible for closing the file.
Otherwise, it will be the original file object that is seeked to the correct offset. Be sure to not read beyond its length and seek back to the original position if necessary.
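The difference between the two modes of get_file() can be illustrated with a minimal sketch that uses only the standard library. The helper name `get_slice_file` and its chunk size are hypothetical; this is not Warcat's implementation, only one way the documented behaviour could look, assuming the copy is spooled to a temporary file.

```python
import io
import tempfile


def get_slice_file(file_obj, file_offset, length, safe=True, spool_size=10485760):
    """Return a file object positioned at the referenced data (hypothetical helper)."""
    file_obj.seek(file_offset)
    if not safe:
        # Unsafe mode: hand back the shared object itself. The caller must
        # not read more than `length` bytes and should restore the position.
        return file_obj
    # Safe mode: copy the slice into a spooled temporary file, which stays
    # in memory up to `spool_size` bytes and moves to disk beyond that.
    copy = tempfile.SpooledTemporaryFile(max_size=spool_size)
    remaining = length
    while remaining > 0:
        chunk = file_obj.read(min(4096, remaining))
        if not chunk:
            break  # source ended before `length` bytes were available
        copy.write(chunk)
        remaining -= len(chunk)
    copy.seek(0)
    return copy


source = io.BytesIO(b"HEADERpayload-bytesTRAILER")

# Safe copy: independent of `source`, closed by the caller (here via `with`).
with get_slice_file(source, file_offset=6, length=13) as safe_copy:
    data = safe_copy.read()

# Unsafe access: same object as `source`, so the read is bounded by hand.
raw = get_slice_file(source, 6, 13, safe=False)
raw_data = raw.read(13)
```

The safe copy costs an extra pass over the data, but it can be read to EOF without running into the bytes that follow the block.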
-
Content blocks and payload blocks
-
class
warcat.model.block.
ContentBlock
[source]¶ -
iter_bytes
()¶ Return an iterable of bytes
-
classmethod
load
(file_obj, length, content_type)[source]¶ Load and return a BinaryBlock or BlockWithPayload
-
-
class
warcat.model.block.
BinaryBlock
[source]¶ A content block that is octet data
-
get_file
(safe=True, spool_size=10485760)¶ Return a file object with the data.
Parameters: safe – If True, return a new file object that is a copy of the data. You will be responsible for closing the file.
Otherwise, it will be the original file object that is seeked to the correct offset. Be sure to not read beyond its length and seek back to the original position if necessary.
-
iter_file
(buffer_size=4096)¶ Return an iterable of bytes of the source data
-
classmethod
load
(file_obj, length)[source]¶ Return a BinaryBlock using the given file object
-
set_file
(file, offset=0, length=None)¶ Set the reference to the file or filename with the data.
This is a convenience alternative to setting the attributes individually.
-
-
class
warcat.model.block.
BlockWithPayload
(fields=None, payload=None)[source]¶ A content block (fields/data) within a Record.
-
fields
¶ Fields
-
binary_block
¶ If this block was loaded from a file, this attribute will be a
BinaryBlock
of the original file. Otherwise, this attribute is None.
-
length
¶ Return the new computed length
-
classmethod
load
(file_obj, length, field_cls)[source]¶ Return a BlockWithPayload
Parameters:
- file_obj – The file object
- length – How much to read from the file
- field_cls – The class or subclass of Fields
-
-
class
warcat.model.block.
Payload
[source]¶ Data within a content block that has fields
-
get_file
(safe=True, spool_size=10485760)¶ Return a file object with the data.
Parameters: safe – If True, return a new file object that is a copy of the data. You will be responsible for closing the file.
Otherwise, it will be the original file object that is seeked to the correct offset. Be sure to not read beyond its length and seek back to the original position if necessary.
-
iter_file
(buffer_size=4096)¶ Return an iterable of bytes of the source data
-
set_file
(file, offset=0, length=None)¶ Set the reference to the file or filename with the data.
This is a convenience alternative to setting the attributes individually.
-
Constants and things
-
warcat.model.common.
FIELD_DELIM_BYTES
= b'\r\n\r\n'¶ Bytes CR LF CR LF
-
warcat.model.common.
NEWLINE
= '\r\n'¶ String CR LF
-
warcat.model.common.
NEWLINE_BYTES
= b'\r\n'¶ Bytes CR LF
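As a small illustration of how these delimiters are typically used, the sketch below splits a serialized record into its header lines and body. The constants are redefined locally with the values documented above so that the example runs without Warcat installed; the sample record is made up.

```python
# Local copies of the documented values (normally imported from warcat.model.common).
FIELD_DELIM_BYTES = b'\r\n\r\n'
NEWLINE_BYTES = b'\r\n'

# A made-up, minimal serialized record.
raw = (b"WARC/1.0\r\n"
       b"WARC-Type: warcinfo\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")

# The first CR LF CR LF separates the header fields from the content block.
header_bytes, _, body = raw.partition(FIELD_DELIM_BYTES)

# Each header line ends with a single CR LF.
header_lines = header_bytes.split(NEWLINE_BYTES)
```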
Named fields
-
class
warcat.model.field.
Fields
(field_list=None)[source]¶ Name and value pseudo-map list
Behaves like a dict or mutable mapping. Mutable mapping operations remove any duplicates in the field list.
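The "pseudo-map list" idea can be sketched as follows. `FieldList` is a hypothetical stand-in written for this example, not the Fields class itself, and it assumes exact (case-sensitive) name matching for simplicity. It shows how list-style access keeps duplicates while mapping-style assignment removes them.

```python
class FieldList:
    """Hypothetical name/value list with mapping-style access."""

    def __init__(self, field_list=None):
        self._fields = list(field_list or [])

    def add(self, name, value):
        # List-style append: duplicates of `name` are allowed.
        self._fields.append((name, value))

    def get_list(self, name):
        # Every value stored under `name`, in insertion order.
        return [v for n, v in self._fields if n == name]

    def __getitem__(self, name):
        # Mapping-style read: the first matching value.
        for n, v in self._fields:
            if n == name:
                return v
        raise KeyError(name)

    def __setitem__(self, name, value):
        # Mapping-style write: drop every existing entry for `name`,
        # then store a single new one.
        self._fields = [(n, v) for n, v in self._fields if n != name]
        self._fields.append((name, value))


fields = FieldList()
fields.add("Set-Cookie", "a=1")
fields.add("Set-Cookie", "b=2")
dupes = fields.get_list("Set-Cookie")

fields["Set-Cookie"] = "c=3"  # collapses the duplicates
after = fields.get_list("Set-Cookie")
```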
-
class
warcat.model.field.
HTTPHeader
(field_list=None, status=None)[source]¶ Fields extended with an HTTP status attribute.
-
status
¶ The str of the HTTP status message and code.
-
add
(name, value)¶ Append a name-value field to the list
-
clear
()¶
-
count
(name)¶ Count the number of times this name occurs in the list
-
get
(name, default=None)¶
-
get_list
(name)¶ Return a list of values
-
index
(name)¶ Return the index of the first occurrence of given name
-
iter_bytes
()¶
-
join_multilines
(value, lines)¶ Scan for multiline value which is prefixed with a space or tab
-
keys
()¶
-
list
()¶ Return the underlying list
-
status_code
¶
-
values
()¶
-
-
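The multiline handling mentioned for join_multilines() refers to header values that continue on following lines beginning with a space or tab. A minimal sketch of that unfolding is shown here. The function `unfold_header_lines` is hypothetical and does not share the signature of join_multilines(); joining the parts with a single space is an assumption of this example.

```python
def unfold_header_lines(lines):
    """Merge continuation lines (leading space or tab) into the previous field."""
    fields = []
    for line in lines:
        if line[:1] in (" ", "\t") and fields:
            # Continuation: append the stripped text to the previous value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
        else:
            # New field: split on the first colon only.
            name, _, value = line.partition(":")
            fields.append((name.strip(), value.strip()))
    return fields


lines = [
    "WARC-Type: response",
    "X-Long-Value: first part",
    "\tsecond part",
    "  third part",
    "Content-Length: 0",
]
unfolded = unfold_header_lines(lines)
```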
warcat.model.field.
HTTPHeaders
¶ Deprecated since version 2.1.1: The name uses the wrong inflection. Use HTTPHeader instead.
Alias of HTTPHeader
-
class
warcat.model.field.
Header
(version='1.0', fields=None)[source]¶ A header of a WARC Record.
-
version
¶ A str containing the version
-
VERSION
= '1.0'¶
-
A WARC record
-
class
warcat.model.record.
Record
(header=None, content_block=None)[source]¶ A WARC Record within a WARC file.
-
header
¶ Header
-
content_block
¶ A BinaryBlock or BlockWithPayload
-
file_offset
¶ If this record was loaded from a file, this attribute contains an int describing the location of the record in the file.
-
content_length
¶
-
date
¶
-
classmethod
load
(file_obj, preserve_block=False, check_block_length=True)[source]¶ Parse and return a Record
Parameters:
- file_obj – A file-like object.
- preserve_block – If True, content blocks are not parsed for fields and payloads. Enabling this feature ensures preservation of content length and hash digests.
- check_block_length – If True, the length of each block is checked against the version serialized by Warcat. This can be useful for checking whether Warcat will output blocks with correct whitespace.
-
record_id
¶
-
warc_type
¶
-
WARC model starting point
-
class
warcat.model.warc.
WARC
[source]¶ A Web ARChive file model.
Typically, large streaming operations should use the open() and read_record() functions.
-
load
(filename)[source]¶ Open and load the contents of the given filename.
The records are located in records.
-
Archive process tools
-
class
warcat.tool.
BaseIterateTool
(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶ Base class for iterating through records
-
class
warcat.tool.
ConcatTool
(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶ -
-
postprocess
()¶
-
process
()¶
-
-
class
warcat.tool.
ExtractTool
(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶ -
-
postprocess
()¶
-
preprocess
()¶
-
process
()¶
-
-
class
warcat.tool.
ListTool
(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶ -
-
postprocess
()¶
-
preprocess
()¶
-
process
()¶
-
-
class
warcat.tool.
SplitTool
(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶ -
-
postprocess
()¶
-
preprocess
()¶
-
process
()¶
-
-
exception
warcat.tool.
VerifyProblem
(message, iso_section=None, major=True)[source]¶ -
args
¶
-
iso_section
¶
-
major
¶
-
message
¶
-
with_traceback
()¶ Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
-
-
class
warcat.tool.
VerifyTool
(filenames, out_file=None, write_gzip=False, force_read_gzip=None, read_record_ids=None, preserve_block=True, out_dir=None, print_progress=False, keep_going=False)[source]¶ -
MANDATORY_FIELDS
= ['WARC-Record-ID', 'Content-Length', 'WARC-Date', 'WARC-Type']¶
-
postprocess
()¶
-
process
()¶
-
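A check against VerifyTool.MANDATORY_FIELDS could look roughly like the sketch below. The helper `missing_mandatory_fields` is hypothetical and is not part of the tool; the list is copied locally from the value documented above, and names are compared case-insensitively on the assumption that WARC field names are case-insensitive.

```python
# Local copy of the documented VerifyTool.MANDATORY_FIELDS value.
MANDATORY_FIELDS = ['WARC-Record-ID', 'Content-Length', 'WARC-Date', 'WARC-Type']


def missing_mandatory_fields(field_names):
    """Return the mandatory names absent from `field_names` (hypothetical helper)."""
    present = {name.lower() for name in field_names}
    return [name for name in MANDATORY_FIELDS if name.lower() not in present]


header_names = ['WARC-Type', 'warc-date', 'Content-Length']
missing = missing_mandatory_fields(header_names)
```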
Version info
-
warcat.version.
short_version
= '2.2'¶ Short version in the form of N.N
Verification helpers
-
warcat.verify.
verify_block_digest
(record)[source]¶ Return True if the content block hash digest is valid
-
warcat.verify.
verify_payload_digest
(record)[source]¶ Return True if the payload hash digest is valid
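What such a digest check involves can be sketched with the standard library alone. The example assumes the labelled form `algorithm:value` with a Base32-encoded value (for instance `sha1:...`), which is common in WARC files; whether the functions above accept other encodings is not stated here. `block_digest_matches` is a hypothetical helper that works on raw bytes rather than on a Record.

```python
import base64
import hashlib


def block_digest_matches(block_bytes, digest_field):
    """Compare bytes against a labelled digest such as 'sha1:BASE32' (sketch)."""
    algorithm, _, expected = digest_field.partition(":")
    try:
        hasher = hashlib.new(algorithm.lower())
    except ValueError:
        return False  # algorithm not supported by hashlib
    hasher.update(block_bytes)
    actual = base64.b32encode(hasher.digest()).decode("ascii")
    return actual == expected.upper()


block = b"hello world"
# Build a matching label for the demonstration data.
label = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")

ok = block_digest_matches(block, label)
bad = block_digest_matches(block + b"!", label)
```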
Utility functions
-
class
warcat.util.
DiskBufferedReader
(raw, disk_buffer_size=104857600, spool_size=10485760)[source]¶ Buffers the file to disk large parts at a time
-
close
()¶ Flush and close the IO object.
This method has no effect if the file is already closed.
-
closed
¶
-
detach
()¶ Disconnect this buffer from its underlying raw stream and return it.
After the raw stream has been detached, the buffer is in an unusable state.
-
flush
()¶ Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
-
mode
¶
-
name
¶
-
raw
¶
-
read1
()¶ Read and return up to n bytes, with at most one read() call to the underlying raw stream. A short result does not imply that EOF is imminent.
Returns an empty bytes object on EOF.
-
readinto
()¶
-
readinto1
()¶
-
readline
()¶ Read and return a line from the stream.
If size is specified, at most size bytes will be read.
The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized.
-
readlines
()¶ Return a list of lines from the stream.
hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
-
truncate
()¶ Truncate file to size bytes.
File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.
-
write
()¶ Write the given buffer to the IO stream.
Returns the number of bytes written, which is always the length of b in bytes.
Raises BlockingIOError if the buffer is full and the underlying raw stream cannot accept more data at the moment.
-
writelines
()¶
-
-
class
warcat.util.
FileCache
(size=4)[source]¶ A cache containing references to file objects.
File objects are closed when expired. Class is thread safe and will only return file objects belonging to its own thread.
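The behaviour described for FileCache (bounded size, closing on expiry, per-thread visibility) can be approximated as in the sketch below. `SmallFileCache` and its `get`/`put` methods are hypothetical names chosen for this example; using thread-local storage to isolate threads is an assumption, not a description of how FileCache is implemented.

```python
import collections
import io
import threading


class SmallFileCache:
    """Hypothetical bounded cache of file objects, kept separately per thread."""

    def __init__(self, size=4):
        self._size = size
        self._local = threading.local()

    def _entries(self):
        # Each thread lazily gets its own ordered mapping.
        if not hasattr(self._local, "entries"):
            self._local.entries = collections.OrderedDict()
        return self._local.entries

    def get(self, key):
        entries = self._entries()
        if key not in entries:
            return None
        entries.move_to_end(key)  # mark as most recently used
        return entries[key]

    def put(self, key, file_obj):
        entries = self._entries()
        entries[key] = file_obj
        entries.move_to_end(key)
        while len(entries) > self._size:
            # Expire the least recently used entry and close its file.
            _, expired = entries.popitem(last=False)
            expired.close()


cache = SmallFileCache(size=2)
files = [io.BytesIO(b"a"), io.BytesIO(b"b"), io.BytesIO(b"c")]
for index, file_obj in enumerate(files):
    cache.put(index, file_obj)

# A different thread has its own, empty view of the cache.
seen_in_other_thread = []
worker = threading.Thread(target=lambda: seen_in_other_thread.append(cache.get(2)))
worker.start()
worker.join()
```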
-
class
warcat.util.
HTTPSocketShim
[source]¶ -
close
()¶ Disable all I/O operations.
-
closed
¶ True if the file is closed.
-
detach
()¶ Disconnect this buffer from its underlying raw stream and return it.
After the raw stream has been detached, the buffer is in an unusable state.
-
fileno
()¶ Returns underlying file descriptor if one exists.
OSError is raised if the IO object does not use a file descriptor.
-
flush
()¶ Does nothing.
-
getbuffer
()¶ Get a read-write view over the contents of the BytesIO object.
-
getvalue
()¶ Retrieve the entire contents of the BytesIO object.
-
isatty
()¶ Always returns False.
BytesIO objects are not connected to a TTY-like device.
-
read
()¶ Read at most size bytes, returned as a bytes object.
If the size argument is negative, read until EOF is reached. Return an empty bytes object at EOF.
-
read1
()¶ Read at most size bytes, returned as a bytes object.
If the size argument is negative or omitted, read until EOF is reached. Return an empty bytes object at EOF.
-
readable
()¶ Returns True if the IO object can be read.
-
readinto
()¶ Read bytes into buffer.
Returns number of bytes read (0 for EOF), or None if the object is set not to block and has no data to read.
-
readinto1
()¶
-
readline
()¶ Next line from the file, as a bytes object.
Retain newline. A non-negative size argument limits the maximum number of bytes to return (an incomplete line may be returned then). Return an empty bytes object at EOF.
-
readlines
()¶ List of bytes objects, each a line from the file.
Call readline() repeatedly and return a list of the lines so read. The optional size argument, if given, is an approximate bound on the total number of bytes in the lines returned.
-
seek
()¶ Change stream position.
Seek to byte offset pos relative to position indicated by whence:
- 0 Start of stream (the default). pos should be >= 0;
- 1 Current position - pos may be negative;
- 2 End of stream - pos usually negative.
Returns the new absolute position.
-
seekable
()¶ Returns True if the IO object can be seeked.
-
tell
()¶ Current file position, an integer.
-
truncate
()¶ Truncate the file to at most size bytes.
Size defaults to the current file position, as returned by tell(). The current file position is unchanged. Returns the new size.
-
writable
()¶ Returns True if the IO object can be written.
-
write
()¶ Write bytes to file.
Return the number of bytes written.
-
writelines
()¶ Write lines to the file.
Note that newlines are not added. lines can be any iterable object producing bytes-like objects. This is equivalent to calling write() for each element.
-
-
warcat.util.
append_index_filename
(path)[source]¶ Adds _index_xxxxxx to the path.
It uses the basename (the filename) of the path to generate the hex hash digest suffix.
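The naming scheme can be illustrated with a short sketch. The helper `append_index_suffix` is hypothetical, and both the hash algorithm (SHA-1 here) and the way the digest is truncated to six hex characters are assumptions made for the example; the real function may derive the suffix differently.

```python
import hashlib
import os.path


def append_index_suffix(path):
    """Append '_index_' plus six hex digits derived from the basename (sketch)."""
    name = os.path.basename(path)
    # Assumption: SHA-1 of the UTF-8 basename, truncated to 6 hex characters.
    suffix = hashlib.sha1(name.encode("utf-8")).hexdigest()[:6]
    return "{}_index_{}".format(path, suffix)


renamed = append_index_suffix("example.com/images")
same_basename = append_index_suffix("other.org/images")
```

Because only the basename feeds the hash, two paths that end in the same filename receive the same suffix, which keeps the renaming deterministic.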
-
warcat.util.
copyfile_obj
(source, dest, bufsize=4096, max_length=None, write_attr_name='write')[source]¶ Like shutil.copyfileobj() but with a limit on how much to copy
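A length-limited copy loop of this kind might look like the following sketch. `copy_limited` is a hypothetical stand-in: it leaves out the write_attr_name parameter, and returning the number of bytes copied is a choice made for this example rather than documented behaviour.

```python
import io


def copy_limited(source, dest, bufsize=4096, max_length=None):
    """Copy from source to dest, stopping after max_length bytes if given."""
    copied = 0
    while max_length is None or copied < max_length:
        # Never request more than what is still allowed by the limit.
        want = bufsize if max_length is None else min(bufsize, max_length - copied)
        chunk = source.read(want)
        if not chunk:
            break  # source exhausted
        dest.write(chunk)
        copied += len(chunk)
    return copied


src = io.BytesIO(b"0123456789")
dst = io.BytesIO()
count = copy_limited(src, dst, bufsize=4, max_length=7)
```

Capping each read at the remaining allowance leaves the source positioned exactly at the end of the copied range, which matters when the next record follows immediately.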
-
warcat.util.
find_file_pattern
(file_obj, pattern, bufsize=512, limit=4096, inclusive=False)[source]¶ Find the offset from current position of pattern
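A buffered pattern search relative to the current position can be sketched as below. `find_pattern_offset` is hypothetical. It assumes that `inclusive` means the returned offset includes the length of the pattern, that `None` signals no match within `limit` bytes, and that the file position is restored afterwards; none of these details are confirmed by the description above.

```python
import io


def find_pattern_offset(file_obj, pattern, bufsize=512, limit=4096, inclusive=False):
    """Return the offset of `pattern` from the current position, or None (sketch)."""
    start = file_obj.tell()
    window = b""
    try:
        while len(window) < limit:
            chunk = file_obj.read(min(bufsize, limit - len(window)))
            if not chunk:
                break  # end of file
            # Re-scan only the tail that could hold a match touching the
            # new chunk, so matches straddling two reads are still found.
            search_from = max(0, len(window) - len(pattern) + 1)
            window += chunk
            index = window.find(pattern, search_from)
            if index != -1:
                return index + len(pattern) if inclusive else index
        return None
    finally:
        # Assumption of this sketch: leave the file where it was.
        file_obj.seek(start)


data = io.BytesIO(b"junkWARC/1.0\r\nA: b\r\n\r\nbody")
data.seek(4)
offset = find_pattern_offset(data, b"\r\n\r\n", bufsize=3)
offset_inclusive = find_pattern_offset(data, b"\r\n\r\n", bufsize=3, inclusive=True)
too_short = find_pattern_offset(data, b"\r\n\r\n", limit=10)
```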
-
warcat.util.
rename_filename_dirs
(dest_filename)[source]¶ Renames files if they conflict with a directory in given path.
If a file has the same name as the directory, the file is renamed using append_index_filename().