avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-27) Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
Date Thu, 07 May 2009 08:03:31 GMT

    [ https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706754#action_12706754 ]
Scott Carey commented on AVRO-27:

An outsider here -- I've got an idea on how to avoid the performance pitfalls of COBS' byte-by-byte
nature, and as I thought it through I spotted many other opportunities for enhancement, since
larger chunks afford many more bits in the code that can be used for things other than the
length of the following literal chunk.

h2. Proposal -- COLS, a modification of COBS 
h3. (for greater performance and extensibility for large data streams) 

Java is particularly bad at byte-by-byte operations.  The COBS paper clearly indicates that
the design was intended for embedded systems and low-bandwidth links such as telephone lines,
where byte-by-byte processing of the whole payload is already mandatory.

Doing so here would be a performance bottleneck in Java.  Some simple tests can be constructed
to prove or disprove this claim.
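
As a starting point, such a test could compare a byte-wise zero scan against a long-wise scan.  This is a rough sketch rather than a rigorous benchmark (no JIT warmup), and the class and method names are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.util.Arrays;

public class ZeroScanBench {
    // COBS-style: examine every byte individually.
    static int countZeroBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == 0) count++;
        }
        return count;
    }

    // COLS-style: examine the data 8 bytes at a time.
    static int countZeroLongs(byte[] data) {
        LongBuffer longs = ByteBuffer.wrap(data).asLongBuffer();
        int count = 0;
        while (longs.hasRemaining()) {
            if (longs.get() == 0L) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 24];      // 16MB of non-zero bytes
        Arrays.fill(data, (byte) 1);
        Arrays.fill(data, 64, 72, (byte) 0);  // plant one aligned zero long

        long t0 = System.nanoTime();
        int zb = countZeroBytes(data);
        long t1 = System.nanoTime();
        int zl = countZeroLongs(data);
        long t2 = System.nanoTime();
        System.out.println("byte scan: " + zb + " zero bytes, " + (t1 - t0) / 1000 + "us");
        System.out.println("long scan: " + zl + " zero longs, " + (t2 - t1) / 1000 + "us");
    }
}
```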

I propose that rather than use COBS, we use COLS or COWS -- that is, Constant Overhead Long
Stuffing or Constant Overhead Word Stuffing.

This would be inefficient if we expect most payloads to be small (less than 256 bytes), but
I suspect most Hadoop-related payloads are large, and often very large.

I favor stuffing Longs rather than Ints, since most systems will soon be running 64 bit JVMs.
Sun's next JRE release has Object Pointer Compression, which makes the memory overhead of a
64 bit JVM very small compared to a 32 bit JVM, and performance is generally faster than the
32 bit JVM due to native 64 bit operations and more registers (for x86-64 at least).

I will describe the proposal below assuming a translation of COBS to COLS, from 1-byte-at-a-time
to 8-bytes-at-a-time encoding.  However, a 4 byte variant is very similar and may be preferable.

h3. Proposed Changes -- Simple Block format with COLS

|| name || format || length in bytes || value || meaning ||
| sync | long | *8* | *0L* | The sync *long* serves as a clear marker for the start of a block |
| type | byte | 1 | non-zero | The type field expresses whether the block is for _metadata_
or _normal_ data. *note - if this is only ever going to be a binary flag, it can be packed
into the length or sequence number as a sign bit.*   *However, it is decoding-performance
critical to keep the non-COLS header 8 byte aligned.* |
| block sequence number | 3 byte unsigned int | 3 | 0 - 2^24 | The block sequence number
-- a client can use this to resume a stream from the last successful block. This may not be
needed if the metadata blocks take care of this. |
| length | fixed 4 byte signed int | 4 | *>= 0* | The length field expresses the
number of bytes of COLS_payload data.  Useful for skipping ahead to the next block. |
| COLS_payload | *COLS* | *length as above* | see COLS description below | The data in this
block, encoded. |

The above would cap the stream length at 2GB * 16M = 32PB.  There is room to increase
this significantly by taking bits from the type and giving them to the block count.  2GB
blocks are rather unlikely for now, however -- as are multi-PB streams.

h4. Discussion
* The entire stream would need to be 8 byte aligned in order to process it cleanly with something
like java.nio.LongBuffer.  This would include metadata blocks.
* The sequence is assumed to be in network-order.   Endianness can be handled and is not discussed
in detail here.
* The type can likely be encoded in a single bit in the block sequence number or length field.
 If more than two types of blocks are expected, more bits can be reserved for future use.
* The length can be stored as the number of longs rather than bytes (bytes / 8) since the
COLS payload is a multiple of 8 bytes.
* The COLS payload here differs from the original proposal.  It will have an entire COBS-like
stream, with possibly many COLS code markers (at least one per 0L value in the block data).
* One may want to have both the encoded length above, and the decoded length (or a checksum)
as extra data validation.  Perhaps even 4 types:  METADATA, METADATA_CSUM, NORMAL, NORMAL_CSUM
-- where  the ordinary variants store the length (fast, but less reliable) and the _CSUM variants
store a checksum (slower, but highly reliable).
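
A sketch of how such a header could be read, assuming the packing above in network byte order (type in the high byte of the second 8-byte word, the 3 byte sequence number next, the 4 byte length in the low bytes).  Names are illustrative, not a committed API:

```java
import java.nio.ByteBuffer;

public class BlockHeader {
    final int type;      // 1 byte: metadata vs. normal
    final int sequence;  // 3 byte unsigned block sequence number
    final int length;    // 4 byte COLS_payload length in bytes

    BlockHeader(int type, int sequence, int length) {
        this.type = type;
        this.sequence = sequence;
        this.length = length;
    }

    // Unpack the second 8-byte word of the header: type in the high byte,
    // sequence in the next 3 bytes, length in the low 4 bytes.
    static BlockHeader parse(long word) {
        int type = (int) (word >>> 56);
        int sequence = (int) ((word >>> 32) & 0xFFFFFF);
        int length = (int) word;
        if (type == 0) throw new IllegalStateException("type must be non-zero");
        return new BlockHeader(type, sequence, length);
    }

    // Read a full 16-byte header: the 0L sync word, then the packed word.
    static BlockHeader read(ByteBuffer in) {
        if (in.getLong() != 0L) throw new IllegalStateException("missing sync marker");
        return parse(in.getLong());
    }
}
```

Note that the whole header is two aligned 8-byte reads, which is what makes the 8 byte alignment requirement cheap to exploit.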

h3. Basic COBS to COLS description
COBS describes a byte-by-byte encoding in which a zero byte cannot appear; a set of codes
is used to encode runs of data that do not contain a zero byte.  All codes but one
have an implicit trailing zero.  The last code is assumed to have no implicit zero regardless
of its value.

COLS is a simple extension of this scheme to 64 bit chunks.  In its base form, it does nothing
more than work with larger chunks:

|| COLS Code (Long, 8 bytes) || Followed by || Meaning || 
| 0L | N/A | (not allowed)|
| 1L | nothing | A single zero Long |
| 2L | one long (8 bytes) | The single data long, followed by a trailing zero long * |
| 3L | two longs (16 bytes) | The pair of data longs, followed by a trailing zero long * |
| nL | (n-1) longs | The (n-1) longs, followed by a trailing zero long * |
| MAX \*\* | MAX - 1 longs | MAX -1 longs, with no trailing zero |

\* The last code in the sequence (which can be identified by the length header or a 0L indicating
the start of the next block) does NOT have an implicit trailing zero. 
\*\* MAX needs to be chosen, and can't realistically be very large since encoding requires
an arraycopy of size (MAX - 1) * 8

The COLS_payload has multiple COLS Code entries (and literals), up to the length specified
in the header (where a 0L should then occur).
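
The base decode loop is then straightforward.  Below is a sketch of the table above, operating on a long[] for clarity rather than a LongBuffer, and ignoring the MAX cap:

```java
import java.util.ArrayList;
import java.util.List;

public class ColsDecoder {
    // Base COLS form: code n is followed by n-1 literal longs, with an
    // implicit trailing 0L after every code except the last one.
    static long[] decode(long[] encoded) {
        List<Long> out = new ArrayList<>();
        int pos = 0;
        while (pos < encoded.length) {
            long code = encoded[pos++];
            if (code == 0L) throw new IllegalStateException("0L code not allowed");
            int literals = (int) (code - 1);
            for (int i = 0; i < literals; i++) {
                out.add(encoded[pos++]);
            }
            boolean last = pos >= encoded.length;
            if (!last) out.add(0L);  // implicit trailing zero
        }
        long[] result = new long[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }
}
```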

However, there are drawbacks to using such a large chunk without other modifications to COBS:
# 64 bits is far too large for a length field.  For encoding, a COLS code block must fit in
RAM and, for performance, should probably fit in half an L2 cache.  For decoding, however,
the COLS code length is irrelevant.
# If the size of the data encoded is not a multiple of 8 bytes, we need a mechanism to encode
that up to 7 trailing bytes should be truncated (3 bits).
# For most blocks, the overhead will be exactly 8 bytes (unless the block has a trailing 0L).
# Very long data streams without a zero Long are unlikely, so very large chunk lengths are
not very useful. 

There are also benefits, however.  The above suggests that most of the 8 byte COLS code
space is not needed to encode length.  Much can be done with this!
Some thoughts:
* The 3 bits needed to define the truncation behavior can be stored in the COLS code.
* The overhead can be reduced, by encoding short trailing sequences into the upper bits rather
than purely truncating -- e.g. you can append 2 bytes instead of truncating 6.
* Rudimentary run-length encoding or other lightweight compression can be done with the extra
bits (completely encoder-optional).
* We can remove the requirement that most codes have an implicit trailing zero, and encode
that in one of the extra bits.

If only the lower 2 bytes of an 8 byte COLS code represent the size (MAX = 2^16 - 1), then
the max literal size is 512KB - 16B.  If we remove the implicit trailing zero, an encoder can
optionally encode smaller literal sequences (perhaps for performance, or compression).
What can be done with the remaining 48 bits?
Some ideas:
# The highest 4 bytes can represent data to append to the literal.  In this way, half of the
size overhead of the encoding is removed.  This should generally only apply to the last COLS
code in the block (for performance reasons and to maintain 8 byte alignment on all arraycopy
operations), but it is encoder-optional.
# The next bit represents whether the COLS block has an implicit 0L appended.
# A bit can be used to signify endianness (this might be a better fit for the block header
or stream metadata -- detecting zeros works without knowing the endianness).
# The next three bits can represent how much data is truncated from or appended to the literal
(before the optional implicit 0L):

|| value || meaning ||
| 000 | do not truncate or append |
| 100 | append all 4 leading bytes in the COLS code after the literal |
| 111 | append the first 3 leading bytes in the COLS code after the literal |
| 110 | append the first 2 leading bytes in the COLS code after the literal |
| 101 | append the leading byte in the COLS code after the literal |
| 011 | truncate the last 3 bytes of the literal |
| 010 | truncate the last 2 bytes of the literal |
| 001 | truncate the last byte of the literal |

This leaves us with 12 bits.  I propose that these be used for rudimentary (optional) compression:

* Option A:
** Run length only -- the 12 bits represent the number of times to repeat the literal.  Or
4 bits are the number of COLS chunks backwards (including this one) to repeat, and 8 bits
is the number of repeats.  Or ... some other form of emitting copies of entire COLS chunks.
* Option B:
** Some form of LZ-like compression that copies in 8 byte chunks -- 4 bits represent the number
of Longs to copy (so the max match size is 15 * 8 bytes), and 8 bits represent the number of
Longs backwards (from the end of this COLS chunk) to begin that copy (up to 2KB).  Because
of the truncation/append feature, this is not constrained to 8-byte-aligned copies on the
output, but the encoded format is entirely 8 byte aligned and all copies are multiples of
8 bytes.  I would not be surprised if this were as fast as LZO or faster, since it is very
similar but operates in a more chunky fashion.  Compression levels would not be that great,
but as with most similar algorithms, the encoder can do more work to search for matches.
 Decoding uncompressed data should be essentially free (if the 4 bits are 0, do nothing --
and most COLS blocks would be fairly large, so this check does not occur that frequently).
* Option C:
** Reserve those 12 bits for future use / research

Alternatively, one to four of the extra bytes used for the "append" feature can be reassigned
to provide more than 12 bits of compression metadata.
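
The Option B copy itself reduces to a back-reference copy performed a long at a time.  A sketch (whether overlapping copies are permitted, giving LZ77-style run repetition, would be part of the spec -- this version allows them):

```java
public class LongCopy {
    // Copy `count` longs starting `distance` longs back from the current
    // output position, one long (8 bytes) at a time.  When the source range
    // overlaps the destination (distance < count), earlier copied longs are
    // re-read, which repeats the pattern -- the classic LZ77 behavior.
    static void copyBack(long[] out, int outPos, int distance, int count) {
        int src = outPos - distance;
        for (int i = 0; i < count; i++) {
            out[outPos + i] = out[src + i];
        }
    }
}
```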

So, with the above modifications, the COLS code looks like this:

The COLS code is 8 bytes.  The low 16 bits encode basic meaning.
An 8 byte COLS code cannot be 0L.
|| Code & 0xFFFF (low 2 bytes) || Followed by || Meaning || 
| 0x0000 | N/A | (not allowed)|
| 0x0001 | nothing | A single zero Long |
| 0x0002 | one long (8 bytes) | The single data long |
| 0x0003 | two longs (16 bytes) | The pair of data longs |
| n | (n-1) longs |  The (n-1) longs |
| 0xFFFF | 2^16 - 2 longs | 2^16 - 2 longs |

The next portion determines the state of truncation or appending.
Two options are listed -- truncation only, and truncation/appending.  The appending could
be up to 5 bytes if we squeeze all the rest of the space.  The example below is for up to
4 bytes appended or 3 bytes truncated.

{code}appendCode = (Code >> 28) & 0xF;{code}

|| appendCode & 0x7 || Append (+) or truncate (-) || From || truncate only option ||
| 0x0 | 0 | nothing| 0 |
| 0x1 | (-)1 | nothing | (-)1 |
| 0x2 | (-)2 | nothing | (-)2 |
| 0x3 | (-)3 | nothing | (-)3 |
| 0x4 | (+)1 | Code >>> 56 | (-)4 |
| 0x5 | (+)2 | Code >>> 48 | (-)5 |
| 0x6 | (+)3 | Code >>> 40 | (-)6 |
| 0x7 | (+)4 | Code >>> 32 | (-)7 |

It may be wiser to choose an option in between these.  If 3 bytes are chosen as the max arbitrary
append length, with up to 4 truncated, 20 bits are left for other purposes rather than 12.  The
average COLS chunk would be one byte larger.

|| AppendCode & 0x8 || Append 0L ||
| 0 | do not append 0L (8 zero bytes) |
| 1 | do append 0L (8 zero bytes) | 
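
Decoding the nibble per the append/truncate variant of the tables above might look like the following sketch (names illustrative; `adjustment` returns the byte count, negative for truncation, positive for appending):

```java
public class AppendCode {
    // Extract the 4-bit append/truncate nibble from bits 28-31 of the code.
    // Low 3 bits: 0x0 = no change, 0x1-0x3 = truncate 1-3 bytes,
    // 0x4-0x7 = append 1-4 bytes taken from the high bytes of the code.
    static int adjustment(long code) {
        int low3 = (int) (code >>> 28) & 0x7;
        if (low3 <= 0x3) return -low3;  // truncate 0-3 bytes
        return low3 - 3;                // append 1-4 bytes
    }

    // High bit of the nibble (bit 31): whether an implicit 0L follows.
    static boolean appendZeroLong(long code) {
        return ((code >>> 31) & 1) != 0;
    }
}
```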
h2. Encoding
The writer would perform processing in 8 byte chunks until the end of the block, where some
byte-by-byte processing would occur.  Compression options would be entirely the writer's choice.
The state overhead can be very low or large at the writer's whim.  Larger COLS chunk sizes
require more state (and larger arraycopies), and any compression option adds state overhead.
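
A minimal sketch of the base-form encoder (no MAX cap, no append/truncate or compression bits), just to show the shape of the state involved -- one buffered run of non-zero longs:

```java
import java.util.ArrayList;
import java.util.List;

public class ColsEncoder {
    // Base COLS form: each run of non-zero longs up to a 0L becomes code
    // (runLength + 1) followed by the run; the 0L itself is implied by the
    // code.  A final code is always emitted, and since the last code carries
    // no implicit zero, a trailing 0L in the data round-trips correctly.
    static long[] encode(long[] data) {
        List<Long> out = new ArrayList<>();
        List<Long> run = new ArrayList<>();
        for (long v : data) {
            if (v == 0L) {                 // run ends: the code implies this 0L
                out.add(run.size() + 1L);
                out.addAll(run);
                run.clear();
            } else {
                run.add(v);
            }
        }
        out.add(run.size() + 1L);          // final code: no implicit zero
        out.addAll(run);
        long[] result = new long[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }
}
```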

h2. Decoding
Decoding in all circumstances reads data in 8 byte chunks.  Copies occur in 8 byte chunks,
8 byte aligned, except at the end of a block whose payload is not a multiple of 8 bytes.
An encoder can cause copy destinations (but not sources) to be misaligned if certain special
options (compression) or intentionally misaligned encoding are used.  Generally, an encoder
can choose to make all but the last few bytes of the last block in the stream aligned.
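
For example, a reader that only wants to count or skip blocks never touches the COLS payload at all -- it reads two aligned longs per block and seeks past the payload.  A sketch, assuming the header packing proposed earlier (names illustrative):

```java
import java.nio.ByteBuffer;

public class BlockSkipper {
    // Walk the stream block by block using only the fixed 16-byte header
    // (8-byte sync 0L, then one packed long whose low 4 bytes are the
    // payload length in bytes), skipping every COLS payload.
    static int countBlocks(ByteBuffer in) {
        int blocks = 0;
        while (in.remaining() >= 16) {
            if (in.getLong() != 0L) throw new IllegalStateException("bad sync");
            long word = in.getLong();
            int length = (int) word;               // low 4 bytes: payload length
            in.position(in.position() + length);   // seek past the payload
            blocks++;
        }
        return blocks;
    }
}
```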

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
> ----------------------------------------------------------------------------------------
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
> Object Container Files could use a 1 byte sync marker (set to zero) using zig-zag and
COBS encoding within blocks to efficiently escape zeros from the record data.
> h4. Zig-Zag encoding
> With zig-zag encoding, only the value 0 (zero) gets encoded into a value with a single
zero byte.  This property means that we can write any non-zero zig-zag long inside a block
without concern about creating an unintentional sync byte.
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block payload.
 You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the details about COBS
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a type-length-value format
internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the
start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field expresses whether
the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field expresses the
number of bytes until the next record (including the cobs code and sync byte).  Useful for
skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the
block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | The payload
of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the
end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the block. 
The overhead for COBS encoding is very small in terms of the in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a writer.  The
reader could (for example) find a _metadata_ block by doing the following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next block and
goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
