avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-27) Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
Date Tue, 23 Jun 2009 19:58:07 GMT

    [ https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723273#action_12723273
] 

Scott Carey commented on AVRO-27:
---------------------------------

I agree, COBS-like encoding is only useful for streaming data where a specific character or
word must be avoided which is a format issue.

If all that is needed is identifying block boundaries, there are other methods.

A "magic number" approach can be collision proof by detecting the collision:  On encode, look
for the magic number and if present, follow it with a 'not at the end of the block' word;
at the end of the block place the magic number and a 'end of block' word.  On decode look
for the magic number and discard the following word, if the following word is the end of block
word also discard the magic word.  COBS came about because the worst case scenario for a magic
word approach is poor, and if the size of the magic word is small (one byte) the worst case
is likely.

This might prove very useful at some point.  Some of the general optimization findings here
will be useful somewhere.

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>         Attachments: COBSCodec.java, COBSCodec2.java, COBSPerfTest.java, COLSCodec.java,
COLSCodec2.java, COWSCodec.java, COWSCodec2.java, COWSCodec3.java
>
>
> Object Container Files could use a 1 byte sync marker (set to zero) using zig-zag and
COBS encoding within blocks to efficiently escape zeros from the record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value with a single
zero byte.  This property means that we can write any non-zero zig-zag long inside a block
within concern for creating an unintentional sync byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block payload.
 You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the details about COBS
encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a type-length-value format
internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the
start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field expresses whether
the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field expresses the
number of bytes until the next record (including the cobs code and sync byte).  Useful for
skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the
block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | The payload
of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the
end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the block. 
The overhead for COBS encoding is very small in terms of the in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a writer.  The
reader could (for example) find a _metadata_ block by doing the following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next block and
goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message