Mailing-List: contact avro-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: avro-dev@hadoop.apache.org
Message-ID: <62307513.1245787087675.JavaMail.jira@brutus>
Date: Tue, 23 Jun 2009 12:58:07 -0700 (PDT)
From: "Scott Carey (JIRA)" <jira@apache.org>
To: avro-dev@hadoop.apache.org
Subject: [jira] Commented: (AVRO-27) Consistent Overhead Byte Stuffing
 (COBS) encoded block format for Object Container Files
In-Reply-To: <1863661385.1241640750458.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723273#action_12723273 ] 

Scott Carey commented on AVRO-27:
---------------------------------

I agree, COBS-like encoding is only useful for streaming data where a specific character or word must be avoided which is a format issue.

If all that is needed is identifying block boundaries, there are other methods.

A "magic number" approach can be collision proof by detecting the collision:  On encode, look for the magic number and if present, follow it with a 'not at the end of the block' word; at the end of the block place the magic number and a 'end of block' word.  On decode look for the magic number and discard the following word, if the following word is the end of block word also discard the magic word.  COBS came about because the worst case scenario for a magic word approach is poor, and if the size of the magic word is small (one byte) the worst case is likely.

This might prove very useful at some point.  Some of the general optimization findings here will be useful somewhere.

> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>         Attachments: COBSCodec.java, COBSCodec2.java, COBSPerfTest.java, COLSCodec.java, COLSCodec2.java, COWSCodec.java, COWSCodec2.java, COWSCodec3.java
>
>
> Object Container Files could use a 1 byte sync marker (set to zero) using zig-zag and COBS encoding within blocks to efficiently escape zeros from the record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value with a single zero byte.  This property means that we can write any non-zero zig-zag long inside a block within concern for creating an unintentional sync byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block payload.  You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the details about COBS encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a type-length-value format internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field expresses whether the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field expresses the number of bytes until the next record (including the cobs code and sync byte).  Useful for skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | The payload of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the block.  The overhead for COBS encoding is very small in terms of the in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a writer.  The reader could (for example) find a _metadata_ block by doing the following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next block and goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.