avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-27) Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
Date Fri, 08 May 2009 17:51:45 GMT

    [ https://issues.apache.org/jira/browse/AVRO-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707449#action_12707449
] 

Scott Carey commented on AVRO-27:
---------------------------------

{quote}I'm curious what other Java experts (since I'm not) out there feel about COBS in Java
. It sounds from Scott's comment that byte stuffing in Java is a non-starter.{quote}

That really depends on the performance requirement.

If the requirement is to be able to encapsulate data and stream at near Gigabit ethernet speed
or teamed Gigabit (~100MB/sec to 200MB/sec), it will get in the way.
If other things already significantly limit streaming capability then it may not be a large
incremental overhead.
For example, if the Avro serialization process is already going byte-by-byte somewhere else,
this could 'piggyback' almost for free -- but it would have to be embedded in that other code,
in the same loop.

I also want to highlight that the byte-by-byte streaming in Java can be compared to larger
chunk sizes with a fairly simple benchmark to validate (or disprove) my claims that it is
slow in comparison.

The data from LBL is useful.  It should be fairly easy to change that to a larger chunk size
and compare on a new JVM.  

I'll try to characterize this on my own time this weekend.


> Consistent Overhead Byte Stuffing (COBS) encoded block format for Object Container Files
> ----------------------------------------------------------------------------------------
>
>                 Key: AVRO-27
>                 URL: https://issues.apache.org/jira/browse/AVRO-27
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Matt Massie
>
> Object Container Files could use a 1 byte sync marker (set to zero) using zig-zag and
COBS encoding within blocks to efficiently escape zeros from the record data.
> h4. Zig-Zag encoding
> With zig-zag encoding only the value of 0 (zero) gets encoded into a value with a single
zero byte.  This property means that we can write any non-zero zig-zag long inside a block
within concern for creating an unintentional sync byte. 
> h4. COBS encoding
> We'll use COBS encoding to ensure that all zeros are escaped inside the block payload.
 You can read http://www.sigcomm.org/sigcomm97/papers/p062.pdf for the details about COBS
encoding.
> h1. Block Format
> All blocks start and end with a sync byte (set to zero) with a type-length-value format
internally as follows:
> || name || format || length in bytes || value || meaning ||
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the
start of a block |
> | type | zig-zag long | variable | must be non-zero | The type field expresses whether
the block is for _metadata_ or _normal_ data. |
> | length | zig-zag long | variable | must be non-zero | The length field expresses the
number of bytes until the next record (including the cobs code and sync byte).  Useful for
skipping ahead to the next block. |
> | cobs_code | byte | 1 | see COBS code table below | Used in escaping zeros from the
block payload |
> | payload | cobs-encoded | Greater than or equal to zero | all non-zero bytes | The payload
of the block |
> | sync | byte | 1 | always 0 (zero) | The sync byte serves as a clear marker for the
end of the block |
> h2. COBS code table 
> || Code || Followed by || Meaning | 
> | 0x00 | (not applicable) | (not allowed ) |
> | 0x01 | nothing | Empty payload followed by the closing sync byte |
> | 0x02 | one data byte | The single data byte, followed by the closing sync byte | 
> | 0x03 | two data bytes | The pair of data bytes, followed by the closing sync byte |
> | 0x04 | three data bytes | The three data bytes, followed by the closing sync byte |
> | n | (n-1) data bytes | The (n-1) data bytes, followed by the closing sync byte |
> | 0xFD | 252 data bytes | The 252 data bytes, followed by the closing sync byte |
> | 0xFE | 253 data bytes | The 253 data bytes, followed by the closing sync byte |
> | 0xFF | 254 data bytes | The 254 data bytes *not* followed by a zero. |
> (taken from http://www.sigcomm.org/sigcomm97/papers/p062.pdf)
> h1. Encoding
> Only the block writer needs to perform byte-by-byte processing to encode the block. 
The overhead for COBS encoding is very small in terms of the in-memory state required.
> h1. Decoding
> Block readers are not required to do as much byte-by-byte processing as a writer.  The
reader could (for example) find a _metadata_ block by doing the following:
> # Search for a zero byte in the file which marks the start of a record
> # Read and zig-zag decode the _type_ of the block
> #* If the block is _normal_ data, read the _length_, seek ahead to the next block and
goto step #2 again
> #* If the block is a _metadata_ block, cobs decode the data

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message