avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (AVRO-712) define memcmp'able encoding
Date Thu, 23 Dec 2010 18:46:46 GMT

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974691#action_12974691
] 

Scott Carey commented on AVRO-712:
----------------------------------

Ints/Longs can be variable length, with the same space cost as before: The first byte in the
sequence has the sign bit first, inverted (0 for negative numbers, 1 for positive).  The second
bit is the 'is last byte in varint' bit -- 1 if there are more, 0 if it is the last byte.
 The remaining encoding is like before -- zig-zag packed absolute value of the number.  However,
if the number is negative we need to flip those bits as well as the 'is last byte in varint'
bits.  Both encoding and decoding are more expensive this way, but memcmp would work and the
space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a length and
get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering of variable length
fields is hard.  We could use a similar tactic to the varints above for Bytes and store 7
bits per byte, with a trailing marker bit indicating the end of the sequence.  This takes
more space, and is definitely more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At least for each
sequence of bytes that represents a character, it may already sort properly.  We just need
a way to identify the end of the string.  I suppose trailing marker bytes or bits will work
here too, with differing space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element would work.

Records:  I think if "ignored" fields were placed at the end, it would just work for sorting
(equal fields would be adjacent) but an equivalence test would fail (one processes as >
when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The extra time/space
cost of encoding and decoding in many use cases would outweigh the savings of sorting and
comparison.   For some languages, the cost of doing the comparison on the current binary format
is either not too great or could be optimized quite a bit further than they are now (C/C++/Java).
 More dynamic languages may benefit more from being able to call into an optimized memcmp
routine since their conditional logic and method call overhead is greater.  For those languages,
perhaps we should call out to an optimized Avro C or C++ stub for comparisons?  I wonder if
we put our efforts into more optimized comparison, serialization, and deserialization if we
would end up in a better place than adding a binary format.



> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the
Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message