arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Antoine Pitrou <anto...@python.org>
Subject Re: [Discuss][Format] Checksum/Hash signature for data
Date Mon, 18 Mar 2019 13:09:20 GMT

XXH3 (by the xxhash author) was recently presented, though it's still
experimental for now:
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

It is claimed to be significantly faster than xxhash, on all message sizes.

Regards

Antoine.


Le 06/03/2019 à 07:06, Micah Kornfield a écrit :
> Doing some light research it looks xxhash has better cross-platform support
> as is faster then a vanilla implementation of crc32 [1].  However, crc32c
> (a slightly different crc32 algorithm) is hardware accelerated on newer
> (circa 2016) Intel CPUs [2] and is potentially faster.
> 
> [1] https://cyan4973.github.io/xxHash/
> [2] https://github.com/Cyan4973/xxHash/issues/62
> 
> On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
> 
>> Thanks Philipp,
>>
>> Yeah, I probably shouldn't have said SHA1 either :)    I'm not too
>> concerned with a particular hash/checksum implementation.  It would be good
>> to have at least 1 or 2 well supported ones, and a migration path to
>> support more if necessary without breaking file/streaming formats for
>> backwards compatibility.
>>
>> Best,
>> -Micah
>>
>> On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmoritz@gmail.com> wrote:
>>
>>> (I meant to say SHA256 instead of SHA1)
>>>
>>> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmoritz@gmail.com> wrote:
>>>
>>>> Hey Micah,
>>>>
>>>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
>>>> computed in parallel using multiple threads) and have good experience with
>>>> it -- all data in Ray is checksummed this way. Initially there were
>>>> problems with uninitialized bits in the arrow representation, but that has
>>>> been resolved a while back, so there should be no blocker for this. It
>>>> would also be great to benchmark xxhash against CRC32 and see how they
>>>> compare performance wise. Initially we used SHA1 but there was non-trivial
>>>> overhead. Maybe there is a better implementation out there (we used
>>>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
>>>> ).
>>>>
>>>> Best,
>>>> Philipp.
>>>>
>>>> [1]
>>>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>>>>
>>>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfield@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Arrow Dev,
>>>>> As we expand the use-cases for Arrow to move it more across system
>>>>> boundaries (Flight) and make it live longer (e.g. in the file format),
>>>>> it
>>>>> seems to make sense to build in a mechanism for data integrity
>>>>> verification
>>>>> (e.g. a checksum like CRC32 or in some cases a cryptographic hash like
>>>>> SHA1).
>>>>>
>>>>> This can be done a backwards compatible manner for the actual data
>>>>> buffers
>>>>> by adding metadata to the headers (this could be a use-case for custom
>>>>> metadata but I would prefer to make it explicit).  However, to make
>>>>> sure we
>>>>> have full coverage, we would need to augment the stream [1] to be
>>>>> something
>>>>> like:
>>>>>
>>>>> <metadata_size: int32>
>>>>> <metadata_flatbuffer: bytes>
>>>>> <signature_size: int16>
>>>>> <metadata signature>
>>>>> <padding>
>>>>> <message body>
>>>>>
>>>>> I don't think we should require implementations to actual use this
>>>>> functionality but we should make it a possibility (signature size could
>>>>> be
>>>>> zero meaning no checksum/hash is provided) and have it be standardized
>>>>> if
>>>>> possible.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Sorry if this has already been discussed but I could find anything from
>>>>> searching JIRA or the mailing list archive, and it doesn't look like
it
>>>>> is
>>>>> in the format spec.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1] https://arrow.apache.org/docs/ipc.html
>>>>>
>>>>
> 

Mime
View raw message