arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [Discuss][Format] Checksum/Hash signature for data
Date Wed, 06 Mar 2019 06:06:55 GMT
Doing some light research it looks xxhash has better cross-platform support
as is faster then a vanilla implementation of crc32 [1].  However, crc32c
(a slightly different crc32 algorithm) is hardware accelerated on newer
(circa 2016) Intel CPUs [2] and is potentially faster.

[1] https://cyan4973.github.io/xxHash/
[2] https://github.com/Cyan4973/xxHash/issues/62

On Tue, Mar 5, 2019 at 9:55 PM Micah Kornfield <emkornfield@gmail.com>
wrote:

> Thanks Philipp,
>
> Yeah, I probably shouldn't have said SHA1 either :)    I'm not too
> concerned with a particular hash/checksum implementation.  It would be good
> to have at least 1 or 2 well supported ones, and a migration path to
> support more if necessary without breaking file/streaming formats for
> backwards compatibility.
>
> Best,
> -Micah
>
> On Tue, Mar 5, 2019 at 9:47 PM Philipp Moritz <pcmoritz@gmail.com> wrote:
>
>> (I meant to say SHA256 instead of SHA1)
>>
>> On Tue, Mar 5, 2019 at 9:45 PM Philipp Moritz <pcmoritz@gmail.com> wrote:
>>
>>> Hey Micah,
>>>
>>> in plasma, we are using xxhash to compute a hash/checksum [1] (it is
>>> computed in parallel using multiple threads) and have good experience with
>>> it -- all data in Ray is checksummed this way. Initially there were
>>> problems with uninitialized bits in the arrow representation, but that has
>>> been resolved a while back, so there should be no blocker for this. It
>>> would also be great to benchmark xxhash against CRC32 and see how they
>>> compare performance wise. Initially we used SHA1 but there was non-trivial
>>> overhead. Maybe there is a better implementation out there (we used
>>> https://github.com/ray-project/ray/blob/master/src/ray/thirdparty/sha256.c
>>> ).
>>>
>>> Best,
>>> Philipp.
>>>
>>> [1]
>>> https://github.com/apache/arrow/blob/master/cpp/src/plasma/client.cc#L684
>>>
>>> On Tue, Mar 5, 2019 at 9:33 PM Micah Kornfield <emkornfield@gmail.com>
>>> wrote:
>>>
>>>> Hi Arrow Dev,
>>>> As we expand the use-cases for Arrow to move it more across system
>>>> boundaries (Flight) and make it live longer (e.g. in the file format),
>>>> it
>>>> seems to make sense to build in a mechanism for data integrity
>>>> verification
>>>> (e.g. a checksum like CRC32 or in some cases a cryptographic hash like
>>>> SHA1).
>>>>
>>>> This can be done a backwards compatible manner for the actual data
>>>> buffers
>>>> by adding metadata to the headers (this could be a use-case for custom
>>>> metadata but I would prefer to make it explicit).  However, to make
>>>> sure we
>>>> have full coverage, we would need to augment the stream [1] to be
>>>> something
>>>> like:
>>>>
>>>> <metadata_size: int32>
>>>> <metadata_flatbuffer: bytes>
>>>> <signature_size: int16>
>>>> <metadata signature>
>>>> <padding>
>>>> <message body>
>>>>
>>>> I don't think we should require implementations to actual use this
>>>> functionality but we should make it a possibility (signature size could
>>>> be
>>>> zero meaning no checksum/hash is provided) and have it be standardized
>>>> if
>>>> possible.
>>>>
>>>> Thoughts?
>>>>
>>>> Sorry if this has already been discussed but I could find anything from
>>>> searching JIRA or the mailing list archive, and it doesn't look like it
>>>> is
>>>> in the format spec.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://arrow.apache.org/docs/ipc.html
>>>>
>>>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message