arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [C++]: Unexpected results from fixed_len_byte_array signed comparator
Date Tue, 02 Feb 2021 21:20:10 GMT
I think this is https://issues.apache.org/jira/browse/PARQUET-1655. Would
you like to provide a patch?

On Tue, Feb 2, 2021 at 12:42 PM g. g. grey <g.g.grey@gmail.com> wrote:

> Hi.
>
> I'm relatively new to arrow/parquet, so I'd appreciate help trying to
> figure out where I've gone awry.
>
> When writing integers as fixed_len_byte_arrays in parquet files, I've run
> into a scenario where the min/max statistics appear to be incorrect. I've
> debugged into the code and it looks like CompareHelper<FLBAType,
> is_signed=true> is being used for comparison. This seems to do a
> lexicographical comparison in which every byte is treated as signed. That
> is not what I expected: I expected only the most significant byte to be
> compared as signed, and all other bytes to be compared as unsigned.
>
> I've hacked together a small example to show what I'm talking about,
> focusing on the comparator itself.  I'm wondering if I'm going awry in
> creating my FLBA, if my understanding of the comparison itself is flawed,
> or if there is an issue with the comparison.
>
> Thanks for your help!
> ggg
>
>> namespace parquet {
>>
>> using schema::GroupNode;
>> using schema::NodePtr;
>> using schema::PrimitiveNode;
>>
>> namespace test {
>>
>> // ----------------------------------------------------------------------
>> // Test comparators
>>
>> void printArray(const unsigned char* arr) {
>>   for (int i = 0; i < 8; i++) {
>>     printf("[%d] 0x%x ", i, arr[i]);
>>   }
>>   printf("\n");
>> }
>>
>> TEST(Comparison, SignedFLBA_error1) {
>>   int size = 8;
>>   auto comparator = MakeComparator<FLBAType>(
>>       Type::FIXED_LEN_BYTE_ARRAY, SortOrder::SIGNED, size);
>>
>>   int64_t low = 1234;
>>   int64_t high = 0x8000;
>>
>>   // convert to big endian
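>>   int64_t lowBE = ::arrow::BitUtil::ToBigEndian(low);
>>   // Cast so the arguments match %llx on platforms where int64_t is long.
>>   printf("low 0x%llx lowBE 0x%llx\n", (unsigned long long)low,
>>          (unsigned long long)lowBE);
>>   printArray((unsigned char*)&lowBE);
>>
>>   int64_t highBE = ::arrow::BitUtil::ToBigEndian(high);
>>   printf("high 0x%llx highBE 0x%llx\n", (unsigned long long)high,
>>          (unsigned long long)highBE);
>>   printArray((unsigned char*)&highBE);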
>>
>>   FLBA lowBEFlba((uint8_t*)&lowBE);
>>   FLBA highBEFlba((uint8_t*)&highBE);
>>
>>   // compare. Uses CompareHelper<FLBAType, is_signed=true>
>>   // This fails but should return true b/c 1234 < 0x8000.
>>   ASSERT_TRUE(comparator->Compare(lowBEFlba, highBEFlba));
>> }
>>
>> }  // namespace test
>> }  // namespace parquet
>>
>
> The output from running the test
>
>> [==========] Running 2 tests from 1 test suite.
>> [----------] Global test environment set-up.
>> [----------] 2 tests from Comparison
>> [ RUN      ] Comparison.SignedFLBA_error1
>> low 0x4d2 lowBE 0xd204000000000000
>> [0] 0x0 [1] 0x0 [2] 0x0 [3] 0x0 [4] 0x0 [5] 0x0 [6] 0x4 [7] 0xd2
>> high 0x8000 highBE 0x80000000000000
>> [0] 0x0 [1] 0x0 [2] 0x0 [3] 0x0 [4] 0x0 [5] 0x0 [6] 0x80 [7] 0x0
>> /Users/kbhmr/Software/arrow/arrow-master/cpp/src/parquet/short_statistics_test.cc:89:
>> Failure
>> Value of: comparator->Compare(lowBEFlba, highBEFlba)
>>   Actual: false
>> Expected: true
>> [  FAILED  ] Comparison.SignedFLBA_error1 (0 ms)
>>
>
>
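
For reference, the difference described in the quoted message can be
sketched in standalone C++. This is not the Parquet implementation:
ToBigEndianBytes, LessAllBytesSigned, LessTwosComplementBE and
CheckReportedCase are hypothetical names used only for illustration, and the
sketch assumes the FLBA holds a big-endian two's-complement integer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: write v as 8 bytes, most significant byte first.
inline void ToBigEndianBytes(int64_t v, uint8_t out[8]) {
  uint64_t u = static_cast<uint64_t>(v);
  for (int i = 7; i >= 0; --i) {
    out[i] = static_cast<uint8_t>(u & 0xff);
    u >>= 8;
  }
}

// Signed order on a single byte. Flipping the sign bit maps the signed
// order onto the unsigned order, so no narrowing cast is needed.
inline bool SignedByteLess(uint8_t x, uint8_t y) {
  return static_cast<uint8_t>(x ^ 0x80) < static_cast<uint8_t>(y ^ 0x80);
}

// Behaviour described in the report: every byte is compared as signed.
inline bool LessAllBytesSigned(const uint8_t* a, const uint8_t* b, size_t n) {
  return std::lexicographical_compare(a, a + n, b, b + n, SignedByteLess);
}

// Behaviour the reporter expected: only the first (most significant) byte
// is compared as signed; the remaining bytes are compared as unsigned.
inline bool LessTwosComplementBE(const uint8_t* a, const uint8_t* b, size_t n) {
  if (n == 0) return false;
  if (a[0] != b[0]) return SignedByteLess(a[0], b[0]);
  return std::lexicographical_compare(a + 1, a + n, b + 1, b + n);
}

// Usage sketch with the two values from the report (1234 and 0x8000).
inline void CheckReportedCase() {
  uint8_t low[8];
  uint8_t high[8];
  ToBigEndianBytes(1234, low);
  ToBigEndianBytes(0x8000, high);
  // Byte 6 is 0x04 vs 0x80; as signed bytes that is 4 vs -128.
  assert(!LessAllBytesSigned(low, high, 8));
  // With only the first byte signed, 0x04 < 0x80 decides the result.
  assert(LessTwosComplementBE(low, high, 8));
}
```

With the values from the report, the all-signed comparison says 1234 is not
less than 0x8000, which matches the failing assertion, while the second
comparison gives the ordering the reporter expected.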
