arrow-user mailing list archives

From "g. g. grey" <g.g.g...@gmail.com>
Subject [C++]: Unexpected results from fixed_len_byte_array signed comparator
Date Tue, 02 Feb 2021 20:42:06 GMT
Hi.

I'm relatively new to arrow/parquet, so I'd appreciate help trying to
figure out where I've gone awry.

When writing integers as fixed_len_byte_arrays in Parquet files, I've run
into a scenario where the min/max statistics appear to be incorrect.
Stepping through the code in a debugger, it looks like
CompareHelper<FLBAType, is_signed=true> is being used for the comparison.
This seems to do a lexicographical compare that treats every byte as
signed. That is counter to what I expected, which is that only the most
significant byte would be compared as signed and that all other bytes
would be compared as unsigned.
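
To make the two interpretations concrete, here is a minimal standalone sketch that does not depend on Arrow. The names LessSignedBytes and LessBigEndianTwosComplement are hypothetical and are not Arrow or Parquet APIs; the first only approximates the behaviour described above, and the second is the ordering the message expected, assuming the value is stored as big-endian two's complement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an Arrow API): lexicographical comparison in
// which every byte is interpreted as a signed 8-bit value.
bool LessSignedBytes(const uint8_t* a, const uint8_t* b, std::size_t len) {
  assert(a != nullptr && b != nullptr);
  const int8_t* sa = reinterpret_cast<const int8_t*>(a);
  const int8_t* sb = reinterpret_cast<const int8_t*>(b);
  return std::lexicographical_compare(sa, sa + len, sb, sb + len);
}

// Hypothetical helper (not an Arrow API): ordering for big-endian two's
// complement integers. Only the first (most significant) byte carries the
// sign; the remaining bytes are compared as unsigned values.
bool LessBigEndianTwosComplement(const uint8_t* a, const uint8_t* b,
                                 std::size_t len) {
  assert(a != nullptr && b != nullptr);
  if (len == 0) return false;
  const int8_t a0 = static_cast<int8_t>(a[0]);
  const int8_t b0 = static_cast<int8_t>(b[0]);
  if (a0 != b0) return a0 < b0;
  return std::lexicographical_compare(a + 1, a + len, b + 1, b + len);
}
```

The two functions agree whenever the first differing byte is the leading one, or when both differing bytes are below 0x80. They disagree when a non-leading byte has its high bit set, which is the case for 0x8000 stored in 8 bytes.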

I've hacked together a small example to show what I'm talking about,
focusing on the comparator itself.  I'm wondering if I'm going awry in
creating my FLBA, if my understanding of the comparison itself is flawed,
or if there is an issue with the comparison.

Thanks for your help!
ggg

> namespace parquet {
>
> using schema::GroupNode;
> using schema::NodePtr;
> using schema::PrimitiveNode;
>
> namespace test {
>
> // ----------------------------------------------------------------------
> // Test comparators
>
>   void printArray(const unsigned char * arr)
>   {
>     for(int i = 0; i < 8; i++ )
>       {
>         printf("[%d] 0x%x ", i, arr[i]);
>       }
>     printf("\n");
>   }
>
> TEST(Comparison, SignedFLBA_error1) {
>   int size = 8;
>   auto comparator = MakeComparator<FLBAType>(Type::FIXED_LEN_BYTE_ARRAY,
>                                              SortOrder::SIGNED, size);
>
>   int64_t low = 1234;
>   int64_t high = 0x8000;
>
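>   // Convert to big endian. The casts make the arguments match %llx on
>   // platforms where int64_t is not long long.
>   int64_t lowBE = ::arrow::BitUtil::ToBigEndian(low);
>   printf("low 0x%llx lowBE 0x%llx\n", (unsigned long long)low,
>          (unsigned long long)lowBE);
>   printArray((unsigned char*)&lowBE);
>
>   int64_t highBE = ::arrow::BitUtil::ToBigEndian(high);
>   printf("high 0x%llx highBE 0x%llx\n", (unsigned long long)high,
>          (unsigned long long)highBE);
>   printArray((unsigned char*)&highBE);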
>
>   FLBA lowBEFlba((uint8_t*)&lowBE);
>   FLBA highBEFlba((uint8_t*)&highBE);
>
>   // compare. Uses CompareHelper<FLBAType, is_signed=true>
>   // This fails but should return true b/c 1234 < 0x8000.
>   ASSERT_TRUE(comparator->Compare(lowBEFlba, highBEFlba));
> }
>
> }  // namespace test
> }  // namespace parquet
>

The output from running the test:

> [==========] Running 2 tests from 1 test suite.
> [----------] Global test environment set-up.
> [----------] 2 tests from Comparison
> [ RUN      ] Comparison.SignedFLBA_error1
> low 0x4d2 lowBE 0xd204000000000000
> [0] 0x0 [1] 0x0 [2] 0x0 [3] 0x0 [4] 0x0 [5] 0x0 [6] 0x4 [7] 0xd2
> high 0x8000 highBE 0x80000000000000
> [0] 0x0 [1] 0x0 [2] 0x0 [3] 0x0 [4] 0x0 [5] 0x0 [6] 0x80 [7] 0x0
> /Users/kbhmr/Software/arrow/arrow-master/cpp/src/parquet/short_statistics_test.cc:89: Failure
> Value of: comparator->Compare(lowBEFlba, highBEFlba)
>   Actual: false
> Expected: true
> [  FAILED  ] Comparison.SignedFLBA_error1 (0 ms)
>
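
The byte dumps in the output can be reproduced without Arrow, which helps locate where the two values first differ. This is only a sketch: EncodeBigEndian and FirstDifferingByte are hypothetical helpers written for illustration, and EncodeBigEndian builds the bytes by shifting so that the result does not depend on the host's byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: writes v into out[0..7] with the most significant
// byte first, regardless of the host's endianness.
void EncodeBigEndian(int64_t v, uint8_t out[8]) {
  assert(out != nullptr);
  const uint64_t u = static_cast<uint64_t>(v);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(u >> (8 * (7 - i)));
  }
}

// Hypothetical helper: returns the index of the first byte at which a and b
// differ, or len if all len bytes are equal.
std::size_t FirstDifferingByte(const uint8_t* a, const uint8_t* b,
                               std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    if (a[i] != b[i]) return i;
  }
  return len;
}
```

For 1234 and 0x8000 the first differing byte is at index 6, where the values are 0x04 and 0x80. Read as signed 8-bit values these are 4 and -128, so a comparison that treats every byte as signed orders 0x8000 before 1234, which matches the failing assertion.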
