The PyArrow version is 3.0.0.

 

Naively, I would expect the min and max statistics for each row group to reflect the values actually present in that row group, not just the min and max of the dictionary.
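To make the expectation concrete, here is a minimal pure-Python sketch of the two ways the statistics could be computed. It does not use PyArrow and is not how the Parquet writer is implemented; the function names dictionary_wide_stats and referenced_value_stats are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def dictionary_wide_stats(dictionary: Sequence[str]) -> Tuple[str, str]:
    """Min/max over every dictionary entry, whether or not it is referenced."""
    return min(dictionary), max(dictionary)


def referenced_value_stats(dictionary: Sequence[str],
                           indices: Sequence[int]) -> Tuple[str, str]:
    """Min/max over only the values that the indices actually point to."""
    values = [dictionary[i] for i in indices]
    return min(values), max(values)


# Same shape as the toy example: one shared dictionary, two row groups.
dictionary: List[str] = ["A", "B"]
row_groups: List[List[int]] = [100 * [0], 100 * [1]]

for number, indices in enumerate(row_groups):
    print(number,
          dictionary_wide_stats(dictionary),
          referenced_value_stats(dictionary, indices))
# 0 ('A', 'B') ('A', 'A')
# 1 ('A', 'B') ('B', 'B')
```

The first tuple on each line matches what the file reports today; the second is what I would expect to see.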

 

I looked at the Parquet spec, which seems to support this reading: it describes the statistics as applying to the logical type of the column. I may be misunderstanding, though.

 

This is just a toy example, of course. The real data I'm working with is considerably larger and sorted on the affected column, so being able to use the statistics for predicate pushdown would be ideal.
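The effect on predicate pushdown can be sketched in a few lines of plain Python. This only illustrates the general idea of skipping row groups by min/max range; it is not how any particular reader implements it, and row_groups_to_read is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def row_groups_to_read(stats: Sequence[Tuple[str, str]], value: str) -> List[int]:
    """Return the row groups whose [min, max] range could contain `value`."""
    return [i for i, (low, high) in enumerate(stats) if low <= value <= high]


# Statistics as reported for the toy file: every row group spans the dictionary.
observed = [("A", "B"), ("A", "B")]
# Statistics that reflect only the values stored in each row group.
expected = [("A", "A"), ("B", "B")]

print(row_groups_to_read(observed, "B"))  # [0, 1]  nothing can be skipped
print(row_groups_to_read(expected, "B"))  # [1]     row group 0 is skipped
```

With dictionary-wide statistics an equality filter can never rule out a row group, no matter how well the data is sorted.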

 

If pyarrow.parquet.write_table is not the preferred way to write Parquet files from Arrow data and there is a more suitable method, I'd appreciate a pointer to it. I'd also appreciate any suggestions for a workaround in the meantime.

 

Thank you,

-Dan Nugent

 

>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[

  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      ...
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[

  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      ...
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1
    ]
]

