arrow-issues mailing list archives

From "Paul Bormans (Jira)" <>
Subject [jira] [Created] (ARROW-12644) Can't read from parquet partitioned by date/time (Spark)
Date Tue, 04 May 2021 09:32:00 GMT
Paul Bormans created ARROW-12644:

             Summary: Can't read from parquet partitioned by date/time (Spark)
                 Key: ARROW-12644
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Paul Bormans

I'm using Spark (3.1.1) to write a dataframe to a parquet dataset that is
partitioned by a timestamp field.

The relevant Spark code:
// code placeholder

This gives a directory structure like the following:
/tip/Date=2021-05-04 00%3A00%3A00
/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00
/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00/part-00000-8846eb80-a369-43f6-a715-fec9cf1adf95.c000.snappy.parquet

Notice that the ':' character is percent-encoded (URL-style, as %3A), presumably because a literal ':' would violate file system path rules.
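
To make the encoding concrete, here is a minimal Python sketch of the round trip using only the standard library. It is not Spark's actual escaping routine; the `safe=" "` argument is an assumption chosen so that the output matches the folder names listed above, where the space is left unescaped.

```python
from urllib.parse import quote, unquote

# Partition value as it exists in the dataframe, before it becomes a folder name.
raw_value = "2021-05-04 00:00:00"

# Assumption: keep the space unescaped (safe=" ") to mimic the listing above.
# This only illustrates percent-encoding; it is not Spark's own escaping code.
encoded = quote(raw_value, safe=" ")

# Percent-decoding restores the original value, including the ':' characters.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

The encoded form is what ends up in the directory name, so any reader that wants the original timestamp has to apply the decoding step at some point.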

When I try to open this dataset using delta-rs, which uses Arrow under the hood,
an error is raised while parsing the Date (folder) value.
pyarrow.lib.ArrowInvalid: error parsing '2021-05-03 00%3A00%3A00' as scalar of type timestamp[ns]
It seems this error is raised in the following call chain:
ScalarParseImpl => ParseValue => StringConverter<TimestampType>::Convert => ParseTimestampISO8601

The parse method mentioned above supports the following formats:
static inline bool ParseTimestampISO8601(const char* s, size_t length,
                                         TimeUnit::type unit,
                                         TimestampType::c_type* out) {
  using seconds_type = std::chrono::duration<TimestampType::c_type>;

  // We allow the following formats for all units:
  // - "YYYY-MM-DD"
  // - "YYYY-MM-DD[ T]hhZ?"
  // - "YYYY-MM-DD[ T]hh:mmZ?"
  // - "YYYY-MM-DD[ T]hh:mm:ssZ?"

However, it does not seem to percent-decode (URL-decode) the value before parsing it.
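
The effect can be reproduced outside Arrow with a small Python sketch. The `strptime` format string is only a stand-in for the "YYYY-MM-DD[ T]hh:mm:ss" form listed above, and `parse_partition_timestamp` is a hypothetical helper, not an Arrow function. Parsing the raw folder value from the error message fails, while parsing the decoded value succeeds.

```python
from datetime import datetime
from urllib.parse import unquote

# Stand-in for the "YYYY-MM-DD[ T]hh:mm:ss" format accepted by the parser.
FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_partition_timestamp(value, decode):
    """Hypothetical helper: parse a folder value, optionally decoding it first."""
    text = unquote(value) if decode else value
    return datetime.strptime(text, FORMAT)


# Folder value exactly as it appears in the error message.
folder_value = "2021-05-03 00%3A00%3A00"

try:
    parse_partition_timestamp(folder_value, decode=False)
    raw_parse_ok = True
except ValueError:
    # "%3A" is not a valid time separator, so strict parsing rejects it.
    raw_parse_ok = False

parsed = parse_partition_timestamp(folder_value, decode=True)
print(raw_parse_ok, parsed)
```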

Questions we have:
 * Should Arrow support timestamp fields when they are used as partition fields?
 * If so, where should the decoding happen?
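
One possible answer to the second question, sketched in Python under the assumption that decoding happens while the key=value directory names are split, before any type-specific parsing. The function name `split_hive_partitions` is hypothetical and does not correspond to an existing Arrow or delta-rs API.

```python
from urllib.parse import unquote


def split_hive_partitions(path):
    """Return {key: decoded value} for each key=value segment of a path."""
    partitions = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # skip segments that are not of the form key=value
            partitions[key] = unquote(value)
    return partitions


example_path = (
    "/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00/"
    "part-00000-8846eb80-a369-43f6-a715-fec9cf1adf95.c000.snappy.parquet"
)
partitions = split_hive_partitions(example_path)
print(partitions)
```

Decoding at this stage would keep the timestamp parser unchanged, since it would only ever see values such as "2021-05-04 00:00:00". Whether that is the right layer in Arrow is exactly the open question.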



