arrow-dev mailing list archives

From "Miki Tebeka (JIRA)" <>
Subject [jira] [Commented] (ARROW-539) [Python] Support reading Parquet datasets with standard partition directory schemes
Date Mon, 20 Mar 2017 08:36:41 GMT


Miki Tebeka commented on ARROW-539:

Moving an email conversation here, where it belongs.

[~tebeka] said:
I'm working on ARROW-539 (see

The code is almost done (sans testing). The issue I'm facing is in _add_parts (line 246).
The first for loop (line 251) adds Column objects while the second for loop (line 255) adds
Array objects. This causes Table.from_arrays to complain. I've looked for a way to convert
an Array to a Column (or the other way around) and didn't find anything obvious. I can of
course convert to a Python list and back, but this is wasteful.

Any pointers on how to do this?

(Later I'll convert the StringArray to a Dictionary as you suggested, starting simple ...)

And [~wesmckinn] answered:
Looks like _schema_from_arrays must be refactored to have the type
check inside the for loop like

Unfortunately, I don't think this is going to work (for performance reasons):

arrays.append(from_pylist([parts[name]] * size))

I think you're going to need to create a function (in C++) like

type = int32()
arr = array_from_constant(type, 0)

You can then use these integers to make a DictionaryArray using
something like DictionaryArray.from_arrays

Doesn't look like that handles Arrow arrays as inputs. Probably this
deserves a new API that looks like
{{DictionaryArray.from_indices(indices, type)}}.
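
To make the idea concrete, here is a minimal pure-Python sketch of the layout under discussion: a partition column stored as a constant run of integer indices plus a one-entry dictionary. It uses the standard library `array` module as a stand-in for an Arrow int32 array. The helper `array_from_constant` and the class `DictEncoded` with its `from_indices` method are hypothetical; they only mirror the names proposed above and are not existing pyarrow functions. As a simplification, the sketch passes the dictionary values directly rather than a type object.

```python
from array import array


def array_from_constant(value, size):
    """Hypothetical stand-in: a compact integer buffer with `value` repeated `size` times."""
    # Type code "i" is a signed C int; repetition avoids building a Python list of objects.
    return array("i", [value]) * size


class DictEncoded:
    """Toy dictionary-encoded column: integer indices into a small list of values."""

    def __init__(self, indices, dictionary):
        self.indices = indices
        self.dictionary = dictionary

    @classmethod
    def from_indices(cls, indices, dictionary):
        # Reject indices that do not refer to an existing dictionary entry.
        if any(i < 0 or i >= len(dictionary) for i in indices):
            raise IndexError("index out of range for dictionary")
        return cls(indices, dictionary)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, pos):
        # Decode one element by looking its index up in the dictionary.
        return self.dictionary[self.indices[pos]]


# A single partition directory contributes one value for every row it holds
# (the value "2017" and the row count 5 are made up for illustration).
indices = array_from_constant(0, 5)
column = DictEncoded.from_indices(indices, ["2017"])
```

The point of the design is that the per-row cost is one small integer, while the partition value itself is stored once in the dictionary.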

New JIRA for this

> [Python] Support reading Parquet datasets with standard partition directory schemes
> -----------------------------------------------------------------------------------
>                 Key: ARROW-539
>                 URL:
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Miki Tebeka
>         Attachments: partitioned_parquet.tar.gz
> Currently, we only support multi-file directories with a flat structure (non-partitioned).
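
For context, a partitioned layout usually encodes column values in directory names. The sketch below assumes the common key=value directory convention (for example `year=2017/month=3/`), which the issue text does not spell out. The function name `parse_partition_path` and the example path are hypothetical and are not part of the pyarrow API.

```python
def parse_partition_path(path):
    """Return (key, value) pairs from the directory segments of a dataset-relative path."""
    segments = path.strip("/").split("/")
    parts = []
    # The last segment is assumed to be the data file, so only directories are inspected.
    for segment in segments[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts.append((key, value))
    return parts


example = parse_partition_path("year=2017/month=3/part-0.parquet")
```

Each pair would then become a constant column for every row read from that file.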
