arrow-user mailing list archives

From Ted Gooch <tgo...@netflix.com>
Subject Re: [Python][Dataset] API Batched file reads with multiple files schemas
Date Wed, 11 Nov 2020 18:09:53 GMT
That's right.  For more context, I'm building out the Parquet read path for
iceberg <https://iceberg.apache.org/>, and two of its main features are
working against us here: 1) the table data does not depend on the physical
layout on the filesystem, e.g. a folder may hold many files, only some of
which belong to the current table state; 2) schema evolution: files may
have different schemas, and we don't know ahead of time which version of
the schema a given file will have.

Here is the current PR if you are interested in seeing the full context
in code:
https://github.com/apache/iceberg/pull/1727
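
To make the schema-evolution piece concrete, here is a hypothetical
example (the field names and types are illustrative only, not from a real
Iceberg table):

import pyarrow as pa

# An older data file, written before the schema evolved.
file_schema_v1 = pa.schema([("id", pa.int32())])

# A newer data file, written after "category" was added and "id" widened.
file_schema_v2 = pa.schema([("id", pa.int64()), ("category", pa.string())])

# Both files should be readable as the current table schema.
target_schema = pa.schema([("id", pa.int64()), ("category", pa.string())])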

On Wed, Nov 11, 2020 at 4:25 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi Ted,
>
> The eventual goal is certainly to be able to deal with this kind of schema
> "normalization" to a target schema, but currently only a limited set of
> schema evolutions is supported: a different column order, missing columns
> (filled with nulls), and upcasting null to any type. Other type casts and
> column renames are not supported yet.
>
> For how to do this (within the limits of the normalizations that are
> supported), there are two ways; a short sketch of both follows the list:
>
> - Using the dataset "factory" function to let pyarrow discover the dataset
> (crawl the filesystem to find data files, infer the schema). By default,
> this `ds.dataset(..)` function "infers" the schema by reading it from the
> first file it encounters. In C++ there is actually an option to check all
> files and create a common schema, but this is not yet exposed in Python (
> https://issues.apache.org/jira/browse/ARROW-8221). There is also the
> option to pass a schema manually, if you know it beforehand.
> - Using the lower-level `ds.FileSystemDataset` interface, as you are doing
> below. In this case you need to specify all the data file paths manually,
> as well as the final schema of the dataset. So it is specifically meant
> for the case where you already know this information and want to avoid the
> overhead of inferring it with the `ds.dataset()` factory function mentioned
> above.
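>
> A minimal sketch of both routes (`table_dir`, `file_paths`, and
> `known_schema` are placeholder names, not from your code):
>
> import pyarrow.dataset as ds
> from pyarrow import fs
>
> # (1) Factory: discover the files under a directory; passing schema=
> #     overrides the default "first file's schema" inference.
> dataset = ds.dataset(table_dir, format="parquet", schema=known_schema)
>
> # (2) Explicit: no discovery; you supply the paths and final schema.
> dataset = ds.FileSystemDataset.from_paths(
>     file_paths,
>     schema=known_schema,
>     format=ds.ParquetFileFormat(),
>     filesystem=fs.LocalFileSystem())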
>
> So from reading your mail, it seems you need the following features that
> are not yet implemented:
>
> - The ability to tell the `ds.dataset(..)` function to infer the schema
> from all files (ARROW-8221)
> - More advanced schema normalization routines (type casting, column
> renaming)
>
> Does that sound correct?
>
> Joris
>
>
> On Tue, 10 Nov 2020 at 18:31, Ted Gooch <tgooch@netflix.com> wrote:
>
>> I'm currently leveraging the Datasets API to read Parquet files and
>> running into a bit of an issue that I can't figure out. I have a set of
>> files and a target schema. Each file in the set may have the same schema
>> as the target or a different one, but if the schema is different, it can
>> be coerced from the source schema into the target by rearranging column
>> order, changing column names, adding null columns, and/or a limited set
>> of type upcasting (e.g. int32 -> int64).
>>
>> As far as I can tell, there doesn't seem to be a way to do this with the
>> Datasets API if you don't have a file schema ahead of time.  I had been
>> using the following:
>>
>>
>> arrow_dataset = ds.FileSystemDataset.from_paths(
>>     [self._input.location()],
>>     schema=self._arrow_file.schema_arrow,
>>     format=ds.ParquetFileFormat(),
>>     filesystem=fs.LocalFileSystem())
>>
>> But in this case, I have to fetch the schema myself and read a single
>> file at a time.
>>
>
> Note that you can pass a list of files, so you don't need to read a single
> file at a time.
>

I have tried this, but if the schemas don't line up it will error out.
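
In the meantime, a rough sketch of a possible workaround is to read each
file as a plain Table and coerce it to the target schema before combining
(normalize and target_schema are illustrative names, not pyarrow API, and
a column-rename mapping is omitted for brevity):

import pyarrow as pa
import pyarrow.parquet as pq

def normalize(table, target):
    # Rebuild each target column: cast it where the file has the
    # column, fill with nulls where it is missing.
    columns = []
    for field in target:
        if field.name in table.column_names:
            columns.append(table.column(field.name).cast(field.type))
        else:
            columns.append(pa.nulls(len(table), type=field.type))
    return pa.Table.from_arrays(columns, schema=target)

tables = [normalize(pq.read_table(path), target_schema) for path in paths]
combined = pa.concat_tables(tables)

Though that gives up the batched reads and memory management mentioned
below, which is the part I'd like to keep.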


>
>
>> I was hoping to get more mileage out of the Datasets API by letting it
>> batch up and manage memory for the reads. Is there any way I can get
>> around this?
>>
>> thanks!
>> Ted Gooch
>>
>
