arrow-user mailing list archives

From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: [Python][Dataset] API Batched file reads with multiple files schemas
Date Wed, 11 Nov 2020 12:25:12 GMT
Hi Ted,

The eventual goal is certainly to be able to deal with this kind of schema
"normalization" to a target schema, but at the moment only a limited set of
schema evolutions is supported: a different column order, missing columns
(filled with nulls), and upcasting null to any type. Other type casts or
renaming columns are not yet supported.

For how to do this (within the limits of the normalizations that are
supported), there are two ways (a short sketch of both follows the list):

- Using the dataset "factory" function to let pyarrow discover the dataset
(crawl the filesystem to find data files, infer the schema). By default,
this `ds.dataset(..)` function "infers" the schema by reading it from the
first file it encounters. In C++ there is an option to check all files and
create a common schema, but this is not yet exposed in Python
(https://issues.apache.org/jira/browse/ARROW-8221). There is also the
option to pass a schema manually, if you know it beforehand.
- Using the lower-level `ds.FileSystemDataset` interface, as you are doing
below. In this case you need to specify all the data file paths manually,
as well as the final schema of the dataset. This is specifically meant for
the case where you already know this information and want to avoid the
overhead of inferring it with the `ds.dataset()` factory function mentioned
above.
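
To make that concrete, here is a minimal sketch of both approaches. The file
names ("a.parquet", "b.parquet") and the target schema are made up for
illustration; only the `ds.dataset(..)` and `ds.FileSystemDataset.from_paths(..)`
calls themselves are what the list above refers to.

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    # Illustrative target schema; in practice this is whatever you know
    # the normalized schema of your dataset to be.
    target_schema = pa.schema([
        ("id", pa.int64()),
        ("value", pa.float64()),
    ])

    # 1. Factory function: let pyarrow do the discovery, but pass the target
    #    schema explicitly instead of inferring it from the first file.
    dataset = ds.dataset(
        ["a.parquet", "b.parquet"],
        schema=target_schema,
        format="parquet",
        filesystem=fs.LocalFileSystem(),
    )

    # 2. Lower-level FileSystemDataset: you supply the file paths and the
    #    final schema yourself, so there is no discovery/inference overhead.
    dataset = ds.FileSystemDataset.from_paths(
        ["a.parquet", "b.parquet"],
        schema=target_schema,
        format=ds.ParquetFileFormat(),
        filesystem=fs.LocalFileSystem(),
    )

    # Columns missing from an individual file come back filled with nulls.
    table = dataset.to_table()

Either way you end up with the same kind of Dataset object for scanning; the
difference is only in how the paths and schema are obtained.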

So from reading your mail, it seems you need the following features that
are currently not yet implemented:

- The ability to specify in the `ds.dataset(..)` function that the schema
should be inferred from all files (ARROW-8221)
- More advanced schema normalization routines (type casting, column
renaming)

Does that sound correct?

Joris


On Tue, 10 Nov 2020 at 18:31, Ted Gooch <tgooch@netflix.com> wrote:

> I'm currently leveraging the Datasets API to read parquet files and
> running into a bit of an issue that I can't figure out. I have a set of
> files and a target schema. Each file in the set may have the same schema as
> the target or a different one; if the schema differs, it can be coerced from
> the source schema into the target by rearranging column order, changing
> column names, adding null columns and/or a limited set of type upcasting
> (e.g. int32 -> int64).
>
> As far as I can tell, there doesn't seem to be a way to do this with the
> Datasets API if you don't have a file schema ahead of time.  I had been
> using the following:
>
>
>     arrow_dataset = ds.FileSystemDataset.from_paths(
>         [self._input.location()],
>         schema=self._arrow_file.schema_arrow,
>         format=ds.ParquetFileFormat(),
>         filesystem=fs.LocalFileSystem())
>
> But in this case, I have to fetch the schema, and read a single file at a
> time.
>

Note that you can pass a list of files, so you don't need to read a single
file at a time.
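
For example (a hypothetical sketch; the file names and schema are again made
up), you can hand all the paths to `from_paths` at once and let the dataset
layer manage the batching of the reads:

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    target_schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

    dataset = ds.FileSystemDataset.from_paths(
        ["part-0.parquet", "part-1.parquet", "part-2.parquet"],
        schema=target_schema,
        format=ds.ParquetFileFormat(),
        filesystem=fs.LocalFileSystem(),
    )

    # Iterate over record batches instead of materializing one big table
    # (assumes a pyarrow version where Dataset.to_batches() is available).
    for batch in dataset.to_batches():
        ...  # process one pyarrow.RecordBatch at a time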


> I was hoping to be able to get more mileage from the Datasets API batching
> up and managing the memory for the reads. Is there any way that I can get
> around this?
>
> thanks!
> Ted Gooch
>
