arrow-user mailing list archives
From Ted Gooch <tgo...@netflix.com>
Subject [Python][Dataset API] Batched file reads with multiple file schemas
Date Tue, 10 Nov 2020 17:29:26 GMT
I'm currently leveraging the Datasets API to read parquet files and
running into an issue that I can't figure out. I have a set of files and a
target schema. Each file in the set may have the same schema as the target
or a different one, but if it differs, it can be coerced into the target
schema by rearranging column order, changing column names, adding null
columns, and/or a limited set of type upcasting (e.g. int32 -> int64).
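
Roughly, the per-batch coercion I have in mind looks like the sketch below
(coerce_batch and the rename mapping are my own illustration, not an
existing Arrow API):

import pyarrow as pa

def coerce_batch(batch: pa.RecordBatch, target: pa.Schema,
                 rename: dict = None) -> pa.RecordBatch:
    """Reorder/rename columns, fill missing ones with nulls, and upcast
    to match `target`. Purely illustrative, not part of the Datasets API."""
    rename = rename or {}
    arrays = []
    for field in target:
        # Look for the source column under the target name or a renamed alias.
        candidates = [field.name] + [s for s, t in rename.items() if t == field.name]
        column = None
        for name in candidates:
            if name in batch.schema.names:
                column = batch.column(batch.schema.names.index(name))
                break
        if column is None:
            # Column missing in this file: add a null column of the target type.
            column = pa.nulls(batch.num_rows, type=field.type)
        elif column.type != field.type:
            # Limited upcasting, e.g. int32 -> int64.
            column = column.cast(field.type)
        arrays.append(column)
    return pa.RecordBatch.from_arrays(arrays, schema=target)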

As far as I can tell, there doesn't seem to be a way to do this with the
Datasets API unless you already have each file's schema ahead of time. I had
been using the following:

import pyarrow.dataset as ds
from pyarrow import fs

arrow_dataset = ds.FileSystemDataset.from_paths(
    [self._input.location()],
    schema=self._arrow_file.schema_arrow,
    format=ds.ParquetFileFormat(),
    filesystem=fs.LocalFileSystem())

But in this case, I have to fetch each file's schema and read a single file
at a time. I was hoping to get more mileage out of the Datasets API by
having it batch up the reads and manage the memory for me. Is there any way
I can get around this?
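
For context, generalizing the snippet above to a set of files, what I'm
doing today looks roughly like this (read_coerced is my own illustrative
helper, coerce_batch is the sketch from earlier, and the batch-iteration
method may differ across pyarrow versions):

import pyarrow.dataset as ds
from pyarrow import fs

def read_coerced(paths, file_schemas, target_schema):
    """Read each file with its own schema and yield batches coerced to
    target_schema. The per-file schemas still have to be fetched up front,
    which is the part I'd like the Datasets API to handle for me."""
    filesystem = fs.LocalFileSystem()
    for path, file_schema in zip(paths, file_schemas):
        single_file = ds.FileSystemDataset.from_paths(
            [path],
            schema=file_schema,
            format=ds.ParquetFileFormat(),
            filesystem=filesystem)
        for batch in single_file.to_batches():
            yield coerce_batch(batch, target_schema)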

thanks!
Ted Gooch
