The eventual goal is certainly to be able to deal with this kind of schema "normalization" to a target schema, but currently only a limited set of schema evolutions are supported: different column order, missing columns (filled with nulls), upcasting null to any type are currently supported. But eg any other type casting or renaming columns not yet.
For how to do this (but so within the limits of what kind of normalizations are supported), there are two ways:
- Using the dataset "factory" function to let pyarrow discover the dataset (crawl the filesystem to find data files, infer the schema). By default, this `ds.dataset(..)` function "infers" the schema by reading it from the first file it encounters. In C++ there is actually the option to check all files to create a common schema, but this is not yet exposed in Python (https://issues.apache.org/jira/browse/ARROW-8221
). Then, there is also the option pass a schema manually, if you know this beforehand.
- Using the lower level `ds.FileSystemDataset` interface as you are using below. In this case you need to specify all the data file paths manually, as well as the final schema of the dataset. So this is specifically meant for the case where you know this information already, and want to avoid the overhead of inferring it with the `ds.dataset()` factory function mentioned above.
So from reading your mail, it seems you need the following features that are currently not yet implemented:
- The ability to specify in the `ds.dataset(..)` function to infer the schema from all files (ARROW-8221))
- More advanced schema normalization routines (type casting, column renaming)
Does that sound correct?
I'm currently leveraging the Datasets API to read parquet files and running into a bit of an issue that I can't figure out. I have a set of files and a target schema. Each file in the set may have the same or a different schema than the target, but if the schema is different, it can be coerced into the target from the source schema, by rearranging column order, changing column names, adding null columns and/or a limited set of type upcasting(e.g int32->int64).
As far as I can tell, there doesn't seem to be a way to do this with the Datasets API if you don't have a file schema ahead of time. I had been using the following:
arrow_dataset = ds.FileSystemDataset.from_paths([self._input.location()],
But in this case, I have to fetch the schema, and read a single-file at a time.
Note that you can pass a list of files, so you don't need to read a single file at a time.
I was hoping to be able to get more mileage from the Datasets API batching up and managing the memory for the reads. Is there any way that I can get around this?