arrow-user mailing list archives

From: Wes McKinney <wesmck...@gmail.com>
Subject: Re: Implicit cast in PyArrow JSON schema inference (i.e. Integer -> String)
Date: Mon, 15 Mar 2021 23:39:43 GMT
You can pass an explicit schema in the ParseOptions, but I don't know
if it will see "string" in the schema and promote integers (if not,
could you open a Jira about this?). Otherwise I'm not sure that
automatic "loose" type inference is a good default behavior (though
possibly something that could be opted into).
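
For illustration, here is a minimal sketch of the explicit-schema route
combined with a manual pre-pass that does the promotion itself. It is not a
confirmed PyArrow behavior: `stringify_column` and
`read_json_with_string_column` are hypothetical helper names, and the
pre-pass is there precisely because it is unclear whether the reader will
promote integers to a declared string type on its own. The second helper
assumes pyarrow is installed; the first uses only the standard library.

```python
import io
import json


def stringify_column(ndjson_text, column):
    """Rewrite newline-delimited JSON so `column` holds only strings or null.

    Hypothetical helper: non-string values are re-serialized as JSON text,
    so 1 becomes "1" and {"a": 1} becomes '{"a": 1}'.
    """
    out_lines = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(column)
        if value is not None and not isinstance(value, str):
            record[column] = json.dumps(value)
        out_lines.append(json.dumps(record))
    return "\n".join(out_lines) + "\n"


def read_json_with_string_column(ndjson_text, column):
    """Parse the cleaned text with an explicit schema (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.json as pj

    # Declare the column as string instead of relying on inference.
    schema = pa.schema([(column, pa.string())])
    options = pj.ParseOptions(explicit_schema=schema)
    cleaned = stringify_column(ndjson_text, column).encode("utf-8")
    return pj.read_json(io.BytesIO(cleaned), parse_options=options)
```

Calling `read_json_with_string_column('{"col1": "1"}\n{"col1": 1}\n', "col1")`
should then see only string values in `col1`. The pre-pass parses every row
in Python, so it gives up much of the speed of the native reader and is
best treated as a stopgap.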

On Mon, Mar 15, 2021 at 4:04 PM Pavol Knapek <knapek.pavol@gmail.com> wrote:
>
> Hi guys,
>
> I'm trying to use the `pyarrow.json.read_json('input.json')` command to load a JSON
> file, infer the schema, and return a new `pyarrow.Table` instance.
>
> So, given an input:
> {"col1": "1"}
> {"col1": 1}
>
> I'd expect the output `pyarrow.Table` to have a schema {col1: string}, with an implicit
> cast of Integer(s) to String(s).
>
> (Apache Spark, for example, infers the schema in a similar way.)
>
> But instead, an exception gets raised:
> ArrowInvalid: JSON parse error: Column(/col1) changed from string to number in row 1
>
> Is there some way to tell the inference process that it can safely cast all types to a
> super-type where possible (e.g. Integer -> String, Object -> String, Anything -> String, ...)?
>
> Thanks
>
> Best
> --
> Pavol Knapek
> mobile CA: +1 604 314 6164
> mobile CZ: +420 774 293 243
> mobile SK: +421 917 557 263
> e-mail: knapek.pavol@gmail.com
> http://linkedin.com/in/pavolknapek
