spark-dev mailing list archives

From Yin Huai <yh...@databricks.com>
Subject Re: Dataframe nested schema inference from Json without type conflicts
Date Thu, 01 Oct 2015 22:54:16 GMT
Hi Ewan,

For your use case, you only need schema inference to pick up the
structure of your data, right? That is, you want Spark SQL to infer the
types of complex values like arrays and structs, but keep the types of
primitive values as strings.
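
A rough illustration of that idea, written as a plain-Python sketch rather
than Spark's actual inference code. The `infer_shape` helper and the two
sample records are hypothetical, made up only to show that once primitives
are always reported as strings, two batches with differently typed values
produce the same schema:

```python
import json


def infer_shape(value):
    """Describe the structure of a parsed JSON value.

    Structs (dicts) and arrays (lists) are kept, while every primitive
    (number, boolean, string, null) is reported as "string".
    """
    if isinstance(value, dict):
        return {key: infer_shape(child) for key, child in value.items()}
    if isinstance(value, list):
        # Simplification: describe an array by its first element only;
        # an empty array falls back to an array of strings.
        return [infer_shape(value[0])] if value else ["string"]
    return "string"


# Hypothetical records from two streaming batches: the first carries
# integers where the second carries strings.
batch_one = json.loads('{"id": 42, "tags": ["a"], "device": {"os": 9}}')
batch_two = json.loads('{"id": "42b", "tags": ["b"], "device": {"os": "9.1"}}')

shape_one = infer_shape(batch_one)
shape_two = infer_shape(batch_two)
print(shape_one == shape_two)
```

Casting to the desired types would then happen later in the pipeline, once
the structure has been picked up.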

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.leith@realitymine.com>
wrote:

> We could, but if a client sends some unexpected records in the schema
> (which happens more than I'd like; our schema seems to constantly evolve),
> it's fantastic how Spark picks up on that data and includes it.
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
> Thanks,
>
> Ewan
>
>
> ------ Original message------
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject: *Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
> You can pass the schema into json directly, can't you?
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.leith@realitymine.com>
> wrote:
>
>> Hi all,
>>
>>
>>
>> We really like the ability to infer a schema from JSON contained in an
>> RDD, but when we’re using Spark Streaming on small batches of data, we
>> sometimes find that Spark infers a more specific type than it should.
>> For example, if the JSON in a small batch contains only integer values
>> for a String field, Spark will class the field as an Integer type in one
>> Streaming batch, then as a String in the next.
>>
>>
>>
>> Instead, we’d rather read every value as a String type, then handle any
>> casting to a desired type later in the process.
>>
>>
>>
>> I can’t see any simple way to avoid this at the moment, but we could
>> add the functionality in the JacksonParser.scala file, probably in
>> convertField.
>>
>>
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>>
>>
>>
>> Does anyone know an easier and cleaner way to do this?
>>
>>
>>
>> Thanks,
>>
>> Ewan
>>
>
>
