spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kant kodali <kanth...@gmail.com>
Subject Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)
Date Thu, 17 Nov 2016 20:24:09 GMT
Can we have a JSONType for Spark SQL?

On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande <nathanlande@gmail.com> wrote:

> If you are dealing with a bunch of different schemas in 1 field, figuring
> out a strategy to deal with that will depend on your data and does not
> really have anything to do with spark since mapping your JSON payloads to
> tractable data structures will depend on business logic.
>
> The strategy of pulling out a blob into its on rdd and feeding it into the
> JSON loader should work for any data source once you have your data
> strategy figured out.
>
> On Wed, Nov 16, 2016 at 4:39 PM, kant kodali <kanth909@gmail.com> wrote:
>
>> 1. I have a Cassandra Table where one of the columns is blob. And this
>> blob contains a JSON encoded String however not all the blob's across the
>> Cassandra table for that column are same (some blobs have difference json's
>> than others) so In that case what is the best way to approach it? Do we
>> need to put /group all the JSON Blobs that have same structure (same keys)
>> into each individual data frame? For example, say if I have 5 json blobs
>> that have same structure and another 3 JSON blobs that belongs to some
>> other structure In this case do I need to create two data frames? (Attached
>> is a screen shot of 2 rows of how my json looks like)
>> 2. In my case, toJSON on RDD doesn't seem to help a lot. Attached a
>> screen shot. Looks like I got the same data frame as my original one.
>>
>> Thanks much for these examples.
>>
>>
>>
>> On Wed, Nov 16, 2016 at 2:54 PM, Nathan Lande <nathanlande@gmail.com>
>> wrote:
>>
>>> I'm looking forward to 2.1 but, in the meantime, you can pull out the
>>> specific column into an RDD of JSON objects, pass this RDD into the
>>> read.json() and then join the results back onto your initial DF.
>>>
>>> Here is an example of what we do to unpack headers from Avro log data:
>>>
>>> def jsonLoad(path):
>>>     #
>>>     #load in the df
>>>     raw = (sqlContext.read
>>>             .format('com.databricks.spark.avro')
>>>             .load(path)
>>>         )
>>>     #
>>>     #define json blob, add primary key elements (hi and lo)
>>>     #
>>>     JSONBlob = concat(
>>>         lit('{'),
>>>         concat(lit('"lo":'), col('header.eventId.lo').cast('string'),
>>> lit(',')),
>>>         concat(lit('"hi":'), col('header.eventId.hi').cast('string'),
>>> lit(',')),
>>>         concat(lit('"response":'), decode('requestResponse.response',
>>> 'UTF-8')),
>>>         lit('}')
>>>     )
>>>     #
>>>     #extract the JSON blob as a string
>>>     rawJSONString = raw.select(JSONBlob).rdd.map(lambda x: str(x[0]))
>>>     #
>>>     #transform the JSON string into a DF struct object
>>>     structuredJSON = sqlContext.read.json(rawJSONString)
>>>     #
>>>     #join the structured JSON back onto the initial DF using the hi and
>>> lo join keys
>>>     final = (raw.join(structuredJSON,
>>>                 ((raw['header.eventId.lo'] == structuredJSON['lo']) &
>>> (raw['header.eventId.hi'] == structuredJSON['hi'])),
>>>                 'left_outer')
>>>             .drop('hi')
>>>             .drop('lo')
>>>         )
>>>     #
>>>     #win
>>>     return final
>>>
>>> On Wed, Nov 16, 2016 at 10:50 AM, Michael Armbrust <
>>> michael@databricks.com> wrote:
>>>
>>>> On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon <gurwls223@gmail.com>
>>>> wrote:
>>>>
>>>>> Maybe it sounds like you are looking for from_json/to_json functions
>>>>> after en/decoding properly.
>>>>>
>>>>
>>>> Which are new built-in functions that will be released with Spark 2.1.
>>>>
>>>
>>>
>>
>

Mime
View raw message