From: Tecno Brain <cerebrotecnologico@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 20 Jun 2013 12:05:50 -0700
Subject: Re: Aggregating data nested into JSON documents

Never mind, I got the solution!

uberflat = FOREACH flat GENERATE g, sg,
           FLATTEN(important-data#'f1') AS f1,
           FLATTEN(important-data#'f2') AS f2;

-Jorge
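
For the archives: putting the pieces of this thread together, the whole
flow is roughly the following. This is a sketch, not a tested script:
the input path and jar name are placeholders, and the alias used to
dereference the flattened map may need adjusting for your Pig and
elephant-bird versions.

REGISTER 'elephant-bird-and-dependencies.jar';  -- placeholder name
doc = LOAD '/example.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
      AS (json:map[]);
-- one row per element of the important-data array
flat = FOREACH doc GENERATE (chararray)json#'g' AS g,
       (chararray)json#'sg' AS sg,
       FLATTEN(json#'important-data');
-- pull the scalar fields out of each flattened map
uberflat = FOREACH flat GENERATE g, sg,
           FLATTEN(important-data#'f1') AS f1,
           FLATTEN(important-data#'f2') AS f2;
DUMP uberflat;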

On Thu, Jun 20, 2013 at 11:54 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

> OK, I'll go back to my original question (although this time I know
> what tools I'm using).
>
> I am using Pig + ElephantBird.
>
> I have JSON documents with the following structure:
>
> {
>   g    : "some-group-identifier",
>   sg   : "some-subgroup-identifier",
>   j    : "some-job-identifier",
>   page : 23,
>   ... // other fields omitted
>   important-data : [
>     {
>       f1 : "abc",
>       f2 : "a",
>       f3 : "/",
>       ...
>     },
>     ...
>     {
>       f1 : "xyz",
>       f2 : "q",
>       f3 : "/",
>       ...
>     }
>   ],
>   ... // other fields omitted
> }
>
> I want Pig to GENERATE a tuple for each element of the
> "important-data" array attribute. For the example above, I would like
> to generate:
>
> ( "some-group-identifier", "some-subgroup-identifier", 23, "abc", "a", "/" )
> ( "some-group-identifier", "some-subgroup-identifier", 23, "xyz", "q", "/" )
>
> This is what I have tried:
>
> doc = LOAD '/example.json' USING
>       com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
>       AS (json:map[]);
> flat = FOREACH doc GENERATE (chararray)json#'g' AS g,
>        (chararray)json#'sg' AS sg, (long)json#'page' AS page,
>        FLATTEN( json#'important-data') ;
> DUMP flat;
>
> but that produces:
>
> ( "some-group-identifier", "some-subgroup-identifier", 23, [ f1#abc, f2#a, f3#/ ] )
> ( "some-group-identifier", "some-subgroup-identifier", 23, [ f1#xyz, f2#q, f3#/ ] )
>
> Close, but not exactly what I want.
>
> Do I need to use ProtoBuf?
>
> -Jorge
>
>
> On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>
>> Ok, I found that elephant-bird's JsonLoader cannot handle JSON
>> documents that are pretty-printed (expanding over multiple lines).
>> The entire JSON document has to be on a single line.
>>
>> After I reformatted some of the source files, now I am getting the
>> expected output.
>>
>>
>> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>
>>> I also tried:
>>>
>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>       com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>>> flat = FOREACH doc GENERATE (chararray)json#'a' AS first,
>>>        (long)json#'b' AS second ;
>>> DUMP flat;
>>>
>>> but I got no output either.
>>>
>>> Input(s):
>>> Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>>>
>>> Output(s):
>>> Successfully stored 0 records in:
>>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
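>>>
>>> A quick sanity check when a job "succeeds" but emits nothing (a
>>> sketch using only core Pig built-ins, run against the same load):
>>>
>>> counted = FOREACH (GROUP doc ALL) GENERATE COUNT_STAR(doc);
>>> DUMP counted;  -- prints 0 when the loader parsed no records at all
>>>
>>> A 0 here means the problem is in parsing (as it turned out below:
>>> the pretty-printed JSON), not in the projection.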
>>>
>>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>>
>>>> I got Pig and Hive working on a single node and I am able to run
>>>> some scripts/queries over regular text files (access log files),
>>>> with a record per line.
>>>>
>>>> Now, I want to process some JSON files.
>>>>
>>>> As mentioned before, it seems that ElephantBird would be a good
>>>> solution to read JSON files.
>>>>
>>>> I uploaded 5 files to HDFS. Each file only contains a single JSON
>>>> document. The documents are NOT on a single line, but rather
>>>> contain pretty-printed JSON expanding over multiple lines.
>>>>
>>>> I'm trying something simple, extracting two (primitive) attributes
>>>> at the top of the document:
>>>>
>>>> {
>>>>   a : "some value",
>>>>   ...
>>>>   b : 133,
>>>>   ...
>>>> }
>>>>
>>>> So, let's start with a LOAD of a single file (single JSON document):
>>>>
>>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>>       com.twitter.elephantbird.pig.load.JsonLoader();
>>>> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first,
>>>>        (long)$0#'b' AS second ;
>>>> DUMP flat;
>>>>
>>>> Apparently the job runs without problems, but I get no output. The
>>>> output I get includes this message:
>>>>
>>>> Input(s):
>>>> Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>>>>
>>>> I was expecting to get
>>>>
>>>> ( "some value", 133 )
>>>>
>>>> Any idea on what I am doing wrong?
>>>>
>>>>
>>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>
>>>>> I think you have a misconception of HBase.
>>>>>
>>>>> You don't need to actually have mutable data for it to be
>>>>> effective. The key is that you need access to specific records and
>>>>> to work on a very small subset of the data, not the complete data
>>>>> set.
>>>>>
>>>>>
>>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Yes, I have also thought about HBase or Cassandra, but my data is
>>>>> pretty much a snapshot; it does not require updates. Most of my
>>>>> aggregations will also need to be computed once and won't change
>>>>> over time, with the exception of some aggregations based on the
>>>>> last N days of data. Should I still consider HBase? I think it
>>>>> will probably be good for the aggregated data.
>>>>>
>>>>> I have no idea what sequence files are, but I will take a look.
>>>>> My raw data is stored in the cloud, not in my Hadoop cluster.
>>>>>
>>>>> I'll keep looking at Pig with ElephantBird.
>>>>> Thanks,
>>>>>
>>>>> -Jorge
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> Hi..
>>>>>>
>>>>>> Have you thought about HBase?
>>>>>>
>>>>>> I would suggest that if you're using Hive or Pig, you look at
>>>>>> taking these files and putting the JSON records into a sequence
>>>>>> file, or a set of sequence files. (Then look at HBase to help
>>>>>> index them...) 200KB is small.
>>>>>>
>>>>>> That would be the same for either Pig or Hive.
>>>>>>
>>>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's
>>>>>> pretty nice. And yes, you get each record as a row, but you can
>>>>>> always flatten them as needed.
>>>>>>
>>>>>> Hive?
>>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>>> Edward Capriolo could give you a better answer. Going from
>>>>>> memory, I don't know that there is a good Hive SerDe that would
>>>>>> write JSON, just ones that read it.
>>>>>>
>>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>>> dated and biased.
>>>>>>
>>>>>> I think you're on the right track, or at least the right train of
>>>>>> thought.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
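>>>>>>
>>>>>> On the small-files point: once each document sits on a single
>>>>>> line, even a plain load-and-store pass in Pig can repack many
>>>>>> small inputs into fewer, larger files (a sketch that relies on
>>>>>> Pig's default combining of small input splits; the paths are
>>>>>> placeholders, and sequence files would be the sturdier variant):
>>>>>>
>>>>>> raw = LOAD '/json-docs' USING TextLoader() AS (line:chararray);
>>>>>> STORE raw INTO '/json-packed';  -- default PigStorage, one document per line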
Only the column "map" would >>>>>> be a complex column, all others would be primitives. >>>>>> >>>>>> ( g, sg, j, page, f1, f2, f3 ) >>>>>> >>>>>> ( g, sg, j, page, x1, x2, x3, map ) >>>>>> >>>>>> So, for each JSON document, I would like to "create" several rows, >>>>>> but I would like to avoid the intermediate step of persisting -and >>>>>> duplicating- the "flattened" data. >>>>>> >>>>>> In order to avoid persisting the data flattened, I thought I had to >>>>>> write my own map-reduce in Java code, but discovered that others have had >>>>>> the same problem of using JSON as the source and there are somewhat >>>>>> "standard" solutions. >>>>>> >>>>>> By reading about the SerDe approach for Hive I get the impression >>>>>> that each JSON document is transformed into a single "row" of the table >>>>>> with some columns being an array, a map of other nested structures. >>>>>> a) Is there a way to break each JSON document into several "rows" for >>>>>> a Hive external table? >>>>>> b) It seems there are too many JSON SerDe libraries! Is there any of >>>>>> them considered the de-facto standard? >>>>>> >>>>>> The Pig approach seems also promising using Elephant Bird Do anybody >>>>>> has pointers to more user documentation on this project? Or is browsing >>>>>> through the examples in GitHub my only source? >>>>>> >>>>>> Thanks >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >> > --20cf307cfc407829d604df9aa2a1 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable