Subject: Re: Aggregating data nested into JSON documents
From: Tecno Brain <cerebrotecnologico@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 19 Jun 2013 14:47:23 -0700

I also tried:

doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
DUMP flat;

but I got no output either.

    Input(s):
    Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"

    Output(s):
    Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
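In case it is relevant: my current guess is that JsonLoader parses one JSON object per input line, which would explain the 0 records read from a pretty-printed file. This is an untested sketch of what I plan to try next, assuming that guess is right and that the '-nestedLoad' option is available in the elephant-bird build I have (the *-oneline.json file name is hypothetical; it would be the same document collapsed onto a single line):

-- hypothetical input: the same JSON document collapsed onto one line
doc  = LOAD '/json-pcr/pcr-000001-oneline.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
DUMP flat;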
On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
> I got Pig and Hive working on a single-node setup and I am able to run some
> scripts/queries over regular text files (access log files), with a record
> per line.
>
> Now, I want to process some JSON files.
>
> As mentioned before, it seems that ElephantBird would be a good solution
> to read JSON files.
>
> I uploaded 5 files to HDFS. Each file contains only a single JSON document.
> The documents are NOT in a single line, but rather contain pretty-printed
> JSON spanning multiple lines.
>
> I'm trying something simple, extracting two (primitive) attributes at the
> top of the document:
>
>   {
>     a : "some value",
>     ...
>     b : 133,
>     ...
>   }
>
> So, let's start with a LOAD of a single file (single JSON document):
>
> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
> doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader();
> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
> DUMP flat;
>
> Apparently the job runs without problems, but I get no output. The output
> I get includes this message:
>
>   Input(s):
>   Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>
> I was expecting to get
>
>   ( "some value", 133 )
>
> Any idea on what I am doing wrong?
>
>
> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> I think you have a misconception of HBase.
>>
>> You don't need to actually have mutable data for it to be effective.
>> The key is that you need access to specific records and work on a very
>> small subset of the data, not the complete data set.
>>
>>
>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>
>> Hi Mike,
>>
>> Yes, I have also thought about HBase or Cassandra, but my data is pretty
>> much a snapshot; it does not require updates. Most of my aggregations will
>> also need to be computed only once and won't change over time, with the
>> exception of some aggregations that are based on the last N days of data.
>> Should I still consider HBase? I think it will probably be good for the
>> aggregated data.
>>
>> I have no idea what sequence files are, but I will take a look. My raw
>> data is stored in the cloud, not in my Hadoop cluster.
>>
>> I'll keep looking at Pig with ElephantBird.
>> Thanks,
>>
>> -Jorge
>>
>>
>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>> Hi..
>>>
>>> Have you thought about HBase?
>>>
>>> I would suggest that if you're using Hive or Pig, you look at taking
>>> these files and putting the JSON records into a sequence file, or a set
>>> of sequence files. (Then look at HBase to help index them...) 200KB is small.
>>>
>>> That would be the same for either Pig or Hive.
>>>
>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>> nice. And yes, you get each record as a row; however, you can always
>>> flatten them as needed.
>>>
>>> Hive?
>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>>> Capriolo could give you a better answer.
>>> Going from memory, I don't know that there is a good Hive SerDe that
>>> would write JSON, just read it.
>>>
>>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>>> and biased.
>>>
>>> I think you're on the right track, or at least train of thought.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>>
>>> Hello,
>>> I'm new to Hadoop.
>>> I have a large quantity of JSON documents with a structure similar to
>>> what is shown below.
>>>
>>>   {
>>>     g    : "some-group-identifier",
>>>     sg   : "some-subgroup-identifier",
>>>     j    : "some-job-identifier",
>>>     page : 23,
>>>     ... // other fields omitted
>>>     important-data : [
>>>       {
>>>         f1 : "abc",
>>>         f2 : "a",
>>>         f3 : "/",
>>>         ...
>>>       },
>>>       ...
>>>       {
>>>         f1 : "xyz",
>>>         f2 : "q",
>>>         f3 : "/",
>>>         ...
>>>       },
>>>     ],
>>>     ... // other fields omitted
>>>     other-important-data : [
>>>       {
>>>         x1  : "ford",
>>>         x2  : "green",
>>>         x3  : 35,
>>>         map : {
>>>           "free-field"       : "value",
>>>           "other-free-field" : "value2"
>>>         }
>>>       },
>>>       ...
>>>       {
>>>         x1 : "vw",
>>>         x2 : "red",
>>>         x3 : 54,
>>>         ...
>>>       },
>>>     ]
>>>   }
>>>
>>> Each file contains a single JSON document (gzip compressed, and roughly
>>> 200KB of pretty-printed JSON text per document when uncompressed).
>>>
>>> I am interested in analyzing only the "important-data" array and the
>>> "other-important-data" array.
>>> My source data would ideally be easier to analyze if it looked like a
>>> couple of tables with a fixed set of columns. Only the column "map" would
>>> be a complex column; all others would be primitives.
>>>
>>>   ( g, sg, j, page, f1, f2, f3 )
>>>
>>>   ( g, sg, j, page, x1, x2, x3, map )
>>>
>>> So, for each JSON document, I would like to "create" several rows, but I
>>> would like to avoid the intermediate step of persisting -and duplicating-
>>> the "flattened" data.
>>>
>>> In order to avoid persisting the flattened data, I thought I had to write
>>> my own map-reduce in Java code, but discovered that others have had the
>>> same problem of using JSON as the source and there are somewhat "standard"
>>> solutions.
>>>
>>> By reading about the SerDe approach for Hive, I get the impression that
>>> each JSON document is transformed into a single "row" of the table, with
>>> some columns being arrays or maps of other nested structures.
>>> a) Is there a way to break each JSON document into several "rows" for a
>>> Hive external table?
>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>> considered the de-facto standard?
>>>
>>> The Pig approach using ElephantBird also seems promising. Does anybody
>>> have pointers to more user documentation on this project? Or is browsing
>>> through the examples on GitHub my only source?
>>>
>>> Thanks
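P.S. For reference, this is roughly the flattening I am ultimately after, written as an untested Pig sketch. I am assuming here that JsonLoader('-nestedLoad') materializes the nested JSON arrays as bags of single-map tuples (I have not verified that yet); the field names are the ones from the sample document above.

-- untested sketch: one output row per element of each document's important-data array
docs = LOAD '/json-pcr/'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json:map[]);

-- assumption: json#'important-data' can be cast to a bag of single-map tuples
items = FOREACH docs GENERATE
          (chararray)json#'g'  AS g,
          (chararray)json#'sg' AS sg,
          (chararray)json#'j'  AS j,
          (long)json#'page'    AS page,
          FLATTEN((bag{tuple(map[])})json#'important-data') AS item:map[];

flat = FOREACH items GENERATE
         g, sg, j, page,
         (chararray)item#'f1' AS f1,
         (chararray)item#'f2' AS f2,
         (chararray)item#'f3' AS f3;

DUMP flat;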