From: Tecno Brain <cerebrotecnologico@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 20 Jun 2013 12:05:50 -0700
Subject: Re: Aggregating data nested into JSON documents

Never mind, I got the solution!

uberflat = FOREACH flat GENERATE g, sg,
           FLATTEN(important-data#'f1') AS f1,
           FLATTEN(important-data#'f2') AS f2;

-Jorge
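
For the archives: putting the pieces of this thread together, the whole
flow is roughly the following. This is a sketch, not a tested script:
the input path and jar name are placeholders, and the alias used to
dereference the flattened map may need adjusting for your Pig and
elephant-bird versions.

REGISTER 'elephant-bird-and-dependencies.jar';  -- placeholder name
doc = LOAD '/example.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
      AS (json:map[]);
-- one row per element of the important-data array
flat = FOREACH doc GENERATE (chararray)json#'g' AS g,
       (chararray)json#'sg' AS sg,
       FLATTEN(json#'important-data');
-- pull the scalar fields out of each flattened map
uberflat = FOREACH flat GENERATE g, sg,
           FLATTEN(important-data#'f1') AS f1,
           FLATTEN(important-data#'f2') AS f2;
DUMP uberflat;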

On Thu, Jun 20, 2013 at 11:54 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:

> OK, I'll go back to my original question (although this time I know
> what tools I'm using).
>
> I am using Pig + ElephantBird.
>
> I have JSON documents with the following structure:
>
> {
>   g    : "some-group-identifier",
>   sg   : "some-subgroup-identifier",
>   j    : "some-job-identifier",
>   page : 23,
>   ... // other fields omitted
>   important-data : [
>     {
>       f1 : "abc",
>       f2 : "a",
>       f3 : "/",
>       ...
>     },
>     ...
>     {
>       f1 : "xyz",
>       f2 : "q",
>       f3 : "/",
>       ...
>     }
>   ],
>   ... // other fields omitted
> }
>
> I want Pig to GENERATE a tuple for each element of the
> "important-data" array attribute. For the example above, I would like
> to generate:
>
> ( "some-group-identifier", "some-subgroup-identifier", 23, "abc", "a", "/" )
> ( "some-group-identifier", "some-subgroup-identifier", 23, "xyz", "q", "/" )
>
> This is what I have tried:
>
> doc = LOAD '/example.json' USING
>       com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
>       AS (json:map[]);
> flat = FOREACH doc GENERATE (chararray)json#'g' AS g,
>        (chararray)json#'sg' AS sg, (long)json#'page' AS page,
>        FLATTEN( json#'important-data') ;
> DUMP flat;
>
> but that produces:
>
> ( "some-group-identifier", "some-subgroup-identifier", 23, [ f1#abc, f2#a, f3#/ ] )
> ( "some-group-identifier", "some-subgroup-identifier", 23, [ f1#xyz, f2#q, f3#/ ] )
>
> Close, but not exactly what I want.
>
> Do I need to use ProtoBuf?
>
> -Jorge
>
>
> On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>
>> Ok, I found that elephant-bird's JsonLoader cannot handle JSON
>> documents that are pretty-printed (expanding over multiple lines).
>> The entire JSON document has to be on a single line.
>>
>> After I reformatted some of the source files, now I am getting the
>> expected output.
>>
>>
>> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>
>>> I also tried:
>>>
>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>       com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>>> flat = FOREACH doc GENERATE (chararray)json#'a' AS first,
>>>        (long)json#'b' AS second ;
>>> DUMP flat;
>>>
>>> but I got no output either.
>>>
>>> Input(s):
>>> Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>>>
>>> Output(s):
>>> Successfully stored 0 records in:
>>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
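>>>
>>> A quick sanity check when a job "succeeds" but emits nothing (a
>>> sketch using only core Pig built-ins, run against the same load):
>>>
>>> counted = FOREACH (GROUP doc ALL) GENERATE COUNT_STAR(doc);
>>> DUMP counted;  -- prints 0 when the loader parsed no records at all
>>>
>>> A 0 here means the problem is in parsing (as it turned out below:
>>> the pretty-printed JSON), not in the projection.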
>>>
>>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>>
>>>> I got Pig and Hive working on a single node and I am able to run
>>>> some scripts/queries over regular text files (access log files),
>>>> with a record per line.
>>>>
>>>> Now, I want to process some JSON files.
>>>>
>>>> As mentioned before, it seems that ElephantBird would be a good
>>>> solution to read JSON files.
>>>>
>>>> I uploaded 5 files to HDFS. Each file only contains a single JSON
>>>> document. The documents are NOT on a single line, but rather
>>>> contain pretty-printed JSON expanding over multiple lines.
>>>>
>>>> I'm trying something simple, extracting two (primitive) attributes
>>>> at the top of the document:
>>>>
>>>> {
>>>>   a : "some value",
>>>>   ...
>>>>   b : 133,
>>>>   ...
>>>> }
>>>>
>>>> So, let's start with a LOAD of a single file (single JSON document):
>>>>
>>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>>       com.twitter.elephantbird.pig.load.JsonLoader();
>>>> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first,
>>>>        (long)$0#'b' AS second ;
>>>> DUMP flat;
>>>>
>>>> Apparently the job runs without problems, but I get no output. The
>>>> output I get includes this message:
>>>>
>>>> Input(s):
>>>> Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"
>>>>
>>>> I was expecting to get
>>>>
>>>> ( "some value", 133 )
>>>>
>>>> Any idea on what I am doing wrong?
>>>>
>>>>
>>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>
>>>>> I think you have a misconception of HBase.
>>>>>
>>>>> You don't need to actually have mutable data for it to be
>>>>> effective. The key is that you need access to specific records and
>>>>> to work on a very small subset of the data, not the complete data
>>>>> set.
>>>>>
>>>>>
>>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Yes, I have also thought about HBase or Cassandra, but my data is
>>>>> pretty much a snapshot; it does not require updates. Most of my
>>>>> aggregations will also need to be computed once and won't change
>>>>> over time, with the exception of some aggregations based on the
>>>>> last N days of data. Should I still consider HBase? I think it
>>>>> will probably be good for the aggregated data.
>>>>>
>>>>> I have no idea what sequence files are, but I will take a look.
>>>>> My raw data is stored in the cloud, not in my Hadoop cluster.
>>>>>
>>>>> I'll keep looking at Pig with ElephantBird.
>>>>> Thanks,
>>>>>
>>>>> -Jorge
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> Hi..
>>>>>>
>>>>>> Have you thought about HBase?
>>>>>>
>>>>>> I would suggest that if you're using Hive or Pig, you look at
>>>>>> taking these files and putting the JSON records into a sequence
>>>>>> file, or a set of sequence files. (Then look at HBase to help
>>>>>> index them...) 200KB is small.
>>>>>>
>>>>>> That would be the same for either Pig or Hive.
>>>>>>
>>>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's
>>>>>> pretty nice. And yes, you get each record as a row, but you can
>>>>>> always flatten them as needed.
>>>>>>
>>>>>> Hive?
>>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>>> Edward Capriolo could give you a better answer. Going from
>>>>>> memory, I don't know that there is a good Hive SerDe that would
>>>>>> write JSON, just ones that read it.
>>>>>>
>>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>>> dated and biased.
>>>>>>
>>>>>> I think you're on the right track, or at least the right train of
>>>>>> thought.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
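>>>>>>
>>>>>> On the small-files point: once each document sits on a single
>>>>>> line, even a plain load-and-store pass in Pig can repack many
>>>>>> small inputs into fewer, larger files (a sketch that relies on
>>>>>> Pig's default combining of small input splits; the paths are
>>>>>> placeholders, and sequence files would be the sturdier variant):
>>>>>>
>>>>>> raw = LOAD '/json-docs' USING TextLoader() AS (line:chararray);
>>>>>> STORE raw INTO '/json-packed';  -- default PigStorage, one document per line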
Only the column "map" would >>>>>> be a complex column, all others would be primitives. >>>>>> >>>>>> ( g, sg, j, page, f1, f2, f3 ) >>>>>> >>>>>> ( g, sg, j, page, x1, x2, x3, map ) >>>>>> >>>>>> So, for each JSON document, I would like to "create" several rows, >>>>>> but I would like to avoid the intermediate step of persisting -and >>>>>> duplicating- the "flattened" data. >>>>>> >>>>>> In order to avoid persisting the data flattened, I thought I had to >>>>>> write my own map-reduce in Java code, but discovered that others have had >>>>>> the same problem of using JSON as the source and there are somewhat >>>>>> "standard" solutions. >>>>>> >>>>>> By reading about the SerDe approach for Hive I get the impression >>>>>> that each JSON document is transformed into a single "row" of the table >>>>>> with some columns being an array, a map of other nested structures. >>>>>> a) Is there a way to break each JSON document into several "rows" for >>>>>> a Hive external table? >>>>>> b) It seems there are too many JSON SerDe libraries! Is there any of >>>>>> them considered the de-facto standard? >>>>>> >>>>>> The Pig approach seems also promising using Elephant Bird Do anybody >>>>>> has pointers to more user documentation on this project? Or is browsing >>>>>> through the examples in GitHub my only source? >>>>>> >>>>>> Thanks >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >> > --20cf307cfc407829d604df9aa2a1 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable