Subject: Re: Aggregating data nested into JSON documents
From: Tecno Brain <cerebrotecnologico@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 19 Jun 2013 14:36:36 -0700

I got Pig and Hive working on a single node and I am able to run some
scripts/queries over regular text files (access log files), with one record
per line.

Now I want to process some JSON files.

As mentioned before, it seems that ElephantBird would be a good solution for
reading JSON files.

I uploaded 5 files to HDFS. Each file contains only a single JSON document.
The documents are NOT on a single line; they contain pretty-printed JSON
spanning multiple lines.

I'm trying something simple: extracting two (primitive) attributes at the top
of the document:

    {
       a : "some value",
       ...
       b : 133,
       ...
    }

So, let's start with a LOAD of a single file (a single JSON document):

    REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
    doc  = LOAD '/json-pcr/pcr-000001.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader();
    flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
    DUMP flat;

Apparently the job runs without problems, but I get no output. The output I
get includes this message:

    Input(s):
    Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"

I was expecting to get

    ( "some value", 133 )

Any idea on what I am doing wrong?
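One guess I have (not verified) is that JsonLoader parses its input one line
at a time, expecting one JSON object per line, so a pretty-printed document
spread across many lines would yield zero parsable records. If that is the
case, collapsing each document onto a single line and rerunning the same
script should work; the "-oneline" file name below is just a hypothetical
copy prepared that way:

    -- same script as above, pointed at a hypothetical copy of the document
    -- collapsed onto a single line (one JSON object per line)
    REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
    doc  = LOAD '/json-pcr/pcr-000001-oneline.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader();
    flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
    DUMP flat;

Does that sound plausible, or should the loader handle multi-line documents
as they are?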
On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel wrote:

> I think you have a misconception of HBase.
>
> You don't need to actually have mutable data for it to be effective.
> The key is that you need access to specific records and to work on a very
> small subset of the data, not the complete data set.
>
>
> On Jun 13, 2013, at 11:59 AM, Tecno Brain wrote:
>
> Hi Mike,
>
> Yes, I have also thought about HBase or Cassandra, but my data is pretty
> much a snapshot; it does not require updates. Most of my aggregations will
> also need to be computed only once and won't change over time, with the
> exception of some aggregations that are based on the last N days of data.
> Should I still consider HBase? I think it will probably be good for the
> aggregated data.
>
> I have no idea what sequence files are, but I will take a look. My raw
> data is stored in the cloud, not in my Hadoop cluster.
>
> I'll keep looking at Pig with ElephantBird.
> Thanks,
>
> -Jorge
>
>
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel wrote:
>
>> Hi..
>>
>> Have you thought about HBase?
>>
>> I would suggest that if you're using Hive or Pig, you look at taking
>> these files and putting the JSON records into a sequence file.
>> Or a set of sequence files... (Then look at HBase to help index them...)
>> 200KB is small.
>>
>> That would be the same for either Pig or Hive.
>>
>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>> nice. And yes, you get each record as a row, but you can always flatten
>> them as needed.
>>
>> Hive?
>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>> Capriolo could give you a better answer.
>> Going from memory, I don't know that there is a good SerDe that would
>> write JSON, just read it. (Hive)
>>
>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>> and biased.
>>
>> I think you're on the right track, or at least the right train of thought.
>>
>> HTH
>>
>> -Mike
>>
>>
>> On Jun 12, 2013, at 7:57 PM, Tecno Brain wrote:
>>
>> Hello,
>> I'm new to Hadoop.
>> I have a large quantity of JSON documents with a structure similar to
>> what is shown below.
>>
>> {
>>   g  : "some-group-identifier",
>>   sg : "some-subgroup-identifier",
>>   j  : "some-job-identifier",
>>   page : 23,
>>   ... // other fields omitted
>>   important-data : [
>>     {
>>       f1 : "abc",
>>       f2 : "a",
>>       f3 : "/",
>>       ...
>>     },
>>     ...
>>     {
>>       f1 : "xyz",
>>       f2 : "q",
>>       f3 : "/",
>>       ...
>>     },
>>   ],
>>   ... // other fields omitted
>>   other-important-data : [
>>     {
>>       x1 : "ford",
>>       x2 : "green",
>>       x3 : 35,
>>       map : {
>>         "free-field" : "value",
>>         "other-free-field" : "value2"
>>       }
>>     },
>>     ...
>>     {
>>       x1 : "vw",
>>       x2 : "red",
>>       x3 : 54,
>>       ...
>>     },
>>   ]
>> }
>>
>>
>> Each file contains a single JSON document (gzip compressed, roughly
>> 200KB uncompressed of pretty-printed JSON text per document).
>>
>> I am interested in analyzing only the "important-data" array and the
>> "other-important-data" array.
>> My source data would ideally be easier to analyze if it looked like a
>> couple of tables with a fixed set of columns. Only the column "map" would
>> be a complex column; all others would be primitives.
>>
>> ( g, sg, j, page, f1, f2, f3 )
>>
>> ( g, sg, j, page, x1, x2, x3, map )
>>
>> So, for each JSON document, I would like to "create" several rows, but I
>> would like to avoid the intermediate step of persisting -and duplicating-
>> the "flattened" data.
>>
>> To avoid persisting the flattened data, I thought I had to write my own
>> map-reduce job in Java, but I discovered that others have had the same
>> problem of using JSON as the source and that there are somewhat "standard"
>> solutions.
>>
>> From reading about the SerDe approach for Hive, I get the impression that
>> each JSON document is transformed into a single "row" of the table, with
>> some columns being an array or a map of other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a
>> Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Is any of them
>> considered the de-facto standard?
>>
>> The Pig approach also seems promising, using Elephant Bird. Does anybody
>> have pointers to more user documentation on this project? Or is browsing
>> through the examples on GitHub my only source?
>>
>> Thanks
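For reference, coming back to the flattening question quoted above: what I
plan to try once the basic load works is roughly the sketch below. It assumes
that JsonLoader's '-nestedLoad' option exposes the nested arrays as bags that
FLATTEN can expand into one row per element; I have not verified the exact
aliases and casts, so they may need adjusting.

    REGISTER 'bunch of JAR files from elephant-bird and its dependencies';

    -- '-nestedLoad' (assumed) keeps nested arrays/maps instead of discarding them
    docs = LOAD '/json-pcr/*.json'
           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

    -- one row per element of the important-data array,
    -- carrying the top-level fields along
    rows = FOREACH docs GENERATE
             (chararray)$0#'g'    AS g,
             (chararray)$0#'sg'   AS sg,
             (chararray)$0#'j'    AS j,
             (long)$0#'page'      AS page,
             FLATTEN($0#'important-data') AS item;

    -- project the per-item fields into the target (g, sg, j, page, f1, f2, f3) shape
    flat = FOREACH rows GENERATE
             g, sg, j, page,
             (chararray)item#'f1' AS f1,
             (chararray)item#'f2' AS f2,
             (chararray)item#'f3' AS f3;

    DUMP flat;

The "other-important-data" array would get the same treatment with x1, x2, x3
and the "map" field.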