hadoop-mapreduce-user mailing list archives

From Tecno Brain <cerebrotecnolog...@gmail.com>
Subject Re: Aggregating data nested into JSON documents
Date Thu, 20 Jun 2013 19:05:50 GMT
Never mind, I got the solution!

uberflat = FOREACH flat GENERATE g, sg,
              FLATTEN(important-data#'f1') as f1,
              FLATTEN(important-data#'f2') as f2;
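
In case it helps anyone searching the archives, here is the whole pipeline
in one place. This is only a sketch: the flattened map has no alias, so I
refer to it positionally ($3), and the field names ('g', 'sg', 'page',
'f1', 'f2', 'f3') are just the ones from my sample document.

REGISTER 'bunch of JAR files from elephant-bird and its dependencies';

doc = LOAD '/example.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
      AS (json:map[]);

-- one row per element of the 'important-data' array; the element's map
-- ends up as the fourth (unnamed) field, hence $3 below
flat = FOREACH doc GENERATE (chararray)json#'g' AS g,
                            (chararray)json#'sg' AS sg,
                            (long)json#'page' AS page,
                            FLATTEN(json#'important-data');

-- pull the individual fields out of the flattened map
uberflat = FOREACH flat GENERATE g, sg, page,
                                 (chararray)$3#'f1' AS f1,
                                 (chararray)$3#'f2' AS f2,
                                 (chararray)$3#'f3' AS f3;

DUMP uberflat;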

-Jorge


On Thu, Jun 20, 2013 at 11:54 AM, Tecno Brain
<cerebrotecnologico@gmail.com>wrote:

> OK, I'll go back to my original question (although this time I know what
> tools I'm using).
>
> I am using Pig + ElephantBird.
>
> I have JSON documents with the following structure:
> {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/"
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ...
>          },
>      ]
>     ... // other fields omitted
> }
>
> I want Pig to GENERATE a tuple for each element of the "important-data"
> array attribute. For the example above, I would like to generate:
>
> ( "some-group-identifier" , "some-subgroup-identifier", 23, "abc", "a",
> "/" )
> ( "some-group-identifier" , "some-subgroup-identifier", 23, "xyz", "q",
> "/" )
>
> This is what I have tried:
>
> doc = LOAD '/example.json' USING
>      com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as
> (json:map[]);
> flat = FOREACH doc GENERATE (chararray)json#'g' AS g,
>        (chararray)json#'sg' AS sg, (long)json#'page' AS page,
>        FLATTEN(json#'important-data');
> DUMP flat;
>
> but that produces:
>
> ( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#abc,
> f2#a, f3#/ ] )
> ( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#xyz,
> f2#q, f3#/ ] )
>
> Close, but not exactly what I want.
>
> Do I need to use ProtoBuf?
>
> -Jorge
>
>
> On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <cerebrotecnologico@gmail.com
> > wrote:
>
>> Ok, I found that the elephant-bird JsonLoader cannot handle JSON documents
>> that are pretty-printed (i.e., spanning multiple lines). The entire JSON
>> document has to be on a single line.
>>
>> After I reformatted some of the source files, I am now getting the
>> expected output.
>>
>>
>>
>>
>> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <
>> cerebrotecnologico@gmail.com> wrote:
>>
>>> I also tried:
>>>
>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>  com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>>>  flat = FOREACH doc  GENERATE  (chararray)json#'a' AS first,
>>> (long)json#'b' AS second ;
>>> DUMP flat;
>>>
>>> but I got no output either.
>>>
>>>      Input(s):
>>>      Successfully read 0 records (35863 bytes) from:
>>> "/json-pcr/pcr-000001.json"
>>>
>>>      Output(s):
>>>      Successfully stored 0 records in:
>>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>>>
>>>
>>>
>>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <
>>> cerebrotecnologico@gmail.com> wrote:
>>>
>>>> I got Pig and Hive working on a single node and I am able to run some
>>>> scripts/queries over regular text files (access log files), with one
>>>> record per line.
>>>>
>>>> Now, I want to process some JSON files.
>>>>
>>>> As mentioned before, it seems that ElephantBird would be a good
>>>> solution for reading JSON files.
>>>>
>>>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>>>> document. The documents are NOT on a single line, but rather contain
>>>> pretty-printed JSON spanning multiple lines.
>>>>
>>>> I'm trying something simple, extracting two (primitive) attributes at
>>>> the top of the document:
>>>> {
>>>>    a : "some value",
>>>>    ...
>>>>    b : 133,
>>>>    ...
>>>> }
>>>>
>>>> So, let's start with a LOAD of a single file (a single JSON document):
>>>>
>>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>>> doc = LOAD '/json-pcr/pcr-000001.json' using
>>>>  com.twitter.elephantbird.pig.load.JsonLoader();
>>>> flat  = FOREACH doc GENERATE (chararray)$0#'a' AS  first, (long)$0#'b'
>>>> AS second ;
>>>> DUMP flat;
>>>>
>>>> Apparently the job runs without problem, but I get no output. The
>>>> output I get includes this message:
>>>>
>>>>    Input(s):
>>>>    Successfully read 0 records (35863 bytes) from:
>>>> "/json-pcr/pcr-000001.json"
>>>>
>>>> I was expecting to get
>>>>
>>>> ( "some value", 133 )
>>>>
>>>> Any idea on what I am doing wrong?
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <
>>>> michael_segel@hotmail.com> wrote:
>>>>
>>>>> I think you have a misconception of HBase.
>>>>>
>>>>> You don't need to actually have mutable data for it to be effective.
>>>>> The key is that you need access to specific records and to work with a
>>>>> very small subset of the data, not the complete data set.
>>>>>
>>>>>
>>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <
>>>>> cerebrotecnologico@gmail.com> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Yes, I have also thought about HBase or Cassandra, but my data is
>>>>> pretty much a snapshot; it does not require updates. Most of my
>>>>> aggregations will also need to be computed only once and won't change
>>>>> over time, with the exception of some aggregations based on the last N
>>>>> days of data. Should I still consider HBase? I think it will probably
>>>>> be a good fit for the aggregated data.
>>>>>
>>>>> I have no idea what sequence files are, but I will take a look. My
>>>>> raw data is stored in the cloud, not in my Hadoop cluster.
>>>>>
>>>>> I'll keep looking at Pig with ElephantBird.
>>>>> Thanks,
>>>>>
>>>>> -Jorge
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <
>>>>> michael_segel@hotmail.com> wrote:
>>>>>
>>>>>> Hi..
>>>>>>
>>>>>> Have you thought about HBase?
>>>>>>
>>>>>> I would suggest that if you're using Hive or Pig, you look at taking
>>>>>> these files and putting the JSON records into a sequence file, or a
>>>>>> set of sequence files. (Then look at HBase to help index them...)
>>>>>> 200KB is small.
>>>>>>
>>>>>> That would be the same for either Pig or Hive.
>>>>>>
>>>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>>>>> nice. And yes, you get each record as a row, but you can always
>>>>>> flatten them as needed.
>>>>>>
>>>>>> Hive?
>>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>>> Edward Capriolo could give you a better answer.
>>>>>> Going from memory, I don't know that there is a good SerDe that would
>>>>>> write JSON, just read it. (Hive)
>>>>>>
>>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>>> dated and biased.
>>>>>>
>>>>>> I think you're on the right track or at least train of thought.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>
>>>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <
>>>>>> cerebrotecnologico@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>    I'm new to Hadoop.
>>>>>>    I have a large quantity of JSON documents with a structure similar
>>>>>> to what is shown below.
>>>>>>
>>>>>>    {
>>>>>>      g : "some-group-identifier",
>>>>>>      sg: "some-subgroup-identifier",
>>>>>>      j      : "some-job-identifier",
>>>>>>      page     : 23,
>>>>>>      ... // other fields omitted
>>>>>>      important-data : [
>>>>>>          {
>>>>>>            f1  : "abc",
>>>>>>            f2  : "a",
>>>>>>            f3  : "/"
>>>>>>            ...
>>>>>>          },
>>>>>>          ...
>>>>>>          {
>>>>>>            f1 : "xyz",
>>>>>>            f2  : "q",
>>>>>>            f3  : "/",
>>>>>>            ...
>>>>>>          },
>>>>>>      ],
>>>>>>     ... // other fields omitted
>>>>>>      other-important-data : [
>>>>>>         {
>>>>>>            x1  : "ford",
>>>>>>            x2  : "green",
>>>>>>            x3  : 35,
>>>>>>            map : {
>>>>>>                "free-field" : "value",
>>>>>>                "other-free-field" : "value2"
>>>>>>               }
>>>>>>          },
>>>>>>          ...
>>>>>>          {
>>>>>>            x1 : "vw",
>>>>>>            x2  : "red",
>>>>>>            x3  : 54,
>>>>>>            ...
>>>>>>          },
>>>>>>      ]
>>>>>>    }
>>>>>>
>>>>>>
>>>>>> Each file contains a single JSON document (gzip compressed, and
>>>>>> roughly 200KB uncompressed of pretty-printed JSON text per
>>>>>> document).
>>>>>>
>>>>>> I am interested in analyzing only the "important-data" array and the
>>>>>> "other-important-data" array.
>>>>>> My source data would ideally be easier to analyze if it looked like a
>>>>>> couple of tables with a fixed set of columns. Only the column "map"
>>>>>> would be a complex column; all the others would be primitives.
>>>>>>
>>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>>
>>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>>
>>>>>> So, for each JSON document, I would like to "create" several rows,
>>>>>> but I would like to avoid the intermediate step of persisting -and
>>>>>> duplicating- the "flattened" data.
>>>>>>
>>>>>> In order to avoid persisting the flattened data, I thought I had to
>>>>>> write my own map-reduce job in Java, but I discovered that others have
>>>>>> had the same problem of using JSON as the source and that there are
>>>>>> somewhat "standard" solutions.
>>>>>>
>>>>>> By reading about the SerDe approach for Hive, I get the impression
>>>>>> that each JSON document is transformed into a single "row" of the
>>>>>> table, with some columns being an array or a map of other nested
>>>>>> structures.
>>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>>> a Hive external table?
>>>>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>>>>> considered the de-facto standard?
>>>>>>
>>>>>> The Pig approach using Elephant Bird also seems promising. Does
>>>>>> anybody have pointers to more user documentation for this project?
>>>>>> Or is browsing through the examples on GitHub my only source?
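>>>>>>
>>>>>> Just to show the shape of the output I'm after, here is a rough,
>>>>>> untested sketch of what I imagine the Pig side could look like (the
>>>>>> input path is a placeholder, and I haven't verified the loader
>>>>>> options yet):
>>>>>>
>>>>>> doc = LOAD '/path/to/json-docs'   -- placeholder path
>>>>>>       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
>>>>>>       AS (json:map[]);
>>>>>> -- one row per element of the 'important-data' array
>>>>>> t1 = FOREACH doc GENERATE json#'g', json#'sg', json#'j', json#'page',
>>>>>>                           FLATTEN(json#'important-data');
>>>>>> -- one row per element of the 'other-important-data' array
>>>>>> t2 = FOREACH doc GENERATE json#'g', json#'sg', json#'j', json#'page',
>>>>>>                           FLATTEN(json#'other-important-data');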
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
