hadoop-user mailing list archives

From Tecno Brain <cerebrotecnolog...@gmail.com>
Subject Re: Aggregating data nested into JSON documents
Date Wed, 19 Jun 2013 22:44:30 GMT
Ok, I found that the elephant-bird JsonLoader cannot handle JSON documents
that are pretty-printed (expanded over multiple lines). The entire JSON
document has to be on a single line.

After I reformatted some of the source files, I am now getting the expected
output.
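
For reference, here is a minimal recap of the script that now works for me,
once each input file holds one JSON document per line (the REGISTER line
stands in for the actual elephant-bird JARs and their dependencies):

REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
-- input must contain one complete JSON document per line
doc = LOAD '/json-pcr/pcr-000001.json' USING
 com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
-- extract two top-level primitive attributes from the JSON map
flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
DUMP flat;
-- expected output: (some value,133)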




On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain
<cerebrotecnologico@gmail.com> wrote:

> I also tried:
>
> doc = LOAD '/json-pcr/pcr-000001.json' USING
>  com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
> flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b'
> AS second;
> DUMP flat;
>
> but I got no output either.
>
>      Input(s):
>      Successfully read 0 records (35863 bytes) from:
> "/json-pcr/pcr-000001.json"
>
>      Output(s):
>      Successfully stored 0 records in:
> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>
>
>
> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <cerebrotecnologico@gmail.com>
> wrote:
>
>> I got Pig and Hive working on a single node and I am able to run some
>> scripts/queries over regular text files (access log files), with one
>> record per line.
>>
>> Now, I want to process some JSON files.
>>
>> As mentioned before, it seems that ElephantBird would be a good solution
>> to read JSON files.
>>
>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>> document. The documents are NOT on a single line, but rather contain
>> pretty-printed JSON expanding over multiple lines.
>>
>> I'm trying something simple, extracting two (primitive) attributes at the
>> top of the document:
>> {
>>    a : "some value",
>>    ...
>>    b : 133,
>>    ...
>> }
>>
>> So, let's start with a LOAD of a single file (single JSON document):
>>
>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>  com.twitter.elephantbird.pig.load.JsonLoader();
>> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS
>> second;
>> DUMP flat;
>>
>> Apparently the job runs without problems, but I get no output. The job
>> log includes this message:
>>
>>    Input(s):
>>    Successfully read 0 records (35863 bytes) from:
>> "/json-pcr/pcr-000001.json"
>>
>> I was expecting to get
>>
>> ( "some value", 133 )
>>
>> Any idea on what I am doing wrong?
>>
>>
>>
>>
>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com>
>> wrote:
>>
>>> I think you have a misconception of HBase.
>>>
>>> You don't actually need to have mutable data for it to be effective.
>>> The key is that you need access to specific records and work with a
>>> very small subset of the data, not the complete data set.
>>>
>>>
>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com>
>>> wrote:
>>>
>>> Hi Mike,
>>>
>>> Yes, I also have thought about HBase or Cassandra, but my data is pretty
>>> much a snapshot; it does not require updates. Most of my aggregations will
>>> also need to be computed once and won't change over time, with the exception
>>> of some aggregations that are based on the last N days of data. Should I
>>> still consider HBase? I think it will probably be good for the
>>> aggregated data.
>>>
>>> I have no idea what sequence files are, but I will take a look. My raw
>>> data is stored in the cloud, not in my Hadoop cluster.
>>>
>>> I'll keep looking at Pig with ElephantBird.
>>> Thanks,
>>>
>>> -Jorge
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> Hi..
>>>>
>>>> Have you thought about HBase?
>>>>
>>>> I would suggest, if you're using Hive or Pig, looking at taking
>>>> these files and putting the JSON records into a sequence file,
>>>> or a set of sequence files. (Then look at HBase to help index them...)
>>>> 200KB is small.
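>>>>
>>>> Something along these lines might do it. This is a rough sketch from
>>>> memory using elephant-bird's SequenceFileStorage; treat the exact
>>>> converter arguments and the trivial key as assumptions to verify:
>>>>
>>>> -- assumes each input line is one complete JSON document
>>>> docs = LOAD '/json-pcr' USING TextLoader() AS (line:chararray);
>>>> -- sequence files store key/value pairs; a constant key is used here
>>>> kv = FOREACH docs GENERATE 0 AS key:int, line AS value;
>>>> STORE kv INTO '/json-pcr-seq' USING
>>>>  com.twitter.elephantbird.pig.store.SequenceFileStorage(
>>>>   '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>>   '-c com.twitter.elephantbird.pig.util.TextConverter');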
>>>>
>>>> That would be the same for either Pig or Hive.
>>>>
>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>>> nice. And yes, you get each record as a row; however, you can always
>>>> flatten them as needed.
>>>>
>>>> Hive?
>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>> Edward Capriolo could give you a better answer.
>>>> Going from memory, I don't know that there is a good Hive SerDe that
>>>> would write JSON, just read it.
>>>>
>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>>>> and biased.
>>>>
>>>> I think you're on the right track or at least train of thought.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com>
>>>> wrote:
>>>>
>>>> Hello,
>>>>    I'm new to Hadoop.
>>>>    I have a large quantity of JSON documents with a structure similar
>>>> to what is shown below.
>>>>
>>>>    {
>>>>      g : "some-group-identifier",
>>>>      sg: "some-subgroup-identifier",
>>>>      j      : "some-job-identifier",
>>>>      page     : 23,
>>>>      ... // other fields omitted
>>>>      important-data : [
>>>>          {
>>>>            f1  : "abc",
>>>>            f2  : "a",
>>>>            f3  : "/",
>>>>            ...
>>>>          },
>>>>          ...
>>>>          {
>>>>            f1 : "xyz",
>>>>            f2  : "q",
>>>>            f3  : "/",
>>>>            ...
>>>>          },
>>>>      ],
>>>>     ... // other fields omitted
>>>>      other-important-data : [
>>>>         {
>>>>            x1  : "ford",
>>>>            x2  : "green",
>>>>            x3  : 35,
>>>>            map : {
>>>>                "free-field" : "value",
>>>>                "other-free-field" : "value2"
>>>>               }
>>>>          },
>>>>          ...
>>>>          {
>>>>            x1 : "vw",
>>>>            x2  : "red",
>>>>            x3  : 54,
>>>>            ...
>>>>          },
>>>>      ]
>>>>    }
>>>>
>>>>
>>>> Each file contains a single JSON document (gzip-compressed, roughly
>>>> 200KB uncompressed of pretty-printed JSON text per document).
>>>>
>>>> I am interested in analyzing only the "important-data" array and the
>>>> "other-important-data" array.
>>>> My source data would be easier to analyze if it looked like a couple
>>>> of tables with a fixed set of columns. Only the "map" column would be
>>>> complex; all the others would be primitives.
>>>>
>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>
>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>
>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>> would like to avoid the intermediate step of persisting -and duplicating-
>>>> the "flattened" data.
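>>>>
>>>> As a hypothetical sketch of what I'm after (untested; I'm assuming
>>>> elephant-bird's '-nestedLoad' option loads nested JSON arrays as bags
>>>> of maps, and the cast below may need adjusting), the first table shape
>>>> might come from something like:
>>>>
>>>> doc = LOAD '/json-pcr/pcr-000001.json' USING
>>>>  com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
>>>> -- one output row per element of the important-data array
>>>> rows = FOREACH doc GENERATE
>>>>          (chararray)json#'g' AS g, (chararray)json#'sg' AS sg,
>>>>          (chararray)json#'j' AS j, (int)json#'page' AS page,
>>>>          FLATTEN((bag{tuple(map[])})json#'important-data') AS item:map[];
>>>> flat = FOREACH rows GENERATE g, sg, j, page,
>>>>          (chararray)item#'f1' AS f1, (chararray)item#'f2' AS f2,
>>>>          (chararray)item#'f3' AS f3;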
>>>>
>>>> In order to avoid persisting the flattened data, I thought I had to
>>>> write my own map-reduce Java code, but discovered that others have had
>>>> the same problem of using JSON as the source and there are somewhat
>>>> "standard" solutions.
>>>>
>>>> By reading about the SerDe approach for Hive, I get the impression that
>>>> each JSON document is transformed into a single "row" of the table, with
>>>> some columns being an array, a map, or other nested structures.
>>>> a) Is there a way to break each JSON document into several "rows" for a
>>>> Hive external table?
>>>> b) It seems there are too many JSON SerDe libraries! Is any of
>>>> them considered the de facto standard?
>>>>
>>>> The Pig approach using Elephant Bird also seems promising. Does anybody
>>>> have pointers to more user documentation on this project? Or is browsing
>>>> through the examples on GitHub my only source?
>>>>
>>>> Thanks
>>>>
