hadoop-common-user mailing list archives

From Tecno Brain <cerebrotecnolog...@gmail.com>
Subject Re: Aggregating data nested into JSON documents
Date Wed, 19 Jun 2013 21:47:23 GMT
I also tried:

doc = LOAD '/json-pcr/pcr-000001.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
DUMP flat;

but I got no output either.

     Input(s):
     Successfully read 0 records (35863 bytes) from:
"/json-pcr/pcr-000001.json"

     Output(s):
     Successfully stored 0 records in:
"hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"



On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain
<cerebrotecnologico@gmail.com> wrote:

> I got Pig and Hive working on a single node and I am able to run some
> scripts/queries over regular text files (access log files), with a record
> per line.
>
> Now, I want to process some JSON files.
>
> As mentioned before, it seems that ElephantBird would be a good solution
> to read JSON files.
>
> I uploaded 5 files to HDFS. Each file contains only a single JSON document.
> The documents are NOT on a single line; rather, they contain pretty-printed
> JSON spanning multiple lines.
>
> I'm trying something simple, extracting two (primitive) attributes at the
> top of the document:
> {
>    a : "some value",
>    ...
>    b : 133,
>    ...
> }
>
> So, let's start with a LOAD of a single file (a single JSON document):
>
> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
> doc = LOAD '/json-pcr/pcr-000001.json'
>       USING com.twitter.elephantbird.pig.load.JsonLoader();
> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
> DUMP flat;
>
> Apparently the job runs without problems, but I get no output. The job
> summary includes this message:
>
>    Input(s):
>    Successfully read 0 records (35863 bytes) from:
> "/json-pcr/pcr-000001.json"
>
> I was expecting to get
>
> ( "some value", 133 )
>
> Any idea on what I am doing wrong?
>
>
>
>
> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> I think you have a misconception of HBase.
>>
>> You don't need to actually have mutable data for it to be effective.
>> The key is that you need to have access to specific records and work with a
>> very small subset of the data, not the complete data set.
>>
>>
>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com>
>> wrote:
>>
>> Hi Mike,
>>
>> Yes, I have also thought about HBase or Cassandra, but my data is pretty
>> much a snapshot; it does not require updates. Most of my aggregations will
>> also need to be computed only once and won't change over time, with the
>> exception of some aggregations based on the last N days of data. Should I
>> still consider HBase? I think it will probably be good for the aggregated
>> data.
>>
>> I have no idea what sequence files are, but I will take a look. My raw
>> data is stored in the cloud, not in my Hadoop cluster.
>>
>> I'll keep looking at Pig with ElephantBird.
>> Thanks,
>>
>> -Jorge
>>
>>
>>
>>
>>
>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com>
>> wrote:
>>
>>> Hi..
>>>
>>> Have you thought about HBase?
>>>
>>> I would suggest that if you're using Hive or Pig, you look at taking
>>> these files and putting the JSON records into a sequence file,
>>> or a set of sequence files. (Then look at HBase to help index them...)
>>> 200KB is small.
>>>
>>> That would be the same for either Pig or Hive.
>>>
>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice.
>>> And yes, you get each record as a row; however, you can always flatten them
>>> as needed.
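>>>
>>> Something along these lines should work for your important-data array (a
>>> rough sketch off the top of my head; the load path is a placeholder, and I'm
>>> assuming one JSON document per line plus the '-nestedLoad' option, which
>>> should hand the array back as a bag of maps):
>>>
>>> -- placeholder path; assumes one JSON doc per line and the -nestedLoad option
>>> doc  = LOAD 'your_json_files'
>>>        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
>>>        AS (json:map[]);
>>> rows = FOREACH doc GENERATE
>>>          (chararray)json#'g'  AS g,
>>>          (chararray)json#'sg' AS sg,
>>>          (chararray)json#'j'  AS j,
>>>          (int)json#'page'     AS page,
>>>          FLATTEN((bag{tuple(map[])})json#'important-data') AS item;
>>> flat = FOREACH rows GENERATE g, sg, j, page,
>>>          (chararray)item#'f1' AS f1,
>>>          (chararray)item#'f2' AS f2,
>>>          (chararray)item#'f3' AS f3;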
>>>
>>> Hive?
>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>>> Capriolo could give you a better answer.
>>> Going from memory, I don't know that there is a good Hive SerDe that would
>>> write JSON, just read it.
>>>
>>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>>> and biased.
>>>
>>> I think you're on the right track or at least train of thought.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com>
>>> wrote:
>>>
>>> Hello,
>>>    I'm new to Hadoop.
>>>    I have a large quantity of JSON documents with a structure similar to
>>> what is shown below.
>>>
>>>    {
>>>      g : "some-group-identifier",
>>>      sg: "some-subgroup-identifier",
>>>      j      : "some-job-identifier",
>>>      page     : 23,
>>>      ... // other fields omitted
>>>      important-data : [
>>>          {
>>>            f1  : "abc",
>>>            f2  : "a",
>>>            f3  : "/"
>>>            ...
>>>          },
>>>          ...
>>>          {
>>>            f1 : "xyz",
>>>            f2  : "q",
>>>            f3  : "/",
>>>            ...
>>>          },
>>>      ],
>>>     ... // other fields omitted
>>>      other-important-data : [
>>>         {
>>>            x1  : "ford",
>>>            x2  : "green",
>>>            x3  : 35,
>>>            map : {
>>>                "free-field" : "value",
>>>                "other-free-field" : "value2"
>>>               }
>>>          },
>>>          ...
>>>          {
>>>            x1 : "vw",
>>>            x2  : "red",
>>>            x3  : 54,
>>>            ...
>>>          },
>>>      ]
>>>    }
>>>
>>>
>>> Each file contains a single JSON document (gzip-compressed; roughly 200KB
>>> of pretty-printed JSON text per document when uncompressed).
>>>
>>> I am interested in analyzing only the "important-data" array and the
>>> "other-important-data" array.
>>> Ideally, my source data would look like a couple of tables with a fixed set
>>> of columns, which would make it easier to analyze. Only the "map" column
>>> would be a complex column; all others would be primitives.
>>>
>>> ( g, sg, j, page, f1, f2, f3 )
>>>
>>> ( g, sg, j, page, x1, x2, x3, map )
>>>
>>> So, for each JSON document, I would like to "create" several rows, but I
>>> would like to avoid the intermediate step of persisting (and duplicating)
>>> the "flattened" data.
>>>
>>> To avoid persisting the flattened data, I thought I had to write my own
>>> MapReduce job in Java, but I discovered that others have had the same
>>> problem of using JSON as the source and that there are somewhat
>>> "standard" solutions.
>>>
>>> By reading about the SerDe approach for Hive, I get the impression that
>>> each JSON document is transformed into a single "row" of the table, with
>>> some columns being an array, a map, or other nested structures.
>>> a) Is there a way to break each JSON document into several "rows" for a
>>> Hive external table?
>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>> considered the de facto standard?
>>>
>>> The Pig approach also seems promising, using Elephant Bird. Does anybody
>>> have pointers to more user documentation on this project? Or is browsing
>>> through the examples on GitHub my only source?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
