hadoop-hdfs-user mailing list archives

From Tecno Brain <cerebrotecnolog...@gmail.com>
Subject Re: Aggregating data nested into JSON documents
Date Wed, 19 Jun 2013 21:36:36 GMT
I got Pig and Hive working on a single node and I am able to run some
scripts/queries over regular text files (access log files), with a record
per line.

Now, I want to process some JSON files.

As mentioned before, it seems that ElephantBird would be a good
solution to read JSON files.

I uploaded 5 files to HDFS. Each file contains only a single JSON document.
The documents are NOT on a single line, but rather contain pretty-printed
JSON spanning multiple lines.

I'm trying something simple, extracting two (primitive) attributes at the
top of the document:
{
   a : "some value",
   ...
   b : 133,
   ...
}

So, let's start with a LOAD of a single file (a single JSON document):

REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
doc  = LOAD '/json-pcr/pcr-000001.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader();
flat = FOREACH doc GENERATE (chararray)$0#'a' AS first,
       (long)$0#'b' AS second;
DUMP flat;

Apparently the job runs without problems, but I get no output. The console
output includes this message:

   Input(s):
   Successfully read 0 records (35863 bytes) from: "/json-pcr/pcr-000001.json"

I was expecting to get:

("some value", 133)

Any idea on what I am doing wrong?
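
One thing I still need to confirm, so this is just a guess: my
understanding is that ElephantBird's JsonLoader expects one JSON record
per line, which would explain why it read 0 records from my
pretty-printed file. If that's right, collapsing each document to a
single line before uploading should make the same script work:

-- hypothetical file: the same document collapsed to a single line
-- (e.g. with an external tool) before being uploaded to HDFS
doc  = LOAD '/json-pcr/pcr-000001-oneline.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader();
flat = FOREACH doc GENERATE (chararray)$0#'a' AS first,
       (long)$0#'b' AS second;
DUMP flat;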




On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:

> I think you have a misconception of HBase.
>
> You don't need to actually have mutable data for it to be effective.
> The key is that you need access to specific records and to work with a
> very small subset of the data, not the complete data set.
>
>
> On Jun 13, 2013, at 11:59 AM, Tecno Brain <cerebrotecnologico@gmail.com>
> wrote:
>
> Hi Mike,
>
> Yes, I have also thought about HBase or Cassandra, but my data is pretty
> much a snapshot; it does not require updates. Most of my aggregations will
> also need to be computed only once and won't change over time, with the
> exception of some aggregations based on the last N days of data. Should I
> still consider HBase? I think it will probably be a good fit for the
> aggregated data.
>
> I have no idea what sequence files are, but I will take a look. My raw
> data is stored in the cloud, not in my Hadoop cluster.
>
> I'll keep looking at Pig with ElephantBird.
> Thanks,
>
> -Jorge
>
>
>
>
>
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> Hi..
>>
>> Have you thought about HBase?
>>
>> I would suggest, if you're using Hive or Pig, that you look at taking these
>> files and putting the JSON records into a sequence file, or a set of
>> sequence files.... (Then look at HBase to help index them...)
>> 200KB is small.
>>
>> That would be the same for either Pig or Hive.
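>>
>> Untested and from memory -- assuming Text keys and values, with one JSON
>> document per value, reading them back in Pig could look something like
>> this (piggybank's loader here, not ElephantBird's):
>>
>> REGISTER piggybank.jar;  -- ships with the Pig distribution
>> -- hypothetical path; one JSON document per value
>> records = LOAD '/json-seq/part-*'
>>     USING org.apache.pig.piggybank.storage.SequenceFileLoader()
>>     AS (key:chararray, json:chararray);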
>>
>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice.
>> And yes, you get each record as a row; however, you can always flatten them
>> as needed, as in the sketch below.
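>>
>> As a sketch from memory (assuming the loader hands you each document as a
>> map, with the JSON arrays as bags of maps; you may need an explicit cast):
>>
>> -- one output row per element of the 'important-data' array
>> rows = FOREACH docs GENERATE FLATTEN($0#'important-data');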
>>
>> Hive?
>> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
>> Capriolo could give you a better answer.
>> Going from memory, I don't know of a good Hive SerDe that would
>> write JSON, just read it.
>>
>> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
>> and biased.
>>
>> I think you're on the right track, or at least the right train of thought.
>>
>> HTH
>>
>> -Mike
>>
>>
>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <cerebrotecnologico@gmail.com>
>> wrote:
>>
>> Hello,
>>    I'm new to Hadoop.
>>    I have a large quantity of JSON documents with a structure similar to
>> what is shown below.
>>
>>    {
>>      g : "some-group-identifier",
>>      sg: "some-subgroup-identifier",
>>      j      : "some-job-identifier",
>>      page     : 23,
>>      ... // other fields omitted
>>      important-data : [
>>          {
>>            f1  : "abc",
>>            f2  : "a",
>>            f3  : "/",
>>            ...
>>          },
>>          ...
>>          {
>>            f1 : "xyz",
>>            f2  : "q",
>>            f3  : "/",
>>            ...
>>          },
>>      ],
>>     ... // other fields omitted
>>      other-important-data : [
>>         {
>>            x1  : "ford",
>>            x2  : "green",
>>            x3  : 35,
>>            map : {
>>                "free-field" : "value",
>>                "other-free-field" : "value2"
>>               }
>>          },
>>          ...
>>          {
>>            x1 : "vw",
>>            x2  : "red",
>>            x3  : 54,
>>            ...
>>          },
>>      ]
>>    }
>>
>>
>> Each file contains a single JSON document (gzip compressed; roughly
>> 200KB of uncompressed, pretty-printed JSON text per document).
>>
>> I am interested in analyzing only the "important-data" array and the
>> "other-important-data" array.
>> My source data would ideally be easier to analyze if it looked like a
>> couple of tables with a fixed set of columns. Only the column "map" would
>> be a complex column; all others would be primitives.
>>
>> ( g, sg, j, page, f1, f2, f3 )
>>
>> ( g, sg, j, page, x1, x2, x3, map )
>>
>> So, for each JSON document, I would like to "create" several rows, but I
>> would like to avoid the intermediate step of persisting -and duplicating-
>> the "flattened" data.
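>>
>> In Pig-like pseudocode, the two relations I'm after would be shaped
>> roughly like this (purely hypothetical, just to illustrate the result;
>> 'docs' stands for whatever relation the JSON loader produces):
>>
>> rows1 = FOREACH docs GENERATE $0#'g' AS g, $0#'sg' AS sg, $0#'j' AS j,
>>     (int)$0#'page' AS page, FLATTEN($0#'important-data');
>> rows2 = FOREACH docs GENERATE $0#'g' AS g, $0#'sg' AS sg, $0#'j' AS j,
>>     (int)$0#'page' AS page, FLATTEN($0#'other-important-data');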
>>
>> To avoid persisting the data flattened, I thought I had to write my own
>> MapReduce job in Java, but I discovered that others have had the same
>> problem of using JSON as the source, and that there are somewhat "standard"
>> solutions.
>>
>> By reading about the SerDe approach for Hive, I get the impression that
>> each JSON document is transformed into a single "row" of the table, with
>> some columns being arrays, maps, or other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a
>> Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Are any of them
>> considered the de-facto standard?
>>
>> The Pig approach using ElephantBird also seems promising. Does anybody
>> have pointers to more user documentation on this project? Or is browsing
>> through the examples on GitHub my only option?
>>
>> Thanks
>>
