From: Panshul Whisper <ouchwhisper@gmail.com>
To: user@hbase.apache.org, user@hadoop.apache.org
Date: Thu, 7 Feb 2013 15:24:53 +0100
Subject: Re: MapReduce to load data in HBase

I am using the MapReduce approach. I was looking into Avro to create my own
custom data types to pass from the Mapper to the Reducer. With Avro I would
need to maintain a schema for every type of JSON file I receive, and since
there will be many different MapReduce jobs running, that means a different
schema for each type.

1. Since the JSON schema may change frequently (almost 3 times every month),
is it advisable to use Avro to create custom data types? Or could I instead
store the Java object in the distributed cache and pass the key to that
object on to the Reducer?

2. Will there be any performance issues with using the distributed cache?
The data volume will be very large, and very high throughput is required.
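Roughly, this is what I have in mind for the Avro route (a minimal sketch,
assuming the new-API avro-mapred classes from org.apache.avro.mapreduce are
on the classpath; the Event record, its fields, and the parsed values are
made-up placeholders):

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroValue;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JsonToAvroMapper
            extends Mapper<LongWritable, Text, Text, AvroValue<GenericRecord>> {

        // Placeholder schema for one type of incoming JSON line.
        static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"long\"}]}");

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Real code would parse the JSON line here; parsing is elided.
            GenericRecord rec = new GenericData.Record(SCHEMA);
            rec.put("id", "someId");
            rec.put("value", 42L);
            ctx.write(new Text("someId"), new AvroValue<GenericRecord>(rec));
        }
    }

    // Driver wiring, so the map output value is serialized with Avro
    // between the map and reduce phases:
    //   Job job = new Job(conf, "json-to-hbase");
    //   job.setMapOutputKeyClass(Text.class);
    //   org.apache.avro.mapreduce.AvroJob
    //       .setMapOutputValueSchema(job, JsonToAvroMapper.SCHEMA);

The Reducer side then declares AvroValue<GenericRecord> as its input value
type.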
Thanking You,
Regards,


On Thu, Feb 7, 2013 at 2:23 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Size is not a problem; a frequently changing schema might be.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Thu, Feb 7, 2013 at 6:25 PM, Panshul Whisper <ouchwhisper@gmail.com> wrote:
>
> > Hello,
> >
> > Thank you for the replies.
> >
> > I have not used Pig yet; I am looking into it. I wanted to implement
> > both approaches.
> > Are Pig scripts maintainable? The JSON structure I will be receiving
> > will change quite often, almost 3 times a month.
> > I will be processing 24 million JSON files per month.
> > I am getting one big file with almost 3 million JSON documents
> > aggregated, one JSON document per line. I need to process this file
> > and store all the values into HBase.
> >
> > Thanking You,
> >
> >
> > On Thu, Feb 7, 2013 at 12:59 PM, Mohammad Tariq <dontariq@gmail.com>
> > wrote:
> >
> > > Good point sir. If Pig fits into Panshul's requirements then it's a
> > > much better option.
> > >
> > > Warm Regards,
> > > Tariq
> > > https://mtariq.jux.com/
> > > cloudfront.blogspot.com
> > >
> > >
> > > On Thu, Feb 7, 2013 at 5:25 PM, Damien Hardy <dhardy@viadeoteam.com>
> > > wrote:
> > >
> > > > Hello,
> > > > Why not use a Pig script for that?
> > > > Make the JSON file available on HDFS.
> > > > Load with
> > > > http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/JsonLoader.html
> > > > Store with
> > > > http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
> > > >
> > > > http://pig.apache.org/docs/r0.10.0/
> > > >
> > > > Cheers,
> > > >
> > > > --
> > > > Damien
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
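To make the HBase write concrete (the "store all the values into HBase"
step above), here is roughly the reducer I am planning. This is a sketch
against the 0.94-era HBase client API; the table name "events", the column
family "data", and the qualifier prefix "json" are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class HBaseLoadReducer
            extends TableReducer<Text, Text, ImmutableBytesWritable> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // One row per key; each value goes into its own placeholder
            // qualifier so nothing is overwritten within the row.
            Put put = new Put(Bytes.toBytes(key.toString()));
            int i = 0;
            for (Text v : values) {
                put.add(Bytes.toBytes("data"), Bytes.toBytes("json" + i++),
                        Bytes.toBytes(v.toString()));
            }
            ctx.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    // Driver wiring (this also sets TableOutputFormat and pulls in the
    // HBase configuration):
    //   org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
    //       .initTableReducerJob("events", HBaseLoadReducer.class, job);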
--
Regards,
Ouch Whisper
010101010101