From: John Lilley <john.lilley@redpoint.net>
To: user@hadoop.apache.org
Subject: RE: How to best decide mapper output/reducer input for a huge string?
Date: Sat, 21 Sep 2013 23:08:18 +0000

Pavan,

How large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)?

John

From: Pavan Sudheendra [mailto:pavan0591@gmail.com]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed.

Although there's no single obvious bottleneck, the time it takes to process the entire table is huge, so I have to keep checking whether I can optimize it somehow.

Oh okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of rework? Thank you so much for helping.

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get much better performance if you write a custom Writable for your key object using the appropriate data types; for example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster. If HouseHoldId is an integer, your comparisons during sorting will also be faster.

Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?
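To make the custom key concrete, here is an untested sketch. I'm assuming HouseHoldId fits in an int and the timestamp in a long, and the class and field names are made up; adjust them to your real schema:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class HouseHoldKey implements WritableComparable<HouseHoldKey> {
    private int houseHoldId;   // assuming it fits in an int
    private long timestamp;

    public HouseHoldKey() { }  // Hadoop requires the no-arg constructor

    public HouseHoldKey(int houseHoldId, long timestamp) {
        this.houseHoldId = houseHoldId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(houseHoldId);   // fixed-width binary instead of a String
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        houseHoldId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(HouseHoldKey o) {   // cheap numeric comparison for the sort
        int cmp = Integer.compare(houseHoldId, o.houseHoldId);
        return cmp != 0 ? cmp : Long.compare(timestamp, o.timestamp);
    }

    @Override
    public int hashCode() {    // the default HashPartitioner uses this, so hashing
        return houseHoldId;    // on HouseHoldId alone keeps a household in one partition
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof HouseHoldKey)) return false;
        HouseHoldKey k = (HouseHoldKey) obj;
        return houseHoldId == k.houseHoldId && timestamp == k.timestamp;
    }
}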
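To turn on map-output compression, the driver settings look roughly like this (untested; the property names shown are the Hadoop 2.x ones, the 1.x equivalents are in the comments, and SnappyCodec needs the native libraries installed, otherwise DefaultCodec works):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

// In the driver, before submitting ("job" is your Job instance):
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.output.compress", true);   // "mapred.compress.map.output" on 1.x
conf.setClass("mapreduce.map.output.compress.codec",      // "mapred.map.output.compression.codec" on 1.x
              SnappyCodec.class, CompressionCodec.class);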
On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <pavan0591@gmail.com> wrote:

Hi Pradeep,

Yes, basically I'm only writing the key part as the map output; the V of the <K,V> pair is not of much use to me, but I'm hoping to change that if it leads to faster execution. I'm kind of a newbie, so I'm looking to make the map/reduce job run a lot faster.

Also, yes, it gets sorted by the HouseHoldId, which is what I needed. But it seems that when I write a map output for each and every row of a 19 million row HBase table, it takes nearly a day to complete (21 mappers and 21 reducers).

I have looked at both Pig and Hive to do the job, but I'm supposed to do this via an MR job, so I cannot use either of those. Do you recommend I try something else given that I have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

I'm sorry, but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId, since it occurs first in your key.

The SortComparator tells Hadoop how to sort your data, so you use it when you need a non-lexical sort order. The GroupingComparator tells Hadoop how to group your data for the reducer; all KV-pairs from the same group are given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId, then you will need to write a GroupingComparator.
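With your current Text keys, a grouping comparator might look roughly like this (an untested sketch; the class name is made up, and you would also want a Partitioner that partitions on HouseHoldId alone so that all records for a household actually reach the same reducer):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class HouseHoldGroupingComparator extends WritableComparator {

    public HouseHoldGroupingComparator() {
        super(Text.class, true);   // true = instantiate keys so compare() gets real Text objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Compare only the first space-delimited token (HouseHoldId),
        // so all rows for a household land in one reduce() call.
        String idA = a.toString().split(" ", 2)[0];
        String idB = b.toString().split(" ", 2)[0];
        return idA.compareTo(idB);
    }
}

// In the driver:
// job.setGroupingComparatorClass(HouseHoldGroupingComparator.class);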
Also, have you considered using a higher-level abstraction on Hadoop such as Pig, Hive, Cascading, etc.? The sorting/grouping types of tasks are a LOT easier to write in those languages.

Hope this helps!

- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:

I need to improve my MR jobs, which use HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper, writing it out as one huge string for the reducer to do some computation on and dump into an HBase table.

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this:

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting on the basis of the HouseHoldId value, so I'm using this technique. I'm not interested in the V part of the pair, so I'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

My MR job takes 22 hours to complete, which is not desirable at all. I'm supposed to optimize it somehow to run a lot faster.

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
    Table1,               // input HBase table name
    scan,
    AnalyzeMapper.class,  // mapper
    Text.class,           // mapper output key
    IntWritable.class,    // mapper output value
    job);
TableMapReduceUtil.initTableReducerJob(
    OutputTable,                // output table
    AnalyzeReducerTable.class,  // reducer class
    job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.

Should I use a custom SortComparator or a GroupingComparator?

--
Regards-
Pavan