hbase-user mailing list archives

From Dru Jensen <drujen...@gmail.com>
Subject Re: newbie - map reduce not distributing
Date Mon, 04 Aug 2008 16:35:19 GMT
Thanks St.Ack and J-D for your help on this.

On Aug 2, 2008, at 1:08 PM, stack wrote:

> Thank you for persevering, Dru.
>
> Indeed, a bug in getStartKeys will make us process all tables that  
> have a column family name in common.
>
> Here's the code:
>
> public byte[][] getStartKeys() throws IOException {
>   final List<byte[]> keyList = new ArrayList<byte[]>();
>
>   MetaScannerVisitor visitor = new MetaScannerVisitor() {
>     public boolean processRow(RowResult rowResult) throws IOException {
>       HRegionInfo info = Writables.getHRegionInfo(
>           rowResult.get(HConstants.COL_REGIONINFO));
>
>       if (!(info.isOffline() || info.isSplit())) {
>         keyList.add(info.getStartKey());
>       }
>       return true;
>     }
>   };
>   MetaScanner.metaScan(configuration, visitor, this.tableName);
>   return keyList.toArray(new byte[keyList.size()][]);
> }
>
> The above Visitor is visiting the meta table.  It's checking the
> column family name.  Any region that is not offline or split gets
> added to the list of regions.  It's not checking that the region
> belongs to the wanted table.
>
> Would suggest adding something like:
>
>       if (!Bytes.equals(rowResult.getRow(), getTableName())) {
>         return true;
>       }
>
> before the column check.
>
> For bonus points, you might want to make it so we stop scanning  
> after we've passed out all regions for the wanted table.
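>
> An untested sketch of both changes together (this assumes
> HRegionInfo.getTableDesc().getName() returns the table name bytes in
> this version):
>
>   MetaScannerVisitor visitor = new MetaScannerVisitor() {
>     private boolean seenTable = false;
>     public boolean processRow(RowResult rowResult) throws IOException {
>       HRegionInfo info = Writables.getHRegionInfo(
>           rowResult.get(HConstants.COL_REGIONINFO));
>       // Skip regions that belong to some other table.
>       if (!Bytes.equals(info.getTableDesc().getName(), getTableName())) {
>         // Meta rows are sorted, so once we have seen our table a
>         // non-matching row means we are past it: stop the scan.
>         return !seenTable;
>       }
>       seenTable = true;
>       if (!(info.isOffline() || info.isSplit())) {
>         keyList.add(info.getStartKey());
>       }
>       return true;
>     }
>   };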
>
> If the above change fixes things, mind making an issue and attaching
> a patch?  Would be nice if we could get it into 0.2.0.
>
> Thanks for being persistent.
>
> St.Ack
>
> Dru Jensen wrote:
>> J-D,
>>
>> I found what is causing the same rows to be sent to multiple map
>> tasks.  If another table has the same column family name, the Test
>> will send the same rows to multiple map tasks.
>>
>> I'm attaching the DEBUG logs and the test class.
>>
>> thanks,
>> Dru
>>
>>
>>
>> Steps to duplicate:
>>
>> 1. "hadoop dfs -rmr /hbase" and started hbase with no tables.
>>
>> 2. Launch the hbase shell and create a table:
>> create 'test', 'content'
>> put 'test','test','content:test','testing'
>> put 'test','test2','content:test','testing2'
>>
>> 3. Run the Test MapReduce job; only one map task is launched.  This
>> is correct.
>> Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
>> 08/08/01 10:23:07 INFO mapred.JobClient: Running job: job_200808010944_0006
>> 08/08/01 10:23:08 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/08/01 10:23:12 INFO mapred.JobClient:  map 100% reduce 0%
>> 08/08/01 10:23:14 INFO mapred.JobClient:  map 100% reduce 100%
>> 08/08/01 10:23:15 INFO mapred.JobClient: Job complete: job_200808010944_0006
>> 08/08/01 10:23:15 INFO mapred.JobClient: Counters: 13
>> 08/08/01 10:23:15 INFO mapred.JobClient:   Job Counters
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Launched map tasks=1
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Launched reduce tasks=1
>> 08/08/01 10:23:15 INFO mapred.JobClient:   Map-Reduce Framework
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Map input records=2
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Map output records=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Map input bytes=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Map output bytes=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Combine input records=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Combine output records=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Reduce input groups=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Reduce input records=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Reduce output records=0
>> 08/08/01 10:23:15 INFO mapred.JobClient:   File Systems
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Local bytes read=136
>> 08/08/01 10:23:15 INFO mapred.JobClient:     Local bytes written=280
>>
>> 4. Create another table with a 'content' column family:
>> create 'weird', 'content'
>>
>> 5. Run the same test.  Two map tasks are launched even though the
>> column family is in a different table.
>> Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
>> 08/08/01 10:24:15 INFO mapred.JobClient: Running job: job_200808010944_0007
>> 08/08/01 10:24:16 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/08/01 10:24:19 INFO mapred.JobClient:  map 50% reduce 0%
>> 08/08/01 10:24:21 INFO mapred.JobClient:  map 100% reduce 0%
>> 08/08/01 10:24:26 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/08/01 10:24:27 INFO mapred.JobClient:  map 100% reduce 100%
>> 08/08/01 10:24:28 INFO mapred.JobClient: Job complete: job_200808010944_0007
>> 08/08/01 10:24:28 INFO mapred.JobClient: Counters: 13
>> 08/08/01 10:24:28 INFO mapred.JobClient:   Job Counters
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Launched map tasks=2
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Launched reduce tasks=1
>> 08/08/01 10:24:28 INFO mapred.JobClient:   Map-Reduce Framework
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Map input records=4
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Map output records=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Map input bytes=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Map output bytes=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Combine input records=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Combine output records=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Reduce input groups=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Reduce input records=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Reduce output records=0
>> 08/08/01 10:24:28 INFO mapred.JobClient:   File Systems
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Local bytes read=136
>> 08/08/01 10:24:28 INFO mapred.JobClient:     Local bytes written=424
>>
>> 6. Create another table that doesn't have a 'content' column family:
>> create 'notweird', 'notcontent'
>>
>> 7. Run the test.  Still only two map tasks.
>> Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
>> 08/08/01 10:24:57 INFO mapred.JobClient: Running job: job_200808010944_0008
>> 08/08/01 10:24:58 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/08/01 10:25:02 INFO mapred.JobClient:  map 50% reduce 0%
>> 08/08/01 10:25:03 INFO mapred.JobClient:  map 100% reduce 0%
>> 08/08/01 10:25:09 INFO mapred.JobClient:  map 100% reduce 100%
>> 08/08/01 10:25:10 INFO mapred.JobClient: Job complete: job_200808010944_0008
>> 08/08/01 10:25:10 INFO mapred.JobClient: Counters: 13
>> 08/08/01 10:25:10 INFO mapred.JobClient:   Job Counters
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Launched map tasks=2
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Launched reduce tasks=1
>> 08/08/01 10:25:10 INFO mapred.JobClient:   Map-Reduce Framework
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Map input records=4
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Map output records=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Map input bytes=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Map output bytes=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Combine input records=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Combine output records=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Reduce input groups=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Reduce input records=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Reduce output records=0
>> 08/08/01 10:25:10 INFO mapred.JobClient:   File Systems
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Local bytes read=136
>> 08/08/01 10:25:10 INFO mapred.JobClient:     Local bytes written=424
>>
>> 8. Create another table with two families, one called 'content':
>> create 'weirdest', 'content','notcontent'
>>
>> 9. Run the test.  Three map tasks are launched.
>> Jensens-MacBook-2:hadoop drujensen$ ./test.sh Test
>> 08/08/01 10:25:41 INFO mapred.JobClient: Running job: job_200808010944_0009
>> 08/08/01 10:25:42 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/08/01 10:25:45 INFO mapred.JobClient:  map 33% reduce 0%
>> 08/08/01 10:25:46 INFO mapred.JobClient:  map 100% reduce 0%
>> 08/08/01 10:25:50 INFO mapred.JobClient:  map 100% reduce 100%
>> 08/08/01 10:25:51 INFO mapred.JobClient: Job complete: job_200808010944_0009
>> 08/08/01 10:25:51 INFO mapred.JobClient: Counters: 13
>> 08/08/01 10:25:51 INFO mapred.JobClient:   Job Counters
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Launched map tasks=3
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Launched reduce tasks=1
>> 08/08/01 10:25:51 INFO mapred.JobClient:   Map-Reduce Framework
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Map input records=6
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Map output records=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Map input bytes=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Map output bytes=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Combine input records=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Combine output records=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Reduce input groups=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Reduce input records=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Reduce output records=0
>> 08/08/01 10:25:51 INFO mapred.JobClient:   File Systems
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Local bytes read=136
>> 08/08/01 10:25:51 INFO mapred.JobClient:     Local bytes written=568
>>
>>
>> On Jul 31, 2008, at 5:42 PM, Jean-Daniel Cryans wrote:
>>
>>> Dru,
>>>
>>> There is something truly weird with your setup. I would advise
>>> running your code (the simple one that only logs the rows) with DEBUG
>>> on. See the FAQ <http://wiki.apache.org/hadoop/Hbase/FAQ#5> on how to
>>> do it. Then get back with syslog and stdout. This way we will have
>>> more information on how the scanners are handling this.
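>>>
>>> Something along these lines in conf/log4j.properties should turn
>>> DEBUG on (the FAQ has the exact steps):
>>>
>>>   log4j.logger.org.apache.hadoop.hbase=DEBUG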
>>>
>>> Also FYI, I ran the same code as yours with 0.2.0 on my setup and  
>>> had no
>>> problems.
>>>
>>> J-D
>>>
>>> On Thu, Jul 31, 2008 at 7:06 PM, Dru Jensen <drujensen@gmail.com> wrote:
>>>
>>>> UPDATE:  I modified the RowCounter example and verified that it  
>>>> is sending
>>>> the same row to multiple map tasks also. Is this a known bug or  
>>>> am I doing
>>>> something truly as(s)inine?  Any help is appreciated.
>>>>
>>>>
>>>> On Jul 30, 2008, at 3:02 PM, Dru Jensen wrote:
>>>>
>>>>> J-D,
>>>>>
>>>>> Again, thank you for your help on this.
>>>>>
>>>>> Hitting the HBase master UI on port 60010:
>>>>> System 1 - 2 regions
>>>>> System 2 - 1 region
>>>>> System 3 - 3 regions
>>>>>
>>>>> In order to demonstrate the behavior I'm seeing, I wrote a test  
>>>>> class.
>>>>>
>>>>> import java.io.IOException;
>>>>> import java.util.Iterator;
>>>>>
>>>>> import org.apache.hadoop.conf.Configuration;
>>>>> import org.apache.hadoop.conf.Configured;
>>>>> import org.apache.hadoop.hbase.HStoreKey;
>>>>> import org.apache.hadoop.hbase.io.HbaseMapWritable;
>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>>> import org.apache.hadoop.hbase.io.RowResult;
>>>>> import org.apache.hadoop.hbase.mapred.TableMap;
>>>>> import org.apache.hadoop.hbase.mapred.TableReduce;
>>>>> import org.apache.hadoop.io.WritableComparable;
>>>>> import org.apache.hadoop.mapred.JobClient;
>>>>> import org.apache.hadoop.mapred.JobConf;
>>>>> import org.apache.hadoop.mapred.OutputCollector;
>>>>> import org.apache.hadoop.mapred.Reporter;
>>>>> import org.apache.hadoop.util.Tool;
>>>>> import org.apache.hadoop.util.ToolRunner;
>>>>>
>>>>> public class Test extends Configured implements Tool {
>>>>>
>>>>>   public static class Map extends TableMap {
>>>>>
>>>>>     @Override
>>>>>     public void map(ImmutableBytesWritable key, RowResult row,
>>>>>         OutputCollector output, Reporter r) throws IOException {
>>>>>       // Log every row this map task sees so duplicates show up
>>>>>       // in the task logs.
>>>>>       String key_str = new String(key.get());
>>>>>       System.out.println("map: key = " + key_str);
>>>>>     }
>>>>>   }
>>>>>
>>>>>   // Must be static, or the framework cannot instantiate it.
>>>>>   public static class Reduce extends TableReduce {
>>>>>
>>>>>     @Override
>>>>>     public void reduce(WritableComparable key, Iterator values,
>>>>>         OutputCollector output, Reporter r) throws IOException {
>>>>>       // Intentionally empty; only the map-side logging matters here.
>>>>>     }
>>>>>   }
>>>>>
>>>>>   public int run(String[] args) throws Exception {
>>>>>     JobConf job = new JobConf(getConf(), Test.class);
>>>>>     job.setJobName("Test");
>>>>>
>>>>>     job.setNumMapTasks(4);
>>>>>     job.setNumReduceTasks(1);
>>>>>
>>>>>     // Read the 'content:' family of table 'test' and write back to it.
>>>>>     Map.initJob("test", "content:", Map.class, HStoreKey.class,
>>>>>         HbaseMapWritable.class, job);
>>>>>     Reduce.initJob("test", Reduce.class, job);
>>>>>
>>>>>     JobClient.runJob(job);
>>>>>     return 0;
>>>>>   }
>>>>>
>>>>>   public static void main(String[] args) throws Exception {
>>>>>     int res = ToolRunner.run(new Configuration(), new Test(), args);
>>>>>     System.exit(res);
>>>>>   }
>>>>> }
>>>>>
>>>>> In hbase shell:
>>>>> create 'test','content'
>>>>> put 'test','test','content:test','testing'
>>>>> put 'test','test2','content:test','testing2'
>>>>>
>>>>>
>>>>> The Hadoop log results:
>>>>> Task Logs: 'task_200807301447_0001_m_000000_0'
>>>>>
>>>>>
>>>>>
>>>>> stdout logs
>>>>> map: key = test
>>>>> map: key = test2
>>>>>
>>>>>
>>>>> stderr logs
>>>>>
>>>>>
>>>>> syslog logs
>>>>> 2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
>>>>> 2008-07-30 14:51:16,507 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
>>>>> 2008-07-30 14:51:17,120 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000000_0' done.
>>>>>
>>>>> Task Logs: 'task_200807301447_0001_m_000001_0'
>>>>>
>>>>>
>>>>>
>>>>> stdout logs
>>>>> map: key = test
>>>>> map: key = test2
>>>>>
>>>>>
>>>>> stderr logs
>>>>>
>>>>>
>>>>> syslog logs
>>>>> 2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
>>>>> 2008-07-30 14:51:16,509 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
>>>>> 2008-07-30 14:51:17,118 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000001_0' done.
>>>>>
>>>>> Tasks 3 and 4 are the same.
>>>>>
>>>>> Each map task is seeing the same rows.  Any help to prevent this is
>>>>> appreciated.
>>>>>
>>>>> Thanks,
>>>>> Dru
>>>>>
>>>>>
>>>>> On Jul 30, 2008, at 2:22 PM, Jean-Daniel Cryans wrote:
>>>>>
>>>>>> Dru,
>>>>>>
>>>>>> It is not supposed to process the same rows many times. Can I see
>>>>>> the log you're talking about? Also, how many regions do you have in
>>>>>> your table? (Info available in the web UI.)
>>>>>>
>>>>>> thx
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Jul 30, 2008 at 5:04 PM, Dru Jensen <drujensen@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> J-D,
>>>>>>>
>>>>>>> Thanks for your quick response.  I have 4 mapping processes
>>>>>>> running on 3 systems.
>>>>>>>
>>>>>>> Are the same rows being processed 4 times by each mapping  
>>>>>>> processor?
>>>>>>> According to the logs they are.
>>>>>>>
>>>>>>> When I run a map/reduce against a file, only one row gets  
>>>>>>> logged per
>>>>>>> mapper.  Why would this be different for hbase tables?
>>>>>>>
>>>>>>> I would think only one mapping process would process that one row,
>>>>>>> and it would only show up once in only one log.  Preferably it
>>>>>>> would be the same system that has the region.
>>>>>>>
>>>>>>> I only want each row to be processed once.  Is there any way to
>>>>>>> change this behavior without running only 1 mapper?
>>>>>>>
>>>>>>> thanks,
>>>>>>> Dru
>>>>>>>
>>>>>>>
>>>>>>> On Jul 30, 2008, at 1:44 PM, Jean-Daniel Cryans wrote:
>>>>>>>
>>>>>>>> Dru,
>>>>>>>>
>>>>>>>> The regions will split when they reach a certain threshold, so if
>>>>>>>> you want your computation to be distributed, you will have to
>>>>>>>> have more data.
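>>>>>>>>
>>>>>>>> Map tasks track regions: TableInputFormat hands out one split per
>>>>>>>> region, roughly like this (a simplified, untested sketch, not the
>>>>>>>> exact 0.2 code):
>>>>>>>>
>>>>>>>> public InputSplit[] getSplits(JobConf job, int numSplits)
>>>>>>>>     throws IOException {
>>>>>>>>   // One TableSplit per region start key, so the map-task count
>>>>>>>>   // follows the region count.
>>>>>>>>   byte[][] startKeys = this.table.getStartKeys();
>>>>>>>>   InputSplit[] splits = new InputSplit[startKeys.length];
>>>>>>>>   for (int i = 0; i < startKeys.length; i++) {
>>>>>>>>     byte[] endKey = ((i + 1) < startKeys.length) ?
>>>>>>>>         startKeys[i + 1] : HConstants.EMPTY_START_ROW;
>>>>>>>>     splits[i] = new TableSplit(this.table.getTableName(),
>>>>>>>>         startKeys[i], endKey);
>>>>>>>>   }
>>>>>>>>   return splits;
>>>>>>>> }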
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Wed, Jul 30, 2008 at 4:36 PM, Dru Jensen <drujensen@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I created a map/reduce process by extending the TableMap and
>>>>>>>>> TableReduce API, but for some reason when I run multiple mappers,
>>>>>>>>> the logs show that the same rows are being processed by each
>>>>>>>>> Mapper.
>>>>>>>>>
>>>>>>>>> When I say logs, I mean in the hadoop task tracker  
>>>>>>>>> (localhost:50030)
>>>>>>>>> and
>>>>>>>>> drilling down into the logs.
>>>>>>>>>
>>>>>>>>> Do I need to manually perform a TableSplit or is this supposed
>>>>>>>>> to be done automatically?
>>>>>>>>>
>>>>>>>>> If it's something I need to do manually, can someone point me to
>>>>>>>>> some sample code?
>>>>>>>>>
>>>>>>>>> If it's supposed to be automatic and each mapper was supposed to
>>>>>>>>> get its own set of rows, should I write up a bug for this?  I'm
>>>>>>>>> using trunk 0.2.0 on hadoop trunk 0.17.2.
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>> Dru
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>

