From: Li Li
To: dev@hbase.apache.org
Date: Tue, 22 Jul 2014 13:47:16 +0800
Subject: Re: how to do parallel scanning in map reduce using hbase as input?

sorry, I have not finished it.

1. yes, I have 20 concurrent running mappers.

2. I can't add more mappers: I set io.sort.mb to 500mb, and if I set 8
mappers per node it hits an OOM exception and the load average is high.

3. a fast mapper takes only 1 minute. the statistics follow.

fast mapper (1 minute):

HBase Counters
  REMOTE_RPC_CALLS              0
  RPC_CALLS                     523
  RPC_RETRIES                   0
  NOT_SERVING_REGION_EXCEPTION  0
  NUM_SCANNER_RESTARTS          0
  MILLIS_BETWEEN_NEXTS          62,415
  BYTES_IN_RESULTS              1,380,694,667
  BYTES_IN_REMOTE_RESULTS       0
  REGIONS_SCANNED               1
  REMOTE_RPC_RETRIES            0

FileSystemCounters
  FILE_BYTES_READ               120,508,552
  HDFS_BYTES_READ               176
  FILE_BYTES_WRITTEN            241,000,600

File Input Format Counters
  Bytes Read                    0

Map-Reduce Framework
  Map output materialized bytes        120,448,992
  Combine output records               0
  Map input records                    5,208,607
  Physical memory (bytes) snapshot     965,730,304
  Spilled Records                      10,417,214
  Map output bytes                     282,122,973
  CPU time spent (ms)                  82,610
  Total committed heap usage (bytes)   1,061,158,912
  Virtual memory (bytes) snapshot      1,681,047,552
  Combine input records                0
  Map output records                   5,208,607
  SPLIT_RAW_BYTES                      176

slow mapper (25 minutes):

HBase Counters
  REMOTE_RPC_CALLS              0
  RPC_CALLS                     2,268
  RPC_RETRIES                   0
  NOT_SERVING_REGION_EXCEPTION  0
  NUM_SCANNER_RESTARTS          0
  MILLIS_BETWEEN_NEXTS          907,402
  BYTES_IN_RESULTS              9,459,568,932
  BYTES_IN_REMOTE_RESULTS       0
  REGIONS_SCANNED               1
  REMOTE_RPC_RETRIES            0

FileSystemCounters
  FILE_BYTES_READ               2,274,832,004
  HDFS_BYTES_READ               161
  FILE_BYTES_WRITTEN            3,770,108,961

File Input Format Counters
  Bytes Read                    0

Map-Reduce Framework
  Map output materialized bytes        1,495,451,997
  Combine output records               0
  Map input records                    22,659,551
  Physical memory (bytes) snapshot     976,842,752
  Spilled Records                      57,085,847
  Map output bytes                     3,348,373,811
  CPU time spent (ms)                  1,134,640
  Total committed heap usage (bytes)   945,291,264
  Virtual memory (bytes) snapshot      1,699,991,552
  Combine input records                0
  Map output records                   22,644,687
  SPLIT_RAW_BYTES                      161

4. I have about 1.1 billion rows; the table takes 1.3TB of HDFS usage,
with a replication factor of 2.

5. block information. one column family's files:

  Name                              Type  Size       Replication  Block Size  Modification Time  Permission  Owner   Group
  b8297e0a415a4ddc811009e70aa30371  file  195.43 MB  2            64 MB       2014-07-22 10:16   rw-r--r--   hadoop  supergroup
  dea1d498ec6d46ea84ad35ea6cc3cf6e  file  5.12 GB    2            64 MB       2014-07-20 20:24   rw-r--r--   hadoop  supergroup
  ee01947bad6f450d89bd71be84d9d60a  file  2.68 MB    2            64 MB       2014-07-22 13:18   rw-r--r--   hadoop  supergroup

another example:

  1923bdcf47ed40879ec4a2f6d314167e  file  729.43 MB  2            64 MB       2014-07-18 20:32   rw-r--r--   hadoop  supergroup
  532d56af4457492194c5336f1f1d8359  file  372.27 MB  2            64 MB       2014-07-21 20:55   rw-r--r--   hadoop  supergroup
  55e92aef7b754059be9fc7e4692832ec  file  117.45 MB  2            64 MB       2014-07-22 13:19   rw-r--r--   hadoop  supergroup
  c927509f280a4cb3bc5c6db2feea5c16  file  7.87 GB    2            64 MB       2014-07-12 06:55   rw-r--r--   hadoop  supergroup

6. I have only one column family for this table.

7. each row has fewer than 10 columns.

8. region info from the web UI:

  ServerName                    Num. Stores  Num. Storefiles  Storefile Size  Uncompressed Storefile Size  Index Size  Bloom Size
  mphbase1,60020,1405730850512  46           103              126528m         126567mb                     94993k      329266k
  mphbase2,60020,1405730850549  45           100              157746m         157789mb                     117250k     432066k
  mphbase3,60020,1405730850546  46           46               53592m          53610mb                      42858k      110748k
  mphbase4,60020,1405730850585  43           101              109790m         109827mb                     83236k      295068k
  mphbase5,60020,1405730850652  41           81               89073m          89099mb                      66622k      243354k

9. url_db has 84 regions.

On Tue, Jul 22, 2014 at 1:32 PM, Li Li wrote:
> [...]
>
> On Tue, Jul 22, 2014 at 12:11 PM, Stack wrote:
>> How many regions now?
>>
>> You still have 20 concurrent mappers running? Are your machines loaded w/
>> 4 map tasks on each? Can you up the number of concurrent mappers? Can you
>> get an idea of your scan rates? Are all map tasks scanning at same rate?
>> Does one task lag the others? Do you emit stats on each map task such as
>> rows processed? Can you figure your bottleneck? Are you seeking disk all
>> the time? Anything else running while this big scan is going on? How big
>> are your cells? Do you have one or more column families? How many columns?
>>
>> For average region size, do du on the hdfs region directories and then sum
>> and divide by region count.
>>
>> St.Ack
>>
>> On Mon, Jul 21, 2014 at 7:30 PM, Li Li wrote:
>>
>>> anyone could help? now I have about 1.1 billion rows and it takes 2
>>> hours to finish a map reduce job.
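Stack's question about scan rates can be answered straight from the task counters quoted earlier in the thread: MILLIS_BETWEEN_NEXTS and "Map input records" together give a per-mapper throughput. A minimal sketch of the arithmetic, with the numbers copied from the fast and slow mappers above:

```python
# Per-mapper scan throughput, computed from the MILLIS_BETWEEN_NEXTS and
# "Map input records" counters reported in this thread.

def rows_per_second(rows, millis_between_nexts):
    """Rows scanned per second of scanner time."""
    return rows / (millis_between_nexts / 1000.0)

fast = rows_per_second(5_208_607, 62_415)    # fast mapper: roughly 83,000 rows/s
slow = rows_per_second(22_659_551, 907_402)  # slow mapper: roughly 25,000 rows/s

print(f"fast mapper: ~{fast:,.0f} rows/s")
print(f"slow mapper: ~{slow:,.0f} rows/s")
```

So the slow mapper is not only scanning a much bigger region (~22.6M rows vs ~5.2M), it is also scanning at roughly a third of the rate, which suggests both data skew across regions and uneven scan speed, not just one oversized region.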
>>>
>>> ---------- Forwarded message ----------
>>> From: Li Li
>>> Date: Thu, Jun 26, 2014 at 3:34 PM
>>> Subject: how to do parallel scanning in map reduce using hbase as input?
>>> To: user@hbase.apache.org
>>>
>>> my table has about 700 million rows and about 80 regions. each task
>>> tracker is configured to run 4 mappers and 4 reducers at a time, and
>>> the hadoop/hbase cluster has 5 nodes, so 20 mappers run concurrently.
>>> it takes more than an hour to finish the mapper stage, yet the hbase
>>> cluster's load is very low, about 2,000 requests per second. I think
>>> one mapper per region is too little parallelism. how can I run more
>>> than one mapper for a region, so the job takes full advantage of the
>>> available computing resources?
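On the closing question: the stock TableInputFormat hands out essentially one split per region, which is why the mapper count is capped at the region count. The usual workaround is a custom input format (for example a subclass of TableInputFormatBase overriding getSplits()) that cuts each region's [startKey, stopKey) interval into several scan ranges, one map task per range. The cut itself is plain byte-key interpolation. The sketch below is a standalone illustration of that interpolation, not HBase API code (split_key_range is a made-up helper; inside HBase the analogous utility is Bytes.split()):

```python
def split_key_range(start, stop, n, width=8):
    """Cut the row-key interval [start, stop) into n contiguous sub-intervals.

    Keys are right-padded to a fixed width and treated as big-endian
    integers; this is an illustration of the interpolation only, not the
    HBase implementation.
    """
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    hi = int.from_bytes(stop.ljust(width, b"\x00"), "big")
    bounds = [lo + (hi - lo) * i // n for i in range(n + 1)]
    keys = [b.to_bytes(width, "big") for b in bounds]
    return list(zip(keys, keys[1:]))

# e.g. turn one region's key range into 4 scan ranges -> 4 map tasks
ranges = split_key_range(b"a", b"b", 4)
assert len(ranges) == 4
# sub-ranges are contiguous: each stop key is the next range's start key
assert all(r_stop == next_start
           for (_, r_stop), (next_start, _) in zip(ranges, ranges[1:]))
```

With 84 regions and only 20 mapper slots this mainly pays off once the slot count exceeds the region count, but it also lets a single oversized region (like the 25-minute straggler above) be scanned by several tasks in parallel instead of gating the whole job.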