hbase-user mailing list archives

From Ashwanth Kumar <ashwanthku...@googlemail.com>
Subject Re: Input split for a HBase of 80,000 rows?
Date Mon, 26 Aug 2013 09:25:45 GMT
Just click on "Split" and that should be fine. It picks a key in the middle of
each region and splits the region in two, so splits go 1 -> 2 -> 4 -> 8
regions and so on. The number of regions for a table is something you should
be able to come up with given the number of region servers and the size of the
data you expect to store in the table.
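To make the "key in the middle" idea concrete, here is a small illustrative sketch that averages a region's start and end row keys as big-endian unsigned integers. This is only an approximation for intuition; HBase itself normally derives the split point from the middle key of the region's largest store file, so the actual key it picks will differ.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Illustrative sketch only: pick a key "in the middle" of a region by
// treating the start and end row keys as big-endian unsigned integers
// and taking their average. Not how HBase actually computes it.
public class SplitKeyDemo {
    static byte[] midpoint(byte[] lo, byte[] hi) {
        int len = Math.max(lo.length, hi.length);
        byte[] a = Arrays.copyOf(lo, len);   // right-pad shorter key with 0x00
        byte[] b = Arrays.copyOf(hi, len);
        BigInteger mid = new BigInteger(1, a)
                .add(new BigInteger(1, b))
                .shiftRight(1);              // (lo + hi) / 2
        byte[] raw = mid.toByteArray();
        byte[] out = new byte[len];          // left-pad/trim back to key length
        int n = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - n, out, len - n, n);
        return out;
    }

    public static void main(String[] args) {
        byte[] mid = midpoint("a".getBytes(), "z".getBytes());
        System.out.println(new String(mid)); // "m", halfway between 'a' and 'z'
    }
}
```

Applying this once per region is what produces the 1 -> 2 -> 4 -> 8 progression described above.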

A bigger caching value typically means more data held in memory by the
mapper task. I guess as long as you have enough memory to hold that data,
you are fine. Maybe other experts can help me here.
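For a rough sense of scale, the job counters quoted later in this thread show 45,569 map output records totalling 4,682,237 bytes, i.e. roughly 100 bytes per row. The back-of-the-envelope arithmetic below (a sketch, not a precise accounting of client buffers) estimates the memory one cached batch of 1500 rows would take:

```java
// Back-of-the-envelope estimate of client-side memory per cached batch:
// caching * average row size. Row and byte counts come from the job
// counters quoted later in this thread; this ignores per-cell overhead.
public class CachingMemory {
    public static void main(String[] args) {
        long rows = 45_569L;             // Map output records
        long bytes = 4_682_237L;         // Map output bytes
        long avgRowBytes = bytes / rows; // ~102 bytes per row
        long caching = 1500;             // the setCaching value in this thread
        System.out.println(caching * avgRowBytes + " bytes per batch");
        // ~153,000 bytes, i.e. roughly 150 KB, so for rows this small
        // memory is unlikely to be the limiting factor on caching.
    }
}
```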

- A split on the table gives you parallelism, since each region is typically
processed by a separate mapper.
- The right split plus decent caching can give you the best performance on
full-table scan jobs. As I already said, beware of the ScannerTimeoutException
that can arise from very high caching values. You might want to increase the
scanner timeout value in that case.
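The timeout trade-off above can be sketched numerically: the client must come back for the next batch before the scanner lease expires, so the mapper's time to process one batch of `caching` rows must stay under the timeout. The 60 s timeout and 20 ms/row cost below are illustrative assumptions, not values from this thread:

```java
// Sketch of the caching-vs-timeout trade-off: the next scanner RPC must
// happen before the lease expires, so (caching * per-row processing time)
// must stay below the scanner timeout. The 60 s timeout and 20 ms/row
// figures are assumptions for illustration.
public class CachingBudget {
    static long maxSafeCaching(long timeoutMs, double msPerRow) {
        return (long) Math.floor(timeoutMs / msPerRow);
    }

    public static void main(String[] args) {
        System.out.println(maxSafeCaching(60_000, 20.0)); // 3000 rows
    }
}
```

Under these assumed numbers a caching value much above 3000 would risk the lease expiring mid-batch; raising the timeout shifts that ceiling upward.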



On Mon, Aug 26, 2013 at 2:42 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:

> Hi Ashwanth, thanks for the reply..
>
> I went to the HBase Web UI and saw that my table had 1 online region.
> Can you please guide me on how to split this table? I see the
> UI asking for a region key and a split button... How many splits can I make
> exactly? Can I give two different 'keys' and assume that the table is then
> split into 3: one from the beginning to key1, key1 to key2, and key2 to the rest?
>
>
> On Mon, Aug 26, 2013 at 2:36 PM, Ashwanth Kumar <
> ashwanthkumar@googlemail.com> wrote:
>
>> setCaching sets the value via the API; the other way is to set it in the
>> job configuration using the key "hbase.client.scanner.caching".
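A minimal sketch of that configuration-based approach. Only the property key comes from the message; the surrounding job setup is assumed, typical HBase-on-Hadoop-1.x boilerplate, not part of the original thread:

```java
// Hedged sketch: set scanner caching on the job's Hadoop Configuration
// instead of per-Scan. Only the "hbase.client.scanner.caching" key is
// from the message above; the rest is assumed job-setup boilerplate.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.client.scanner.caching", "1500");
Job job = new Job(conf, "full-table-scan");
```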
>>
>> I just realized that, given you have just 1 region, caching wouldn't help
>> much in reducing the time. Splitting might be the ideal solution. Based on
>> the heap space of every mapper task, try playing with that 1500 value.
>>
>> A word of caution: if you increase it too much, you might see
>> ScannerTimeoutException in your TT logs.
>>
>>
On Mon, Aug 26, 2013 at 2:29 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:
>>
>>> Hi Ashwanth,
>>> My caching is set to 1500 ..
>>>
>>> scan.setCaching(1500);
>>> scan.setCacheBlocks(false);
>>>
>>> Can I set the number of splits via an API?
>>>
>>>
>>> On Mon, Aug 26, 2013 at 2:22 PM, Ashwanth Kumar <
>>> ashwanthkumar@googlemail.com> wrote:
>>>
>>>> To answer your question: go to the HBase Web UI, where you can
>>>> initiate a manual split on the table.
>>>>
>>>> But before you do that, maybe you can try increasing your client
>>>> caching value (hbase.client.scanner.caching) in your job.
>>>>
>>>>
>>>> On Mon, Aug 26, 2013 at 2:09 PM, Pavan Sudheendra <pavan0591@gmail.com> wrote:
>>>>
>>>> > What is the input split of the HBase Table in this job status?
>>>> >
>>>> > map() completion: 0.0
>>>> > reduce() completion: 0.0
>>>> > Counters: 24
>>>> >         File System Counters
>>>> >                 FILE: Number of bytes read=0
>>>> >                 FILE: Number of bytes written=216030
>>>> >                 FILE: Number of read operations=0
>>>> >                 FILE: Number of large read operations=0
>>>> >                 FILE: Number of write operations=0
>>>> >                 HDFS: Number of bytes read=116
>>>> >                 HDFS: Number of bytes written=0
>>>> >                 HDFS: Number of read operations=1
>>>> >                 HDFS: Number of large read operations=0
>>>> >                 HDFS: Number of write operations=0
>>>> >         Job Counters
>>>> >                 Launched map tasks=1
>>>> >                 Data-local map tasks=1
>>>> >                 Total time spent by all maps in occupied slots (ms)=3332
>>>> >         Map-Reduce Framework
>>>> >                 Map input records=45570
>>>> >                 Map output records=45569
>>>> >                 Map output bytes=4682237
>>>> >                 Input split bytes=116
>>>> >                 Combine input records=0
>>>> >                 Combine output records=0
>>>> >                 Spilled Records=0
>>>> >                 CPU time spent (ms)=1142950
>>>> >                 Physical memory (bytes) snapshot=475811840
>>>> >                 Virtual memory (bytes) snapshot=1262202880
>>>> >                 Total committed heap usage (bytes)=370343936
>>>> >
>>>> >
>>>> > My table has 80,000 rows..
>>>> > Is there any way to increase the number of input splits? It takes
>>>> > nearly 30 mins for the map tasks to complete, which is very unusual.
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Regards-
>>>> > Pavan
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ashwanth Kumar / ashwanthkumar.in
>>>>
>>>
>>>
>>>
>>> --
>>> Regards-
>>> Pavan
>>>
>>
>>
>>
>> --
>>
>> Ashwanth Kumar / ashwanthkumar.in
>>
>>
>
>
> --
> Regards-
> Pavan
>



-- 

Ashwanth Kumar / ashwanthkumar.in
