hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: Scan vs Put vs Get
Date Thu, 28 Jun 2012 04:56:38 GMT
     How many Gets do you batch together in one call? Is this equal to the Scan#setCaching()
value you are using?
If both are the same, you can be sure the number of network calls is almost the same.

Also, you are giving random keys in the Gets, while the scan is always sequential. It seems
your Get scenario performs very random reads, resulting in too many reads of HFile blocks
from HDFS. [Is block caching enabled?]
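Not a suggestion from the thread itself, but one common mitigation for random multi-gets is to sort the keys before building the batch, so the reads walk regions and HFile blocks in order and block-cache hits become more likely. A minimal plain-Java sketch of the comparison (unsigned lexicographic order, the same ordering HBase uses for row keys; the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SortedKeys {
    // Compare two byte arrays as unsigned bytes, lexicographically --
    // the same ordering HBase applies to row keys.
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Generate a few random 24-byte keys, like the benchmark does.
        List<byte[]> keys = new ArrayList<>();
        Random rng = new Random(42);
        for (int i = 0; i < 5; i++) {
            byte[] k = new byte[24];
            rng.nextBytes(k);
            keys.add(k);
        }
        // Sort before building the Gets, so the batch reads in key order.
        keys.sort(SortedKeys::compareRows);
        for (byte[] k : keys) {
            System.out.println(Arrays.toString(k));
        }
    }
}
```

Sorting does not reduce the number of HFile blocks a truly uniform random workload must touch, but it keeps the per-regionserver sub-batches ordered, which tends to help cache locality.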

Also, have you tried using Bloom filters? ROW blooms might improve your Get performance.
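For reference, a ROW bloom filter can be enabled per column family from the HBase shell (a sketch; 'usertable' and 'cf' are placeholder names, and on older HBase versions the table may need to be disabled before the alter):

```
hbase> alter 'usertable', {NAME => 'cf', BLOOMFILTER => 'ROW'}
```

With a ROW bloom, a Get for a key that is absent from a given HFile can skip that file entirely instead of reading a block from it, which is exactly the random-read pattern described above.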

From: Jean-Marc Spaggiari [jean-marc@spaggiari.org]
Sent: Thursday, June 28, 2012 5:04 AM
To: user
Subject: Scan vs Put vs Get

I have a small piece of code, for testing, which puts 1M lines
into an existing table, gets 3000 lines and scans 10000.

The table has one family and one column.

Everything is done randomly: Puts with a random key (24 bytes), fixed
family and column names, and random content (24 bytes).

Gets (batched) are done with random keys, and the scan uses RandomRowFilter.

And here are the results.
Time to insert 1000000 lines: 43 seconds (23255 lines/second)
That's fine for my needs, given the poor performance of the
servers in the cluster. I'm happy with this result.

Time to read 3000 lines: 11444.0 mseconds (262 lines/second)
This is way too low, and I don't understand why. So I tried the random
scan, since I couldn't figure out the issue.

Time to read 10000 lines: 108.0 mseconds (92593 lines/second)
This is impressive! I added this test after I failed with the Gets. I
went from 262 lines per second to almost 100K lines/second!

However, I'm still wondering what's wrong with my gets.

The code is very simple: I build Get objects and execute them in a
batch. I tried adding a filter, but it doesn't help. Here is an
extract of the code.

                        List<Row> gets = new ArrayList<Row>();
                        for (long l = 0; l < linesToRead; l++) {
                                byte[] array1 = new byte[24];
                                for (int i = 0; i < array1.length; i++)
                                        array1[i] = (byte) Math.floor(Math.random() * 256);
                                gets.add(new Get(array1));
                        }
                        System.out.println(new java.util.Date() + " \"gets\" created.");

                        Object[] results = new Object[gets.size()];
                        long timeBefore = System.currentTimeMillis();
                        table.batch(gets, results);
                        long timeAfter = System.currentTimeMillis();

                        float duration = timeAfter - timeBefore;
                        System.out.println("Time to read " + gets.size() + " lines : "
                                + duration + " mseconds (" + Math.round((float) linesToRead
                                / (duration / 1000)) + " lines/seconds)");

What's wrong with it? I can't use setBatch or setCaching, because this
is not a scan. I tried different numbers of Gets, but the speed is
almost always the same. Am I using the API the wrong way? Does anyone
have any advice to improve this?

