hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Scan vs Put vs Get
Date Thu, 28 Jun 2012 11:13:39 GMT
Wow. First, thanks a lot all for jumping into this.

Let me try to reply to everyone in a single post.

> How many Gets you batch together in one call
I tried with multiple different values from 10 to 3000 with similar results.
Time to read 10 lines : 181.0 mseconds (55 lines/seconds)
Time to read 100 lines : 484.0 mseconds (207 lines/seconds)
Time to read 1000 lines : 4739.0 mseconds (211 lines/seconds)
Time to read 3000 lines : 13582.0 mseconds (221 lines/seconds)

> Is this equal to the Scan#setCaching () that u are using?
The scan call is done after the get test, so I can't set the cache for
the scan before I do the gets. Also, I tried to run them separately (one
time only the put, one time only the get, etc.) and I did not find a
way to set up the cache for the get.

> If both are same u can be sure that the number of NW calls is coming almost same.
Here are the results for 10 000 gets and 10 000 scan.next(). Each time
I access the result to be sure they are sent to the client.
(gets) Time to read 10000 lines : 36620.0 mseconds (273 lines/seconds)
(scan) Time to read 10000 lines : 119.0 mseconds (84034 lines/seconds)
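
To put the two figures above side by side, here is a minimal sketch in plain Java with no HBase dependency. The class and method names are hypothetical and are not part of the test program; only the row count and the two durations come from the results above.

```java
// Hypothetical helper for comparing the two measurements; not part of the original test.
class ThroughputSketch {

    /** Rows per second, rounded the same way as the test program's output. */
    static long rowsPerSecond(long rows, double millis) {
        return Math.round(rows / (millis / 1000.0));
    }

    /** Average time spent per row, in milliseconds. */
    static double millisPerRow(long rows, double millis) {
        return millis / rows;
    }

    public static void main(String[] args) {
        long rows = 10000;
        double getMillis = 36620.0;  // batched gets, taken from the results above
        double scanMillis = 119.0;   // scan, taken from the results above

        System.out.println("gets: " + rowsPerSecond(rows, getMillis) + " rows/s, "
                + millisPerRow(rows, getMillis) + " ms/row");
        System.out.println("scan: " + rowsPerSecond(rows, scanMillis) + " rows/s, "
                + millisPerRow(rows, scanMillis) + " ms/row");
        System.out.println("ratio (gets time / scan time): "
                + Math.round(getMillis / scanMillis));
    }
}
```

Looking at the time per row rather than the rate makes it easier to see that each get costs a few milliseconds, while each scanned row costs a small fraction of one.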

>[Block caching is enabled?]
Good question. I don't know :( Is it enabled by default? How can I
verify or activate it?

> Also have you tried using Bloom filters?
Not yet. They are on page 381 in Lars' book and I'm only on page 168 ;)


> What's the hbase version you're using?
I manually installed 0.94.0. I can try with another version.

> Is it repeatable?
Yes. I tried many, many times, adding some options, closing some
processes on the server side, removing one datanode, adding one, etc. I
can see some small variations, but still in the same range. I was able
to move from 200 rows/second to 300 rows/second, but that's not
really a significant improvement. Also, here are the results for 7
iterations of the same code.

Time to read 1000 lines : 4171.0 mseconds (240 lines/seconds)
Time to read 1000 lines : 3439.0 mseconds (291 lines/seconds)
Time to read 1000 lines : 3953.0 mseconds (253 lines/seconds)
Time to read 1000 lines : 3801.0 mseconds (263 lines/seconds)
Time to read 1000 lines : 3680.0 mseconds (272 lines/seconds)
Time to read 1000 lines : 3493.0 mseconds (286 lines/seconds)
Time to read 1000 lines : 4549.0 mseconds (220 lines/seconds)
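
The spread across those 7 runs can be summarized with a few lines of plain Java. This is only a sketch: the class and method names are hypothetical, and the durations are simply copied from the lines above.

```java
import java.util.Arrays;

// Hypothetical helper summarizing the 7 measured durations; not part of the original test.
class IterationStatsSketch {

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    static double min(double[] values) {
        return Arrays.stream(values).min().orElse(Double.NaN);
    }

    static double max(double[] values) {
        return Arrays.stream(values).max().orElse(Double.NaN);
    }

    /** Overall rows per second when every run reads rowsPerRun rows. */
    static long overallRowsPerSecond(double[] millis, long rowsPerRun) {
        double totalSeconds = Arrays.stream(millis).sum() / 1000.0;
        return Math.round((rowsPerRun * millis.length) / totalSeconds);
    }

    public static void main(String[] args) {
        // Durations in milliseconds, copied from the 7 iterations above.
        double[] millis = {4171, 3439, 3953, 3801, 3680, 3493, 4549};

        System.out.println("mean ms : " + mean(millis));
        System.out.println("min ms  : " + min(millis));
        System.out.println("max ms  : " + max(millis));
        System.out.println("overall : " + overallRowsPerSecond(millis, 1000) + " rows/s");
    }
}
```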

>If the locations are wrong (region moved) you will have a retry loop
I have one dead region server. It's a server I brought down a few days
ago because it was too slow, but it still shows up on the HBase web
interface. However, if I look at the table, there are no table regions
hosted on this server. Hadoop was also removed from it, so it's
reporting one dead node.

>Do you have anything in the logs?
Nothing special. Only some "Block cache LRU eviction" entries.

> Could you share as well the code
Everything is at the end of this post.

>You can also check the cache hit and cache miss statistics that appears on the UI?
Can you please tell me how I can find that? I was not able to find
that on the web UI. Where should I look?

> In your random scan how many Regions are scanned
I only have 5 region servers and 12 table regions. So I guess all the
servers are called.


So here is the code for the gets. I removed the KeyOnlyFilter because
it's not improving the results.

JM




http://pastebin.com/K75nFiQk (for syntax highlighting)

HTable table = new HTable(config, "test3");

for (int iteration = 0; iteration < 10; iteration++)
{
	final int linesToRead = 1000;
	System.out.println(new java.util.Date() + " Processing iteration " + iteration + "... ");
	Vector<Get> gets = new Vector<Get>(linesToRead);

	// Build one Get per random 24-byte row key.
	for (long l = 0; l < linesToRead; l++)
	{
		byte[] array1 = new byte[24];
		for (int i = 0; i < array1.length; i++)
			array1[i] = (byte)Math.floor(Math.random() * 256);
		Get g = new Get(array1);
		gets.addElement(g);

		processed++; // counter declared outside this extract
	}
	Object[] results = new Object[gets.size()];

	// Time only the batch call itself.
	long timeBefore = System.currentTimeMillis();
	table.batch(gets, results);
	long timeAfter = System.currentTimeMillis();

	float duration = timeAfter - timeBefore;
	System.out.println("Time to read " + gets.size() + " lines : " + duration + " mseconds (" + Math.round(((float)linesToRead / (duration / 1000))) + " lines/seconds)");

	// Print all non-empty results. batch() fills the array with Result
	// objects for Gets, so the check is against Result, not KeyValue.
	for (int i = 0; i < results.length; i++)
	{
		if (results[i] instanceof Result)
			if (!((Result)results[i]).isEmpty())
				System.out.println("Result[" + i + "]: " + results[i]);
	}
}
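
As a side note on the key generation in the loop above, here is a minimal sketch of an alternative that fills the 24-byte key with java.util.Random.nextBytes instead of casting Math.random() byte by byte. The class and method names are hypothetical, and the fixed seed is an assumption added only so that the same keys can be replayed when checking whether a run is repeatable.

```java
import java.util.Random;

// Hypothetical helper for building random row keys; not part of the original test.
class RandomKeySketch {

    /** Returns a new key of the given length filled with random bytes. */
    static byte[] randomKey(Random rng, int length) {
        byte[] key = new byte[length];
        rng.nextBytes(key);
        return key;
    }

    public static void main(String[] args) {
        // Assumption: a fixed seed, so two runs request exactly the same keys.
        Random rng = new Random(42L);

        for (int i = 0; i < 3; i++) {
            byte[] key = randomKey(rng, 24);
            StringBuilder hex = new StringBuilder();
            for (byte b : key)
                hex.append(String.format("%02x", b));
            System.out.println("key " + i + ": " + hex);
        }
    }
}
```

Each byte[] returned here could be passed to new Get(...) exactly as array1 is in the loop above.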

2012/6/28, Ramkrishna.S.Vasudevan <ramkrishna.vasudevan@huawei.com>:
> Hi
>
> You can also check the cache hit and cache miss statistics that appears on
> the UI?
>
> In your random scan how many Regions are scanned whereas in gets may be
> many due to randomness.
>
> Regards
> Ram
>
>> -----Original Message-----
>> From: N Keywal [mailto:nkeywal@gmail.com]
>> Sent: Thursday, June 28, 2012 2:00 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan vs Put vs Get
>>
>> Hi Jean-Marc,
>>
>> Interesting.... :-)
>>
>> Added to Anoop questions:
>>
>> What's the hbase version you're using?
>>
>> Is it repeatable, I mean if you try twice the same "gets" with the
>> same client do you have the same results? I'm asking because the
>> client caches the locations.
>>
>> If the locations are wrong (region moved) you will have a retry loop,
>> and it includes a sleep. Do you have anything in the logs?
>>
>> Could you share as well the code you're using to get the ~100 ms time?
>>
>> Cheers,
>>
>> N.
>>
>> On Thu, Jun 28, 2012 at 6:56 AM, Anoop Sam John <anoopsj@huawei.com> wrote:
>> > Hi
>> >     How many Gets you batch together in one call? Is this equal to
>> > the Scan#setCaching () that u are using?
>> > If both are same u can be sure that the number of NW calls is
>> > coming almost same.
>> >
>> > Also you are giving random keys in the Gets. The scan will be always
>> > sequential. Seems in your get scenario it is very very random reads
>> > resulting in too many reads of HFile block from HDFS. [Block caching is
>> > enabled?]
>> >
>> > Also have you tried using Bloom filters?  ROW blooms might improve
>> > your get performance.
>> >
>> > -Anoop-
>> > ________________________________________
>> > From: Jean-Marc Spaggiari [jean-marc@spaggiari.org]
>> > Sent: Thursday, June 28, 2012 5:04 AM
>> > To: user
>> > Subject: Scan vs Put vs Get
>> >
>> > Hi,
>> >
>> > I have a small piece of code, for testing, which is putting 1B lines
>> > in an existing table, getting 3000 lines and scanning 10000.
>> >
>> > The table is one family, one column.
>> >
>> > Everything is done randomly. Put with Random key (24 bytes), fixed
>> > family and fixed column names with random content (24 bytes).
>> >
>> > Get (batch) is done with random keys and scan with RandomRowFilter.
>> >
>> > And here are the results.
>> > Time to insert 1000000 lines: 43 seconds (23255 lines/seconds)
>> > That's correct for my needs based on the poor performances of the
>> > servers in the cluster. I'm fine with the results.
>> >
>> > Time to read 3000 lines: 11444.0 mseconds (262 lines/seconds)
>> > This is way too low. I don't understand why. So I tried the random
>> > scan because I'm not able to figure out the issue.
>> >
>> > Time to read 10000 lines: 108.0 mseconds (92593 lines/seconds)
>> > This is impressive! I have added that after I failed with the get. I
>> > moved from 262 lines per seconds to almost 100K lines/seconds!!! It's
>> > awesome!
>> >
>> > However, I'm still wondering what's wrong with my gets.
>> >
>> > The code is very simple. I'm using Get objects that I'm executing in
>> > a Batch. I tried to add a filter but it's not helping. Here is an
>> > extract of the code.
>> >
>> >                        for (long l = 0; l < linesToRead; l++)
>> >                        {
>> >                                byte[] array1 = new byte[24];
>> >                                for (int i = 0; i < array1.length; i++)
>> >                                        array1[i] = (byte)Math.floor(Math.random() * 256);
>> >                                Get g = new Get (array1);
>> >                                gets.addElement(g);
>> >                        }
>> >                        Object[] results = new Object[gets.size()];
>> >                        System.out.println(new java.util.Date () + " \"gets\" created.");
>> >                        long timeBefore = System.currentTimeMillis();
>> >                        table.batch(gets, results);
>> >                        long timeAfter = System.currentTimeMillis();
>> >
>> >                        float duration = timeAfter - timeBefore;
>> >                        System.out.println ("Time to read " + gets.size() + " lines : "
>> >                                + duration + " mseconds (" + Math.round(((float)linesToRead /
>> >                                (duration / 1000))) + " lines/seconds)");
>> >
>> > What's wrong with it? I can't add setBatch, nor can I add
>> > setCaching, because it's not a scan. I tried with different numbers of
>> > gets but it's almost always the same speed. Am I using it the wrong
>> > way? Does anyone have any advice to improve that?
>> >
>> > Thanks,
>> >
>> > JM
>
>
