hbase-user mailing list archives

From TuX RaceR <tuxrace...@gmail.com>
Subject Re: hadoop dfs.replication parameter and hbase/performance for random/scanner access
Date Tue, 05 Jan 2010 10:08:57 GMT
Thanks a lot, St.Ack, for the time you spend answering user questions
and for developing this nice piece of software (hbase).

stack wrote:
> The amount of replication should have no effect on either access mode.
>  Whether scanning or random-accessing, only one of the N replicas is
> accessed.  We'll only go to the other versions if there is trouble accessing
> the first.
> So, more replicas will not change the performance profile.
I am not sure whether hbase or hadoop is responsible for choosing the
location of the replicas. Having more replicas may not avoid the disk
random-read limitations, but shouldn't it avoid network latency?
If I have a web application with N clients accessing hbase, and one of
those clients has to get the value for a key, shouldn't it be faster to
access it if the value for that key is stored on that node (as we avoid
a network call)? But you are right, it does not seem I can get around
the disk random-read performance limitations.
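
To make the question concrete, here is a rough back-of-the-envelope
sketch. It rests entirely on my own simplifying assumptions, not on how
hbase or hadoop actually place or read replicas: each value has r
replicas on distinct nodes chosen uniformly at random out of n, a client
running on one of those nodes could read a local replica directly, and
a remote read costs one extra network hop. The function names and the
timing numbers are hypothetical.

```python
def local_read_probability(replicas: int, nodes: int) -> float:
    """Chance that a given node holds one of the replicas.

    Assumption: replicas sit on distinct nodes picked uniformly at
    random, which simplifies whatever placement policy is really used.
    """
    if nodes <= 0 or replicas < 0:
        raise ValueError("nodes must be positive and replicas non-negative")
    # There cannot be more distinct replica holders than nodes.
    return min(replicas, nodes) / nodes


def expected_read_ms(replicas: int, nodes: int,
                     disk_ms: float, network_ms: float) -> float:
    """Expected read time under the toy model above.

    Every read pays the disk cost; only non-local reads pay the
    additional network cost.
    """
    p_local = local_read_probability(replicas, nodes)
    return disk_ms + (1.0 - p_local) * network_ms


if __name__ == "__main__":
    # Hypothetical figures: 10 ms per random disk read, 1 ms per network hop.
    for r in (1, 3, 10):
        print(r, expected_read_ms(r, nodes=10, disk_ms=10.0, network_ms=1.0))
```

In this toy model, extra replicas only shrink the network term; the disk
term stays the same however many replicas there are, which is the
limitation mentioned above.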
> What do you need to improve?  Are both scans and random-reads slow for you?
>   You've seen the performance page up on the wiki (I'm sure you have).
Unfortunately I am not in a position to really benchmark my application,
as I currently can't run it on a true cluster (using a cluster of
virtual machines would obviously lead to wrong results ;). At this stage
I am just trying to understand how hbase/hadoop work, to avoid big
mistakes in the design of the architecture. My application currently
runs in production on a PostgreSQL database: I replicate it over several
nodes, and read access performs better when I have more replicas because
each node connects to a local database.

