hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Bulkloading impacts to block locality (0.94.6)
Date Tue, 13 Aug 2013 04:58:59 GMT
Now that I wrote this, I think we should improve that.
For example, we could add an RPC to the regionserver and have the regionserver that would own
the region copy the appropriate part of the file (then the data would be local). Or, even simpler,
instead of actually copying the files we could just copy in the reference files and let the
usual compactions take care of the reference files.


-- Lars

----- Original Message -----
From: lars hofhansl <larsh@apache.org>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Monday, August 12, 2013 9:35 PM
Subject: Re: Bulkloading impacts to block locality (0.94.6)

A write in HDFS (by default) places one copy on the local datanode, another one on a node
in a different rack (when applicable), and a third one on a node in the same rack.
HBase gets data locality by being co-located with the data nodes, so after a compaction all
blocks of the compacted HFile(s) are local.
For bulkload you probably had an external process place the HFiles onto HDFS, and hence the
locations of these HFiles' blocks are more or less random (from HBase's point of view).
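
The difference between the two cases can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: rack awareness is ignored, the first replica goes to the writer when the writer is itself a datanode, and the remaining replicas go to distinct nodes picked uniformly at random. The names place_replicas and locality, the node labels, and the "external" writer are hypothetical and are not HDFS or HBase APIs.

```python
import random

def place_replicas(writer, datanodes, replication, rng):
    """Simplified placement model (no racks): the first replica lands on the
    writer if the writer is a datanode; the rest go to distinct random nodes."""
    chosen = [writer] if writer in datanodes else []
    candidates = [n for n in datanodes if n not in chosen]
    chosen += rng.sample(candidates, replication - len(chosen))
    return set(chosen)

def locality(reader, writer, datanodes, replication=3, blocks=20000, seed=0):
    """Fraction of simulated blocks that have a replica on the reader's node."""
    rng = random.Random(seed)
    local = sum(
        reader in place_replicas(writer, datanodes, replication, rng)
        for _ in range(blocks)
    )
    return local / blocks

nodes = [f"dn{i}" for i in range(10)]
# File written by the regionserver on dn0 itself (e.g. a compaction).
compacted = locality("dn0", "dn0", nodes)
# File written by a process that is not the regionserver's datanode (bulkload).
bulkloaded = locality("dn0", "external", nodes)
print(compacted, round(bulkloaded, 2))
```

In this model the file written by the co-located regionserver is always fully local, while the externally written file is local only about as often as random placement allows. If the external writer runs on some other datanode of the cluster, the result from the regionserver's point of view is similar.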

Sometimes the HFiles need to be split again (if they do not fit the current region boundaries).
In that case we could be smart and write the split hfiles on the correct data nodes to get data
locality, but it seems we are not doing that.

-- Lars

From: Scott Kuehn <scott.kuehn@opower.com>
To: user@hbase.apache.org 
Sent: Wednesday, August 7, 2013 1:19 PM
Subject: Bulkloading impacts to block locality (0.94.6)

I'd like to improve block locality on a system where nearly 100% of data
ingest is via bulkloading. Presently, I measure block locality by
monitoring the hdfsBlocksLocalityIndex metric. On a 10 node cluster with
block replication of 3, the block locality index is about 30%, which is
what I'd expect to see from random block placement.  Running a major
compaction does not significantly improve the locality.
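
The 30% expectation is simple arithmetic: assuming each block's replicas land on distinct datanodes chosen uniformly at random, a given node holds one of them with probability replication / nodes. A minimal sketch (the function name is hypothetical, not an HBase metric or API):

```python
from fractions import Fraction

def expected_local_fraction(num_nodes: int, replication: int) -> Fraction:
    """Expected fraction of blocks with a replica on one given node, assuming
    replicas are placed on distinct nodes chosen uniformly at random."""
    if num_nodes <= 0 or replication <= 0:
        raise ValueError("num_nodes and replication must be positive")
    # A block cannot have more replicas than there are nodes.
    return Fraction(min(replication, num_nodes), num_nodes)

print(float(expected_local_fraction(10, 3)))  # 0.3
```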

How can I maximize block locality in a bulkloading-based system? 
