accumulo-user mailing list archives

From Arshak Navruzyan <>
Subject Re: ISAM file location vs. read performance
Date Thu, 16 Jan 2014 19:12:04 GMT
I did some manual testing on this to see where HDFS places blocks relative
to the location of the tablets.  I used the following command to see which
datanodes hold replicas of each block of the RFiles.

hadoop fsck /accumulo/tables/a -locations -blocks -files

From my limited testing, it appears that John's observation that "tserver
with ultimately end up major compacting it's files, ensuring locality" is
indeed true.  In every case, the node responsible for the tablet held a
copy of every block of that RFile.
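
The check described above can be automated. Below is a minimal Python
sketch that parses fsck output and reports, per file, which hosts hold a
replica of every block. It assumes the older fsck layout (a file line
followed by numbered block lines ending in a bracketed list of ip:port
addresses); newer Hadoop releases print more detail per replica, so the
regular expressions may need adjusting. The sample text, paths, block IDs,
and addresses are made up for illustration.

```python
import re
from collections import OrderedDict

# Illustrative fsck output; paths, block IDs, and addresses are invented.
SAMPLE = """\
/accumulo/tables/a/t-0001/F0000abc.rf 150000000 bytes, 2 block(s):  OK
0. blk_1001_1 len=134217728 repl=3 [10.0.0.1:50010, 10.0.0.2:50010, 10.0.0.3:50010]
1. blk_1002_1 len=15782272 repl=3 [10.0.0.1:50010, 10.0.0.4:50010, 10.0.0.5:50010]

/accumulo/tables/a/t-0002/F0000abd.rf 1000 bytes, 1 block(s):  OK
0. blk_1003_1 len=1000 repl=3 [10.0.0.2:50010, 10.0.0.3:50010, 10.0.0.4:50010]
"""

# Assumed line shapes; adjust these for the Hadoop version in use.
FILE_LINE = re.compile(r"^(/\S+) \d+ bytes, \d+ block\(s\)")
BLOCK_LINE = re.compile(r"^\d+\. ")
HOST = re.compile(r"(\d+\.\d+\.\d+\.\d+):\d+")


def hosts_per_block(fsck_text):
    """Map each file path to a list of host sets, one set per block."""
    files = OrderedDict()
    current = None
    for raw in fsck_text.splitlines():
        line = raw.strip()
        match = FILE_LINE.match(line)
        if match:
            current = files.setdefault(match.group(1), [])
        elif line.startswith("/"):
            # Directory entry or unrecognized path line: stop collecting.
            current = None
        elif current is not None and BLOCK_LINE.match(line):
            current.append(set(HOST.findall(line)))
    return files


def hosts_with_full_copy(block_hosts):
    """Return the hosts that hold a replica of every block of one file."""
    if not block_hosts:
        return set()
    return set.intersection(*block_hosts)


if __name__ == "__main__":
    for path, blocks in hosts_per_block(SAMPLE).items():
        print(path, sorted(hosts_with_full_copy(blocks)))
```

To use it against a real cluster, feed it the captured output of the fsck
command above in place of SAMPLE and compare each file's host set with the
tserver that currently hosts the tablet.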

More extensive testing in bigger environments would probably still be
helpful before we write this into the documentation.  I'm also not sure
what happens during tserver failures/reassignments.

One thing that would make testing much easier is if "getsplits -v" reported
the HDFS location of each tablet's files.  Right now you have to trawl
through !METADATA to figure it out.
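
Until something like that exists, the lookup can be scripted. The Python
sketch below groups Accumulo shell scan output from !METADATA by tablet,
collecting each tablet's current tserver and RFile paths. It assumes the
1.5-era metadata layout: "file" column qualifiers relative to the table
directory, "loc" values of the form host:port, one "row family:qualifier
[visibility] value" entry per line, and rows without whitespace. The sample
rows, session IDs, and hosts are made up, and the /accumulo prefix is taken
from the fsck command above.

```python
from collections import defaultdict

# Illustrative shell scan output for table ID "a"; all values are invented.
SAMPLE_SCAN = """\
a;m file:/t-0001/F0000abc.rf []    150000000,200000
a;m loc:143a1b2c3d4e5f6 []    10.0.0.1:9997
a< file:/default_tablet/F0000abd.rf []    1000,10
a< loc:143a1b2c3d4e5f7 []    10.0.0.2:9997
"""


def tablets_from_scan(scan_text, table_id, root="/accumulo/tables"):
    """Group metadata entries by tablet row into tserver host and file paths."""
    tablets = defaultdict(lambda: {"tserver": None, "files": []})
    for line in scan_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # Skip blank or unrecognized lines.
        row, column, value = parts[0], parts[1], parts[-1]
        family, _, qualifier = column.partition(":")
        if family == "file":
            # Assumed layout: the qualifier is a path relative to the table dir.
            tablets[row]["files"].append("%s/%s%s" % (root, table_id, qualifier))
        elif family == "loc":
            # Assumed layout: the value is host:port of the hosting tserver.
            tablets[row]["tserver"] = value.split(":")[0]
    return dict(tablets)


if __name__ == "__main__":
    for row, info in tablets_from_scan(SAMPLE_SCAN, "a").items():
        print(row, info["tserver"], info["files"])
```

Each tablet's tserver host can then be compared against the datanodes that
fsck reports for the same file paths.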

On Mon, Jan 13, 2014 at 10:25 AM, Arshak Navruzyan <> wrote:

> Thanks for all the explanations.  Perhaps this is something we should
> clearly spell out in the documentation once all the facts are in.  I'll
> keep a task open for now. (
> On Sun, Jan 12, 2014 at 4:26 PM, Donald Miner <> wrote:
>> HDFS-385 is for custom pluggable block placement policies, and there has
>> been some talk (I think) about improving mean time to recovery and data
>> locality in HBase.
>> Basically this would allow Accumulo to have a policy for its blocks and
>> control its own destiny, instead of things like the rebalancer screwing
>> things up.
>> I honestly don't know much else about this. Just thought it might be
>> relevant to the conversation.
>> > On Jan 12, 2014, at 6:42 PM, Josh Elser <> wrote:
>> >
>> >
>> >
>> >> On 1/12/14, 6:17 PM, Sean Busbey wrote:
>> >> On Sun, Jan 12, 2014 at 4:42 PM, William Slacum <> wrote:
>> >>
>> >>    Some data on short circuit reads would be great to have.
>> >>
>> >>
>> >> What kind of data are you looking for? Just HDFS read rates? or
>> >> specifically Accumulo when set up to make use of it?
>> >
>> > I believe what Bill means, and what I'm also curious about, is
>> > specifically the impact on performance for Accumulo's workload: a merged
>> > read over multiple files. An easy test might be to create multiple RFiles
>> > (1 to 10 files?) which contain interspersed data. Test some sort of
>> > random-read and random-seek+sequential-read workloads, from 1 to 10
>> > RFiles, and with short-circuit reads on and off.
>> >
>> > Perhaps a slightly more accurate test would be to raise the compaction
>> > ratio on a table, bulk import the RFiles into that single table, and
>> > then just use the regular client API.
>> >
>> >>    I'm unsure of how correct the "compaction leading to eventual
>> >>    locality" postulation is. It seems, to me at least, that in the case
>> >>    of a multi-block file, the file system would eventually try to
>> >>    distribute those blocks rather than leave them all on a single host.
>> >>
>> >>
>> >>
>> >>
>> >> I know in HBase setups, it's common to either disable the HDFS
>> >> Balancer or just disable it for a namespace containing the part of the
>> >> filesystem that handles HBase. Otherwise, when the blocks are moved off
>> >> to other hosts you get performance degradation until compaction can
>> >> happen again.
>> >> I would expect the same thing ought to be done for Accumulo.
>> >
>> > AFAIK, HBase also does a lot more when assigning Tablets with regard
>> > to the blocks that serve them, no? To my knowledge, Accumulo doesn't do
>> > anything like this. I don't want users to think that disabling the HDFS
>> > balancer is a good idea for Accumulo unless we have actual evidence.
