accumulo-dev mailing list archives

From Josh Elser <>
Subject Re: Is Data Locality Helpful? (or why run tserver and datanode on the same box?)
Date Thu, 19 Jun 2014 16:56:07 GMT
I may also be getting this conflated with how reads work. Time for me to 
read some HDFS code.

On 6/19/14, 8:52 AM, Josh Elser wrote:
> I believe this happens via the DFSClient, but you can only expect the
> first block of a file to actually be on the local datanode (assuming
> there is one). Everything else may be remote. Assuming you
> have a proper rack script set up, you would expect to still
> get at least one rack-local replica (so you'd have a block nearby).
>
> Interestingly (at least to me), I believe HBase does a bit of work in
> region (tablet) assignments to try to maximize the locality of regions
> relative to the datanodes hosting the blocks of their files. I
> need to dig into their code some day though.
>
> In general, Accumulo and HBase tend to be relatively comparable to one
> another in performance when properly configured, which makes me apt to
> think that data locality can help, but it's not some holy grail (of
> course you won't ever hear me claim anything to be in that position). I
> will say that I haven't done any real quantitative analysis either, though.
>
> tl;dr HDFS block locality should not be affecting the functionality of
> Accumulo.
>
> On 6/19/14, 7:25 AM, Corey Nolet wrote:
>> AFAIK, the locality may not be guaranteed right away unless the data for a
>> tablet was first ingested on the tablet server that is responsible for that
>> tablet; otherwise you'll need to wait for a major compaction to rewrite the
>> RFiles locally on the tablet server. I would assume if the tablet server is
>> not on the same node as the datanode, those files will probably be spread
>> across the cluster as if you were ingesting data from outside the cloud.
>>
>> A recent discussion with Bill Slacum also brought to light a possible
>> problem of the HDFS balancer [1] re-balancing blocks after the fact, which
>> could eventually pull blocks onto datanodes that are not local to the
>> tablets. I believe the remedy for this was to turn off the balancer or not
>> have it run.
>> [1]
>> On Thu, Jun 19, 2014 at 10:07 AM, David Medinets <> wrote:
>>> At the Accumulo Summit and on a recent client site, there have been
>>> conversations about Data Locality and Accumulo.
>>> I ran an experiment to see whether Accumulo can scan tables when the
>>> tserver process is run on a server without a datanode process. I
>>> followed these steps:
>>> 1. Start a three-node cluster.
>>> 2. Load data.
>>> 3. Kill the datanode on slave1.
>>> 4. Wait until Hadoop notices the dead node.
>>> 5. Kill the tserver on slave2.
>>> 6. Wait until Accumulo notices the dead node.
>>> 7. Run the accumulo shell on master and slave1 to verify entries can be
>>> scanned.
>>> Accumulo handled this situation just fine, as I expected.
>>> How important (or not) is it to run tserver and datanode on the same
>>> server?
>>> Does the Data Locality implied by running them together exist?
>>> Can the benefit be quantified?
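
On the last question above, one rough way to put a number on locality is to
measure, for the files behind a tablet, what fraction of HDFS blocks have at
least one replica on the host running the tablet server. The sketch below
does only that arithmetic. The function name, the block IDs, and the replica
placement are all made up for illustration (host names are borrowed from the
experiment in the thread); on a real cluster the host lists would have to
come from something like `hdfs fsck <path> -files -blocks -locations` or
Hadoop's `FileSystem.getFileBlockLocations`. It says nothing about scan
performance by itself, only how many reads could be served locally.

```python
from typing import Dict, List


def local_block_fraction(block_hosts: Dict[str, List[str]], tserver_host: str) -> float:
    """Return the fraction of blocks with at least one replica on tserver_host."""
    if not block_hosts:
        return 0.0
    local = sum(1 for hosts in block_hosts.values() if tserver_host in hosts)
    return local / len(block_hosts)


# Hypothetical replica placement for one RFile with four blocks
# (replication factor 2 assumed so that a three-node cluster is not
# trivially 100% local).
rfile_blocks = {
    "blk_1": ["slave1", "slave2"],
    "blk_2": ["slave1", "master"],
    "blk_3": ["slave2", "master"],
    "blk_4": ["master", "slave2"],
}

for host in ("master", "slave1", "slave2"):
    print(host, local_block_fraction(rfile_blocks, host))
```

Comparing this fraction before and after a major compaction, or before and
after an HDFS balancer run, would be one way to check the effects Corey
describes, assuming the block locations are collected at both points.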
