accumulo-dev mailing list archives

From Corey Nolet <>
Subject Re: Is Data Locality Helpful? (or why run tserver and datanode on the same box?)
Date Thu, 19 Jun 2014 14:25:30 GMT
AFAIK, locality is not guaranteed right away unless the data for a
tablet was first ingested on the tablet server that is responsible for that
tablet. Otherwise you'll need to wait for a major compaction to rewrite the
RFiles locally on the tablet server. I would assume that if the tablet server
is not on the same node as a datanode, those files will probably be spread
across the cluster, as if you were ingesting data from outside the cloud.
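
To put a rough number on this, one could look at which datanodes hold the
blocks of a tablet's RFiles and count how many have a replica on the tablet
server's own host. The sketch below is only an illustration: the function
name locality_fraction and the block-to-host lists are made up, and on a real
cluster the mapping would have to be collected from HDFS. It relies on the
default HDFS behavior of writing the first replica of a block to the datanode
on the writer's host, which is why a major compaction should restore locality.

```python
# Hypothetical helper: estimate how "local" a tablet's RFile blocks are.
# The block-to-host lists below are invented sample data, not cluster output.

def locality_fraction(block_hosts, tserver_host):
    """Return the fraction of blocks with at least one replica on tserver_host."""
    if not block_hosts:
        return 0.0
    local = sum(1 for hosts in block_hosts if tserver_host in hosts)
    return local / len(block_hosts)

# Each inner list names the datanodes holding one block's replicas.
# Before a major compaction: files were written elsewhere, so only some
# blocks happen to have a replica on the tablet server's host (slave1).
before_compaction = [
    ["slave2", "slave3"],
    ["slave1", "slave3"],
    ["slave2", "slave3"],
    ["slave1", "slave2"],
]

# After a major compaction: the tablet server on slave1 rewrote the file,
# so (assuming a datanode runs on slave1) every block has a local replica.
after_compaction = [
    ["slave1", "slave2"],
    ["slave1", "slave3"],
    ["slave1", "slave2"],
    ["slave1", "slave3"],
]

print(locality_fraction(before_compaction, "slave1"))  # 0.5
print(locality_fraction(after_compaction, "slave1"))   # 1.0
```

A fraction well below 1.0 for a tablet would suggest that most of its reads
go over the network until the next major compaction.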

A recent discussion with Bill Slacum also brought to light a possible
problem with the HDFS balancer [1] re-balancing blocks after the fact, which
could eventually move blocks onto datanodes that are not local to the
tablets. I believe the remedy for this was to turn off the balancer or not
run it.


On Thu, Jun 19, 2014 at 10:07 AM, David Medinets <> wrote:

> At the Accumulo Summit and on a recent client site, there have been
> conversations about Data Locality and Accumulo.
> I ran an experiment to see that Accumulo can scan tables when the
> tserver process is run on a server without a datanode process. I
> followed these steps:
> 1. Start three node cluster
> 2. Load data
> 3. Kill datanode on slave1
> 4. Wait until Hadoop notices dead node.
> 5. Kill tserver on slave2
> 6. Wait until Accumulo notices dead node.
> 7. Run the accumulo shell on master and slave1 to verify entries can be
> scanned.
> Accumulo handled this situation just fine. As I expected.
> How important (or not) is it to run tserver and datanode on the same
> server?
> Does the Data Locality implied by running them together exist?
> Can the benefit be quantified?
