Subject: Re: [DISCUSS] HDFS operation to support Accumulo locality
From: Keith Turner
To: Accumulo Dev List <dev@accumulo.apache.org>
Date: Tue, 30 Jun 2015 12:10:22 -0400

On Tue, Jun 30, 2015 at 11:39 AM, Josh Elser wrote:

> Sorry in advance if I derail this, but I'm not sure what it would take
> to actually implement such an operation. The initial pushback might just
> be "use the block locations and assign the tablet yourself", since
> that's essentially what HBase does (not suggesting there isn't something
> better to do, just a hunch).
>
> IMO, we don't have a lot of information on locality presently. I was
> thinking it would be nice to create a tool to help us understand
> locality at all.
>
> My guess is that after this, our next big gain would be choosing a
> better candidate for where to move a tablet in the case of rebalancing,
> splits, and previous-server failure (pretty much all of the times that
> we aren't/can't put the tablet back to its previous loc). I'm not sure
> how far this would get us combined with the favored nodes API, e.g. a
> tablet has some favored datanodes which we include in the HDFS calls,
> and we can try to put the tablet on one of those nodes and assume that
> HDFS will have the blocks there.
>
> tl;dr I'd want to have examples of how the current API is insufficient
> before lobbying for new HDFS APIs.

Having some examples of how the status quo is insufficient is a good idea.
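On Josh's point about a tool to understand locality, I think most of it
falls out of the existing block location API. Here is a rough, untested
sketch of the idea; the tablet-to-files mapping and the tserver hostname
are assumed inputs, and all of the class and method names are made up.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalityReport {

  // Fraction of a tablet's file blocks that have a replica on the
  // tserver's host. 1.0 means fully local, 0.0 means every read goes
  // over the network.
  static double localBlockFraction(FileSystem fs, List<Path> tabletFiles,
      String tserverHost) throws IOException {
    long totalBlocks = 0;
    long localBlocks = 0;
    for (Path file : tabletFiles) {
      FileStatus status = fs.getFileStatus(file);
      for (BlockLocation loc : fs.getFileBlockLocations(status, 0,
          status.getLen())) {
        totalBlocks++;
        for (String host : loc.getHosts()) {
          if (host.equals(tserverHost)) {
            localBlocks++;
            break;
          }
        }
      }
    }
    return totalBlocks == 0 ? 0.0 : (double) localBlocks / totalBlocks;
  }
}

Run per tablet and rolled up per table, that would at least tell us how
bad things are today.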
I was trying to think of situations where there are no suitable nodes
that have *all* of a tablet's file blocks local. In these situations the
best we can hope for is a node that has the largest subset of a tablet's
file blocks (see the sketch after the list below). I think the following
scenarios can leave us with no node that has all of a tablet's file
blocks.

 * Added X new tablet servers. Tablets were moved in order to spread
   them evenly.
 * A lot of tablets in a table just split. In order to spread tablets
   evenly across the cluster, some need to be moved.
 * Decommissioned X tablet servers. Tablets were moved in order to
   spread them evenly.
 * A tablet has been hosted on multiple tablet servers over time, and as
   a result there is no single datanode that has all of its file blocks.
 * Tablet servers run on a subset of the datanodes. As the ratio of
   tservers to datanodes goes down, the chance of finding a datanode
   with many of a tablet's file blocks goes down.
 * Decommissioning or adding datanodes could also throw off a tablet's
   locality.

Are there other cases I am missing?
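To make the "largest subset" idea concrete, a balancer could score
candidate hosts by how many of a tablet's blocks they already hold and
pick the best one. Another rough, untested sketch with made-up names:

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AssignmentHint {

  // Count the tablet's blocks per datanode host, then return the host
  // holding the largest subset as the assignment candidate.
  static String bestHost(FileSystem fs, List<Path> tabletFiles)
      throws IOException {
    Map<String,Integer> blocksPerHost = new HashMap<String,Integer>();
    for (Path file : tabletFiles) {
      FileStatus status = fs.getFileStatus(file);
      for (BlockLocation loc : fs.getFileBlockLocations(status, 0,
          status.getLen())) {
        for (String host : loc.getHosts()) {
          Integer count = blocksPerHost.get(host);
          blocksPerHost.put(host, count == null ? 1 : count + 1);
        }
      }
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String,Integer> e : blocksPerHost.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }
}

In practice the balancer would have to intersect that with the set of
live tservers and the usual load constraints, but it shows how far the
current API can take us.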
>
> Keith Turner wrote:
>
>> I just thought of one potential issue with this. The same file can be
>> shared by multiple tablets on different tservers. If there are more
>> than 3 tablets sharing a file, it could cause problems if all of them
>> request a local replica. So if HDFS had this operation, Accumulo would
>> have to be careful about which files it requested local blocks for.
>>
>> On Tue, Jun 30, 2015 at 11:00 AM, Keith Turner wrote:
>>
>>> There was a discussion on IRC about balancing and locality yesterday.
>>> I was thinking about the locality problem, and started thinking about
>>> the possibility of an HDFS operation that would force a file to have
>>> local replicas. I think this approach has the following pros over
>>> forcing a compaction.
>>>
>>> * Only one replica is copied across the network.
>>> * Avoids decompressing, deserializing, serializing, and compressing
>>>   data.
>>>
>>> The tricky part about this approach is that Accumulo needs to decide
>>> when to ask HDFS to make a file local. This decision could be based
>>> on a function of the file size and number of recent accesses.
>>>
>>> We could avoid decompressing, deserializing, etc. today by just
>>> copying (not compacting) frequently accessed files. However this
>>> would write 3 replicas where an HDFS operation would only write one.
>>>
>>> Note for the assertion that only one replica would need to be copied,
>>> I was thinking of the following 3 initial conditions. I am assuming
>>> we want to avoid having all three replicas on the same rack.
>>>
>>> * Zero replicas on rack: can copy a replica to the node and drop a
>>>   replica on another rack.
>>> * One replica on rack: can copy a replica to the node and drop any
>>>   other replica.
>>> * Two replicas on rack: can copy a replica to the node and drop the
>>>   other replica on the same rack.
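P.S. To make the "when to ask HDFS to make a file local" decision from
my first message concrete, one possible heuristic (entirely made up,
would need tuning) is to localize once the network cost of remote reads
exceeds the one-time cost of copying a single replica:

public class LocalizeHeuristic {

  // Hypothetical heuristic: ask HDFS to localize a file once the bytes
  // read over the network exceed the one-time network cost of copying
  // one replica, which is roughly the file size.
  static boolean shouldLocalize(long fileSizeBytes, long readsSinceWritten,
      double localBlockFraction) {
    // Approximate network bytes read so far, assuming whole-file scans.
    double remoteBytesRead =
        readsSinceWritten * fileSizeBytes * (1.0 - localBlockFraction);
    return remoteBytesRead > fileSizeBytes;
  }
}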