incubator-blur-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject Re: Short circuit reads and hadoop
Date Tue, 15 Oct 2013 17:48:19 GMT
Sorry, some content got truncated in my previous mail. Resending the full version:


1. Specifying hints to the namenode about the favored set of datanodes
    [HDFS-2576] when writing a file in hadoop (see the sketch below).

    The downside of such a feature shows up on a datanode failure: the HDFS
    Balancer and the HBase Balancer will contend over re-distribution of
    the blocks.

    Facebook's HBase port on GitHub runs this patch in production and deals
    with the conflicts by periodically running a cron process.

    Even running the HDFS Balancer when all shards are online can wreck the
    carefully planned layout. HDFS-4420 addresses this by allowing certain
    sub-trees to be excluded from balancing.

2. There is another interesting JIRA, HDFS-4606, yet to begin, which aims
    to move a block to the local datanode when it detects that a remote
    read has been performed.

    This will be the best option when completed, as it requires no plumbing
    code whatsoever from Blur, HBase, etc.

One big downside is the effect it has on the local data-node, which will
already be serving data and would now also perform disk writes as a result
of the remote reads.
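
For reference, here is a minimal, untested sketch of how a writer could
pass the favored-nodes hint from HDFS-2576 through DistributedFileSystem.
The host name, port, path and sizes are placeholders, and the exact
create() overload may differ slightly between Hadoop versions:

    import java.net.InetSocketAddress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class FavoredNodeWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (!(fs instanceof DistributedFileSystem)) {
          throw new IllegalStateException("favored nodes need HDFS");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Hint to the namenode: try to place replicas on the shard
        // server's own datanode first (hostname/port are placeholders).
        InetSocketAddress[] favoredNodes = new InetSocketAddress[] {
            new InetSocketAddress("shard-host-1", 50010) };

        FSDataOutputStream out = dfs.create(
            new Path("/blur/tables/t1/shard-0/example.seg"),
            FsPermission.getDefault(), true /* overwrite */,
            64 * 1024 /* buffer */, (short) 3 /* replication */,
            128L * 1024 * 1024 /* block size */, null /* progress */,
            favoredNodes);
        out.write(new byte[] { 1, 2, 3 });
        out.close();
      }
    }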

If you were not already aware of these, or if you have alternate ideas for
Blur, please let me know your thoughts.

--
Ravi


On Tue, Oct 15, 2013 at 11:07 PM, Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> Actually, what I unearthed after a long time of fishing around is this:
>
> 1. Specifying hints to the namenode about the favored set of datanodes
>     [HDFS-2576] when writing a file in hadoop.
>
>     The downside of such a feature is a datanode failure. HDFS Balancer and
>
>      Facebook's HBase port in GitHub has this patch in production
>
> 2.
>
>
>
>
> On Tue, Oct 15, 2013 at 7:15 PM, Aaron McCurry <amccurry@gmail.com> wrote:
>
>> On Tue, Oct 15, 2013 at 9:25 AM, Ravikumar Govindarajan <
>> ravikumar.govindarajan@gmail.com> wrote:
>>
>> > Your idea looks great.
>> >
>> > But I am afraid I don't seem to grasp it fully. I am attempting to
>> > understand your suggestions, so please bear with me for a moment.
>> >
>> > When co-locating shard-server and data-nodes
>> >
>> > 1. Every shard-server will attempt to write a local copy
>> > 2. Proceed with default settings of hadoop [remote rack etc...] for next
>> > copies
>> >
>> > Essentially, there are 2 problems when writing a file:
>> >
>> > 1. Unable to write a local copy [lack of sufficient storage etc...].
>> >    Hadoop internally deflects the write to some other destination.
>> >
>>
>> True, this might happen if the cluster is low on storage and there are
>> local disk failures.  If the cluster is in that kind of condition and
>> running the normal Hadoop balancer won't help (i.e. free up local storage
>> by migrating blocks to another machine), then it's likely we can't do
>> anything to help the situation.
>>
>>
>> > 2. Shard-server failure. Since this is stateless, it will most likely
>> >    be a hardware failure or planned maintenance.
>> >
>>
>> Yes this is true.
>>
>>
>> >
>> > Instead of asynchronously re-balancing after every write [moving blocks
>> > to the shard], is it possible for us to trap just those two cases?
>> >
>>
>> Not sure what you mean here.
>>
>>
>> >
>> > BTW, how do we re-arrange the shard layout when shards are
>> > added/removed?
>> >
>> > I looked at the 0.20 code and it seems to move shards too much. That
>> > could be detrimental for what we are discussing now, right?
>> >
>>
>> I have fixed this issue or at least improved it.
>>
>> https://issues.apache.org/jira/browse/BLUR-260
>>
>> This implementation will only move the shards that are down, and will
>> never move any more shards than is necessary to maintain a properly
>> balanced shard cluster.
>>
>> Basically it recalculates an optimal layout when the number of servers
>> changes.  It will likely need to be improved by optimizing the location
>> of the shard (which server) by figuring out which of the servers that can
>> serve the shard holds the most blocks from the shard's indexes.  Doing
>> this should minimize block movement.
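>>
>> A rough, untested sketch of how that block-count lookup could work with
>> the plain FileSystem API (the shard path below is only an assumption
>> about the directory layout):
>>
>>   import java.util.HashMap;
>>   import java.util.Map;
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.BlockLocation;
>>   import org.apache.hadoop.fs.FileStatus;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>
>>   public class ShardBlockCounter {
>>
>>     /** Counts how many index-file blocks of a shard live on each host. */
>>     public static Map<String, Long> blocksPerHost(FileSystem fs, Path shardDir)
>>         throws Exception {
>>       Map<String, Long> counts = new HashMap<String, Long>();
>>       for (FileStatus file : fs.listStatus(shardDir)) {
>>         if (file.isDirectory()) {
>>           continue;
>>         }
>>         BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
>>         for (BlockLocation block : blocks) {
>>           for (String host : block.getHosts()) {
>>             Long current = counts.get(host);
>>             counts.put(host, current == null ? 1L : current + 1L);
>>           }
>>         }
>>       }
>>       return counts;
>>     }
>>
>>     public static void main(String[] args) throws Exception {
>>       FileSystem fs = FileSystem.get(new Configuration());
>>       // The layout logic would pick the candidate server with the
>>       // highest count.  The path here is purely illustrative.
>>       System.out.println(blocksPerHost(fs, new Path("/blur/tables/t1/shard-0")));
>>     }
>>   }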
>>
>>
>> I also had another thought: if we follow something similar to what HBase
>> does and perform a full optimization on each shard once a day, that would
>> locate the data on the local server without us having to do any other
>> work.  Of course this assumes that there is enough space locally and that
>> there has been some change in the index in the last 24 hours to warrant
>> doing the merge.
>>
>> Aaron
>>
>>
>> > --
>> > Ravi
>> >
>> >
>> >
>> >
>> > On Sun, Oct 13, 2013 at 12:51 AM, Aaron McCurry <amccurry@gmail.com>
>> > wrote:
>> >
>> > > On Fri, Oct 11, 2013 at 11:12 AM, Ravikumar Govindarajan <
>> > > ravikumar.govindarajan@gmail.com> wrote:
>> > >
>> > > > I came across this interesting JIRA in hadoop
>> > > > https://issues.apache.org/jira/browse/HDFS-385
>> > > >
>> > > > In essence, it allows us more control over where blocks are placed.
>> > > >
>> > > > A BlockPlacementPolicy to optimistically write all blocks of a given
>> > > > index-file onto the same set of datanodes could be highly helpful.
>> > > >
>> > > > Co-locating shard-servers and datanodes along with short-circuit
>> > > > reads should improve performance greatly. We can always utilize the
>> > > > file-system cache for local files. In case a given file is not
>> > > > served locally, the shard-servers can use the block-cache.
>> > > >
>> > > > Do you see some positives in such an approach?
>> > > >
>> > >
>> > > I'm sure that there will be some performance improvements by using
>> > > local reads for accessing HDFS.  The areas where I would expect the
>> > > biggest increases in performance are merging and fetching data for
>> > > retrieval.  Even when accessing local drives, the search time would
>> > > likely not improve, assuming the index is hot.  One of the reasons for
>> > > doing short-cut reads is to make use of the OS file system cache, and
>> > > since Blur already uses a filesystem cache it might just be
>> > > duplicating functionality.  Another big reason for short-cut reads is
>> > > to reduce network traffic.
>> > >
>> > > Consider the scenario where another server has gone down.  If the
>> > > shard server is in an opening state we would have to change the Blur
>> > > layout system to only fail over to the server that contains the data
>> > > for the index.  This might have some problems because the layout of
>> > > the shards is based on how many shards are online on the existing
>> > > servers, not on where the data is located.
>> > >
>> > > So in a perfect world, if the shard fails over to the correct server
>> > > it would reduce the amount of network traffic (to near 0 if it was
>> > > perfect).  So in short it could work, but it might be very hard.
>> > >
>> > > Now assume for a moment that the system is not dealing with a failure
>> > > and that the shard server is also running a datanode.  The first
>> > > replica in HDFS is written to the local datanode, so if we just
>> > > configure Hadoop for short-cut reads we would already get the benefits
>> > > for data fetching and merging.
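>> > >
>> > > For completeness, a minimal sketch of the client-side settings this
>> > > assumes (Hadoop 2.x / HDFS-347 style short-circuit; the socket path
>> > > is a placeholder and must match the datanodes' configuration):
>> > >
>> > >   import org.apache.hadoop.conf.Configuration;
>> > >   import org.apache.hadoop.fs.FileSystem;
>> > >
>> > >   public class ShortCircuitClient {
>> > >     public static FileSystem open() throws Exception {
>> > >       Configuration conf = new Configuration();
>> > >       // Enable short-circuit (local) reads on the client side.
>> > >       conf.setBoolean("dfs.client.read.shortcircuit", true);
>> > >       // Domain socket shared with the local datanode; the path is an
>> > >       // assumption, set it to whatever the datanodes use.
>> > >       conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
>> > >       return FileSystem.get(conf);
>> > >     }
>> > >   }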
>> > >
>> > > I had another idea that I wanted to run by you.
>> > >
>> > > What if, instead of actively placing blocks during writes with a
>> > > Hadoop block placement policy, we write something like the Hadoop
>> > > balancer?  We would get Hadoop to move the blocks to the shard server
>> > > (datanode) that is hosting the shard.  That way it would be
>> > > asynchronous, and after a failure / shard movement it would relocate
>> > > the blocks and still use a short-cut read.  Kind of like the blocks
>> > > following the shard around, instead of deciding up front where the
>> > > data for the shard should be located.  If this is what you were
>> > > thinking, I'm sorry for not understanding your suggestion.
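>> > >
>> > > One crude way to approximate that "blocks follow the shard" behaviour,
>> > > without writing a custom balancer, would be to simply re-write a file
>> > > from the shard server itself, since the first replica of each block
>> > > then lands on the local datanode.  A rough, untested sketch; it
>> > > ignores any concurrent readers/writers and is purely illustrative:
>> > >
>> > >   import org.apache.hadoop.conf.Configuration;
>> > >   import org.apache.hadoop.fs.FileSystem;
>> > >   import org.apache.hadoop.fs.FileUtil;
>> > >   import org.apache.hadoop.fs.Path;
>> > >
>> > >   public class RelocateByRewrite {
>> > >     /**
>> > >      * Copies a file back through the local client so that replica #1
>> > >      * of every block ends up on this host's datanode, then swaps it
>> > >      * into place.  Not safe against a file that is currently open.
>> > >      */
>> > >     public static void rewriteLocally(FileSystem fs, Path file,
>> > >         Configuration conf) throws Exception {
>> > >       Path tmp = new Path(file.getParent(), file.getName() + ".relocating");
>> > >       FileUtil.copy(fs, file, fs, tmp, false /* deleteSource */, conf);
>> > >       fs.delete(file, false);
>> > >       fs.rename(tmp, file);
>> > >     }
>> > >   }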
>> > >
>> > > Aaron
>> > >
>> > >
>> > >
>> > > >
>> > > > --
>> > > > Ravi
>> > > >
>> > >
>> >
>>
>
>
