accumulo-user mailing list archives

From Josh Elser <>
Subject Re: virtualize accumulo?
Date Wed, 06 Nov 2013 19:02:15 GMT
Ah, that was you :)

You can find the documentation at:, 
specifically you'd be interested in 
I'll try to see if I can get the documentation links fixed.

HOYA uses Hadoop's YARN to perform this provisioning. It uses HDFS as 
shared storage, and then leverages the YARN APIs to run across 
a cluster. In actuality, it wouldn't matter whether you're running on 
bare metal or on virtualized hosts.

On 11/6/13, 1:41 PM, Kesten Broughton wrote:
> Thanks a lot for the quick responses.
> So Donald, if I get you correctly, you recommend against a hybrid approach, but if multi-tenancy
> and resource utilization are big factors (they are) then a pure virtualized approach might
> be appropriate?  It's just a far less trodden path.  We are working towards an OpenStack environment,
> which may help with the networking configuration component.
> Hoya looks interesting, but unfortunately all the links are currently 404 landspeeder
> (I submitted an issue).
> As far as utilization goes, is Hoya PXE/Cobbler booting bare metal from a bare-metal
> resource pool? This would certainly be much slower, but might be suitable.
> Or would we have to load our bare-metal pool with all the resources in our stack, then
> remove a machine from one service (an Elasticsearch cluster, say) and add it to Accumulo?
> That might be sane.
> This is good food for thought as we consider our options.
> kesten
> ________________________________________
> From: Donald Miner []
> Sent: Tuesday, November 05, 2013 2:45 PM
> To:
> Subject: Re: virtualize accumulo?
> I think a hybrid approach is probably more pain than it's worth. The configuration
> of the networking and the IP addresses across virtual and physical hosts will be challenging
> but not impossible. Also, what are you trying to isolate Accumulo from? MapReduce perhaps?
> A large Storm instance? Either way, you'll have to think about how to virtualize and provision
> those things, too. Now your host is dealing with VMs and HDFS services. None of these are
> really show-stopper excuses, so you really could do what you are trying to do, but you'd be
> paving your own way.
> I'm pretty sure I agree with Josh on this one, but wanted to explain the pure virtualization approach.
> The VMWare thing you mentioned might have been this thing:
> (marketing)
> (more technical, but less breadth)
> I'm a big proponent of these as they really do solve a couple of fundamental problems (disclaimer:
> I used to work for Pivotal, who helped push this solution). The neat thing they added in
> the extensions was the understanding of data locality between TaskTrackers and DataNodes if
> they reside on the same physical host in different virtual machines. This means that jobs
> would get assigned to TTs within the same "node group", which is nice for a couple of reasons.
> Most prominently, it allows you to separate the HDFS and MR services into different VMs while
> maintaining data locality. This is good for scaling compute separately from storage, particularly
> in a multi-tenant environment. Another cool thing is you can "shut off" the execution environment:
> spin down the VMs with the TTs but leave the DNs alone. There are some other things they did
> to make this architecture make more sense.
> So getting back to your question, hypothetically, you could have multiple HDFS instances
> on the same cluster (neat), each supporting one or more Accumulo instances, each of which
> can be handled independently of one another. Your MR and other things can also use VMs and
> you have pretty good compartmentalization of resource utilization. This would give you multi-tenancy
> and would allow you to manage separate services running over HDFS as separate clusters. You
> could also stop tablet servers while keeping HDFS (and perhaps MapReduce) alive, which could
> be interesting if you want to start up a proof of concept but don't need the service to be
> live all the time.
> In that VMware paper they mention that performance actually increases with this DN/TT
> separation scheme over bare metal, but be wary of the numbers. There is no doubt overhead
> in having a virtualization layer. But, if multi-tenancy and elasticity are important to you,
> this could be one way to make that tradeoff.
> -Don
> On Tue, Nov 5, 2013 at 3:31 PM, Josh Elser wrote:
> Hi Kesten,
> As you likely know (given your arguments against), applying virtualization to a Hadoop stack
> can introduce some unintended consequences. Hadoop has a lot of heartbeats between processes
> to determine system "aliveness". If your infrastructure is overloaded, Hadoop can really suffer
> from spikes in latency.
> Accumulo is much the same way, arguably a bit more. Accumulo's processes are very dependent
> on maintaining a lock in ZooKeeper (every 30 seconds by default) instead of RPC calls between
> DataNodes and NameNodes. Accumulo's node failure tends to be much more expensive than HDFS'
> because Accumulo wants to make sure every tablet is available without significant downtime.
> Hadoop has multiple replicas for each file so it can be a bit lazier about noticing failure
> and re-replicating. What I've typically heard is that running Accumulo in a virtualized environment
> makes administration and use a bit more difficult.
> If you're considering running HDFS on bare metal, I would encourage you to do the same
> with Accumulo or investigate something like YARN (really, HOYA)
> to do dynamic provisioning. Accumulo has the ability to happily scale and run across many
> nodes, so you shouldn't have to worry about large installation problems (in other words: one
> Accumulo instance should be sufficient for a cluster). YARN/HOYA gives you the dynamic allocations
> on top of your cluster to have the ease of spinning up and down Accumulo clusters as you want/need.
> On 11/5/13, 3:21 PM, Kesten Broughton wrote:
> I've seen arguments both for and against virtualizing hadoop/hdfs.
> (the arguments for were from vmware :)
> We are considering hdfs on baremetal, with accumulo being virtualized.
> This would serve a fairly constant amount of data but widely varying compute demands.
> Has anyone tried this? Can anyone share their experience with baremetal/virtualization
> with accumulo?
> thanks
> kesten
> (first post)
