drill-dev mailing list archives

From Paul Rogers <prog...@maprtech.com>
Subject Re: Drill on YARN
Date Sun, 27 Mar 2016 05:00:39 GMT
Hi John,

Thanks for the great info! I’ll comment on each part in a separate e-mail.

You can tell from my wording that I share your concern about the Llama approach: it is a cute
short-term hack, but it does not “play well with others” on a production cluster. That’s
why I’m more focused, at least for now, on how we create a production-quality solution using
only “straight” YARN (or Mesos) features, along with Linux features and Drill enhancements.
(We’ll hold the hacks in reserve if playing by the rules doesn’t work out…)

Let’s focus on CPU usage. We actually have three issues to consider.

1. At present, Drill makes overly generous use of threads, meaning that a heavily loaded Drill-bit
will incur excessive context switches. Hanifi is working with the team to devise a solution
to this issue.

2. When run under YARN (actually, when run alongside other apps), Drill should restrict its
CPU usage for the reasons you so clearly articulated.

3. On top of the Drill best efforts, the OS should be able to enforce CPU limits, again for
the reasons you described.

Our thinking is that the work done for issue 1 will directly help us with issue 2. No matter
how Drill is run, it can be given a CPU limit (in terms of whole cores), and Drill will more-or-less
limit its CPU usage accordingly. After the fix for issue 1, we should run approximately the
same number of threads as cores. It is then an easy step to further limit threads to saturate,
say, just 8 of 16 cores. Here we just mean that we’d have no more than ~8 active threads.
In YARN terminology, we use ~8 “virtual cores”.
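
To make that concrete, here is a minimal sketch of the idea in Java. It is purely illustrative: the class and method names are invented, it is not Drill’s actual execution code, and it just shows how a Drill-bit could cap concurrently running worker threads at its vcore allotment and queue the rest.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /** Illustrative only: cap concurrently running query work at the vcore allotment. */
    public class VcoreLimitedExecutor {
      private final ExecutorService pool;

      public VcoreLimitedExecutor(int allottedVcores) {
        // A fixed-size pool means at most 'allottedVcores' fragments execute at once;
        // anything beyond that waits in the queue instead of spawning more threads.
        this.pool = Executors.newFixedThreadPool(allottedVcores);
      }

      public void submitFragment(Runnable fragment) {
        pool.submit(fragment);
      }

      public void shutdown() {
        pool.shutdown();
      }
    }

With new VcoreLimitedExecutor(8) on a 16-core box, Drill would keep roughly 8 worker threads busy, which is exactly the approximate, best-effort limit described above.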

The next challenge is issue 3: No matter how well Drill limits CPU usage, it can never be
exact: a Java app is simply the wrong level for such fine tuning. Java has no control over
whether its threads run on just 8 physical cores, or whether the OS decides to shift threads
among processors on each context switch. Even the virtual core usage is approximate; we might
run only 8 worker threads for our allotted 8 virtual cores, but there will still be extra
work in things like heartbeat threads, web server threads and so on.

This is what I meant by YARN resource allocations being a “gentlemen’s agreement”: as
best as I can tell, in original YARN, Map-Reduce processes had just one worker thread and
a known memory footprint. The goal was simply to communicate this fixed footprint to YARN.
That model, while great for simple Map-Reduce jobs, is a bit awkward for more complex apps
like Drill.

Your description of cgroups [1], [2] shows how it can solve issue 3 (CPU usage enforcement).
Cgroups operates at the level of the OS process scheduler and can limit overall CPU, can bind
processes to cores, and so on. Cgroups can also limit memory, but we’ll discuss that separately.
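
For reference, here is a rough Java sketch of what a cgroup CPU limit boils down to at the filesystem level. It assumes a cgroup-v1 hierarchy mounted under /sys/fs/cgroup (the path and layout vary by distribution; see [1], [2]), needs root, and is meant only to illustrate the mechanism the kernel enforces, not something Drill itself would do.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CgroupCpuSketch {
      // Assumed cgroup-v1 mount point and group name; adjust for your distribution.
      private static final Path CPU = Paths.get("/sys/fs/cgroup/cpu/drill");

      public static void main(String[] args) throws IOException {
        String drillbitPid = args[0];  // PID of the running Drill-bit process
        Files.createDirectories(CPU);
        // Hard cap: 8 cores' worth of CPU time per 100 ms scheduling period.
        write("cpu.cfs_period_us", "100000");
        write("cpu.cfs_quota_us", "800000");
        // Relative weight under contention (the kernel default is 1024); a group
        // at 2048 gets roughly twice the CPU of a group at 1024 when both are busy.
        write("cpu.shares", "2048");
        // Finally, move the Drill-bit into the group so the limits apply to it.
        write("tasks", drillbitPid);
      }

      private static void write(String file, String value) throws IOException {
        Files.write(CPU.resolve(file), value.getBytes(StandardCharsets.UTF_8));
      }
    }

(Binding to specific cores works the same way through the separate cpuset controller.) YARN’s NodeManager does essentially this per-container bookkeeping when cgroup enforcement is enabled, which is the next point.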

Earlier YARN versions did not enforce container CPU usage; YARN now integrates with cgroups
for that purpose. [3] (YARN enforces memory usage by killing rogue containers. Perhaps
for this reason, YARN does not use cgroups to enforce memory limits. [3])

It looks like the YARN admin can limit the CPU usage of the entire collection of YARN-managed
processes. [5] In your case, if you have 16 cores and reserve 2 for MapR-FS, you’d allocate
14/16 (87.5%) of the CPU to YARN.

YARN’s cgroup integration directly solves issue 3: “CGroups … allows containers to be limited
in their resource usage.” [3] Thus, if Drill can work effectively within a CPU “budget”,
YARN with cgroups will turn Drill’s approximate limit into an enforced limit. (We could
just use cgroups alone, but if Drill has too many threads for 16 cores, then it certainly
has too many threads for a reduced count of, say, 8 cores…) This means you can allocate,
say, 7 of YARN’s 14 cores to Drill.
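
In YARN API terms, that budget is simply part of the container request a Drill Application Master would make. Here is a hedged sketch: the surrounding AM plumbing (registration, the allocate() heartbeat, container launch) is omitted, and the memory and vcore numbers come from the example above, not from any Drill defaults.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class DrillbitContainerRequestSketch {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // One container per Drill-bit: 7 of the node's 14 advertised vcores,
        // plus (purely for the example) 16 GB of memory.
        Resource perDrillbit = Resource.newInstance(16 * 1024, 7);
        rm.addContainerRequest(
            new ContainerRequest(perDrillbit, null, null, Priority.newInstance(0)));

        // A real AM would register with the ResourceManager first, then poll
        // allocate() and launch a Drill-bit in each container it is granted.
      }
    }

With cgroup enforcement enabled on the NodeManagers, those 7 vcores become a real, proportional share of the machine rather than a promise Drill makes about itself.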

The “soft limit” feature of cgroups you described is a bonus: cgroups will give Drill
more CPU when it is available. [4] It seems that cgroups also can set CPU priority so that,
say, a Finance task can get twice the CPU of a Dev. task. (I’ll speculate that if Drill
were to support multiple Drill-bits per node, and each dept. had its own set of Drill-bits,
then this cgroup feature might be a way to prioritize workloads.)

In summary, we can perhaps replicate on YARN much of the success you’ve had with Drill
and Mesos. And, we can probably do it without Llama-style hacks. Of course we’ll have to
do actual tests to see if all of the above works out in practice. Then, as you point out,
we’ve got to translate this discussion into actual documentation that an admin can use to
set up a cluster.

Thanks,

- Paul

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
[2] http://www.janoszen.com/2013/02/06/limiting-linux-processes-cgroups-explained/
[3] https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html
[4] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
[5] http://hortonworks.com/blog/apache-hadoop-yarn-in-hdp-2-2-isolation-of-cpu-resources-in-your-hadoop-yarn-clusters/

> On Mar 26, 2016, at 6:48 AM, John Omernik <john@omernik.com> wrote:
> 
> Paul  -
> 
> Great write-up.
> 
> Your description of Llama and Yarn is both informative and troubling for a
> potential cluster administrator. Looking at this solution, it would appear
> that to use Yarn with Llama, the "citizen", in this case Drill, would have to
> be an extremely good citizen and honor all requests from Llama related to
> deallocation and limits on resources, while in reality there are no
> enforcement mechanisms.  Not that I don't think Drill is a great tool
> written well by great people, but I don't know if I would want to leave my
> cluster SLAs up to Drill bits doing the self-regulation.  Edge cases, etc.
> causing a Drillbit to start taking more resources would be very impactful
> to a cluster, and with more and more people going to highly concurrent,
> multi-tenant solutions, this becomes a HUGE challenge.
> 
> Obviously dynamic allocation, flexing up and down to use "spare" cluster
> resources is very important to many cluster/architecture administrators,
> but if I had to guess, SLAs/Workload guarantees would rank higher.
> 
> The Llama approach seems to be too much of a "house of cards" to me to be
> viable, and I worry that long term it may not be best for a product like
> Drill. Our goal, I think, should be to play nice with others; if our core
> philosophy in integration is playing nice with others, it will only help
> adoption and encourage people to give it a try.  So back to Drill on Yarn (natively)...
> 
> A few questions around this.  You mention that resource allocations are
> mostly a gentlemen's agreement. Can you explore that a bit more?  I do
> believe there is Cgroup support in Yarn.  (I know the Myriad project is
> looking to use Cgroups).  So is this gentlemen's agreement more about when
> Cgroups is NOT enabled?  Thus it is only the word of the process
> running in the container in Yarn?  If this is the case, then has there been
> any research on the stability of Cgroups and the implementation in Yarn?
> Basically, Poll: Are you using Yarn? If so, are you using Cgroups? If not,
> why? If you are using them, any issues?   This may be helpful in what we
> are looking to do with Drill.
> 
> "Hanifi’s work will allow us to increase or decrease the number of cores we
> consume."  Do you have any JIRAs I can follow on this, I am very interested
> in this. One of the benefits of CGroups in Mesos as it relates to CPU
> shares is a sorta built in Dynamic allocation. And it would be interesting
> to test a Yarn Cluster with Cgroups enabled (once a basic Yarn Aware Drill
> bit is enabled) to see if Yarn reacts the same way.
> 
> Basically, when I run a drillbit on a node with Cgroup isolation enabled in
> Marathon on Mesos, let's say I have 16 total cores on the node. For me, I
> run my Mesos-Agent  with "14" available Vcores... Why? Static allocation of
> 2 vcores for MapR-FS.  Those 14 vcores are now available to tasks on the
> agent.  When I start the drillbit, let's say I allocate 8 vcores to the
> drillbit in Marathon.  Drill runs queries, and let's say the actual CPU
> usage on this node is minimal at the time. Drill, because it is not
> currently CPU aware, takes all the CPU it can (it will use all 16 cores).
> When the query finishes, it goes back to 0. But what happens if MapR is heavily using
> its 2 cores? Well, Cgroups detects contention and limits Drill because
> it's only allocated 8 shares of those 14 it's aware of; this gives priority
> to the MapR operations. Even more so, if there are other Mesos tasks asking
> for CPU shares, Drill's CPU share is being scaled back, not by telling
> Drill it can't use cores, but by processing what Drill is trying to do
> slower compared to the rest of the workloads.   I know I am dumbing
> this down, but that's how I understand Cgroups working. Basically, I was
> very concerned when I first started doing Drill queries in Mesos, and I
> posted to the Mesos list, where some people smarter than I took the time
> to explain things. (Vinode, you are lurking on this list, thanks again!)
> 
> In a way, this is actually a nice side effect of Cgroup Isolation: from
> Drill's perspective it gets all the CPU, and is only scaled back on
> contention.  So, my long explanation here is to bring things back to the
> Yarn/Cgroup/Gentlemen's agreement comment. I'd really want to understand
> this. As a cluster administrator, I can guarantee a level of resources with
> Mesos. Can I get that same guarantee in Yarn? Is it only with certain
> settings?  I just want to be 100% clear that if we go that route, and make
> Drill work on Yarn, our documentation/instructions are explicit
> in what we are giving the user on Yarn.  To me, a bad situation would occur
> when someone thinks all will be well when they run Drill on Yarn, and
> because they are not aware of their own settings (say, not enabling Cgroups),
> they blame Drill for breaking something.
> 
> So that falls back to Memory and scaling memory in Drill.  Memory for
> obvious reasons can't operate like CPU with Cgroups. You can't allocate all
> memory to all the things, and then scale back on contention. So, being a
> complete neophyte on the inner workings of Drill memory: what options would
> exist for allocation of memory?  Could we trigger events that would
> allocate up and down what memory a given drillbit can use so it self
> limits? It currently self limits because we can see the memory settings in
> drill-env.sh.  But what about changing that at a later time?   Is it easier
> to change direct memory limits rather than Heap?
> 
> Hypothesis 1:  If Direct memory isn't allocated (i.e. a drillbit is idle)
> then setting what it could POTENTIALLY use if a query would come in would
> be easier than actually deallocating Heap that's been used. Hypothesis 2:
> Direct Memory, if it's truly deallocated when not in use, is more about the
> limit of what it could use, not about allocating or deallocating memory.
> Hence a nice step one may be to allow this to change as needed by an
> Application Master in Yarn (or a Mesos Framework).
> 
> If the work to change the limit on Direct Memory usage were easier, it may
> be a good first step (assuming I am not completely wrong on memory
> allocation). If we have to statically allocate Heap, and it's a big change
> in code to make that dynamic, but Direct Memory is easy to change, that's a
> great first feature without boiling the ocean. Obviously lots of
> assumptions here, but I am just thinking out loud.
> 
> Paul - When it comes to Applications in Yarn, and the containers that the
> Application Master allocates, can containers be joined?  Let's say I am an
> application master, and I allocated 4 CPU Cores and 16 GB of RAM to a
> Drillbit (8 for Heap and 8 for Direct).  Then at a later time I can add
> more memory to the drill bit.... If my assumptions worked on Direct memory
> in Drill, could my Application Master tell the drill bit, ok you can use 16
> GB of direct memory now (i.e. the AM asks the RM to allocate 8 more GB of
> RAM on that node, the RM agrees, and allocates another container)? Can it
> just resize, or would that not work? I guess what I am describing here is
> sorta what Llama is doing... but I am actually talking about the ability to
> enforce the quotas.... This may actually be a question that fits into your
> discussion on resizing Yarn containers more than anything.
> 
> So I just tossed out a bunch of ideas here to keep the discussion running.
> Drill Devs, I would love a better understanding of the memory allocation
> mechanisms within Drill. (High level, neophyte here).  I do feel as a
> cluster admin, as I have said, that the Llama approach (now that I
> understand it better) would worry me, especially in a multi-tenant
> cluster.  And as you said Paul, it "feels" hacky.
> 
> Thanks for this discussion, it's a great opportunity for Drill adoption as
> clusters go more and more multi-tenant/multi-use.
> 
> John
> 
> 
> 
> 
> On Fri, Mar 25, 2016 at 5:45 PM, Paul Rogers <progers@maprtech.com> wrote:
> 
>> Hi Jacques,
>> 
>> Llama is a very interesting approach; I read their paper [1] early on; just
>> went back and read it again. Basically, Llama (as best as I can tell) has a
>> two-part solution.
>> 
>> First, Impala is run off-YARN (that is, not in a YARN container). Llama
>> uses “dummy” containers to inform YARN of Impala’s resource usage. They can
>> grow/shrink static allocations by launching more dummy containers. Each
>> dummy container does nothing other than inform off-YARN Impala of the
>> container resources. Rather clever, actually, even if it “abuses the
>> software” a bit.
>> 
>> Secondly, Llama is able to dynamically grab spare YARN resources on each
>> node. Specifically, Llama runs a Node Manager (NM) plugin that watches
>> actual node usage. The plugin detects the free NM resources and informs
>> Impala of them. Impala then consumes the resources as needed. When the NM
>> allocates a new container, the plugin informs Impala which relinquishes the
>> resources. All this works because YARN allocations are mostly a gentleman’s
>> agreement. Again, this is pretty clever, but only one app per node can play
>> this game.
>> 
>> The Llama approach could work for Drill. The benefit is that Drill runs as
>> it does today. Hanifi’s work will allow us to increase or decrease the
>> number of cores we consume. The draw-back is that Drill is not yet ready to
>> play the memory game: it can’t release memory back to the OS when
>> requested. Plus, the approach just smells like a hack.
>> 
>> The “pure-YARN” approach would be to let YARN start/stop the Drill-bits.
>> The user can grow/shrink Drill resources by starting/stopping Drill-bits.
>> (This is simple to do if one ignores data locality and starts each
>> Drill-bit on a separate node. It is a bit more work if one wants to
>> preserve data locality by being rack-aware, or by running multiple
>> drill-bits per node.)
>> 
>> YARN has been working on the ability to resize running containers. (See
>> YARN-1197 - Support changing resources of an allocated container [2]) Once
>> that is available, we can grow/shrink existing Drill-bits (assuming that
>> Drill itself is enhanced as discussed above.) The promise of resizable
>> containers also suggests that the “pure-YARN” approach is workable.
>> 
>> Once resizable containers are available, one more piece is needed to let
>> Drill use free resources. Some cluster-wide component must detect free
>> resources and offer them to applications that want them, deciding how to
>> divvy up the resources between, say, Drill and Impala. The same piece would
>> revoke resources when paying YARN customers need them.
>> 
>> Of course, if the resizable container feature comes too late, or does not
>> work well, we still have the option of going off-YARN using the Llama
>> trick. But the Llama trick does nothing to address the cluster-wide coordination
>> discussed above.
>> 
>> So, the thought is: start simple with a “stock” YARN app. Then, we can add
>> bells and whistles as we gain experience and as YARN offers more
>> capabilities.
>> 
>> The nice thing about this approach is that the same idea plays well with
>> Mesos (though the implementation is different).
>> 
>> Thanks,
>> 
>> - Paul
>> 
>> [1] http://cloudera.github.io/llama/
>> [2] https://issues.apache.org/jira/browse/YARN-1197
>> 
>>> On Mar 24, 2016, at 2:34 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>> 
>>> Your proposed allocation approach makes a lot of sense. I think it will
>>> solve a large number of use cases. Thanks for giving an overview of the
>>> different frameworks. I wonder if they got too focused on the simple use
>>> case....
>>> 
>>> Have you looked at LLama to see whether we could extend it for our needs?
>>> It's Apache licensed and probably has at least a start at a bunch of
>> things
>>> we're trying to do.
>>> 
>>> https://github.com/cloudera/llama
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> On Tue, Mar 22, 2016 at 7:42 PM, Paul Rogers <progers@maprtech.com>
>> wrote:
>>> 
>>>> Hi Jacques,
>>>> 
>>>> I’m thinking of “semi-static” allocation at first. Spin up a cluster of
>>>> Drill-bits, after which the user can add or remove nodes while the
>> cluster
>>>> runs. (The add part is easy, the remove part is a bit tricky since we
>> don’t
>>>> yet have a way to gracefully shut down a Drill-bit.) Once we get the
>> basics
>>>> to work, we can incrementally try out dynamics. For example, someone
>> could
>>>> whip up a script to look at load and use the proposed YARN client app to
>>>> adjust resources. Later, we can fold dynamic load management into the
>>>> solution once we’re sure what folks want.
>>>> 
>>>> I did look at Slider, Twill, Kitten and REEF. Kitten is too basic. I had
>>>> great hope for Slider. But, it turns out that Slider and Weave have each
>>>> built an elaborate framework to isolate us from YARN. The Slider
>> framework
>>>> (written in Python) seems harder to understand than YARN itself. At
>> least,
>>>> one has to be an expert in YARN to understand what all that Python code
>>>> does. And, just looking at the class count in the Twill Javadoc was
>>>> overwhelming. Slider and Twill have to solve the general case. If we
>> build
>>>> our own Java solution, we only have to solve the Drill case, which is
>>>> likely much simpler.
>>>> 
>>>> A bespoke solution would seem to offer some other advantages. It lets us
>>>> do things like integrate ZK monitoring so we can learn of zombie drill
>> bits
>>>> (haven’t exited, but not sending heartbeat messages.) We can also gather
>>>> metrics and historical data about the cluster as a whole. We can try out
>>>> different cluster topologies. (Run Drill-bits on x of y nodes on a rack,
>>>> say.) And, we can eventually do the dynamic load management we discussed
>>>> earlier.
>>>> 
>>>> But first, I look forward to hearing what others have tried and what
>> we’ve
>>>> learned about how people want to use Drill in a production YARN cluster.
>>>> 
>>>> Thanks,
>>>> 
>>>> - Paul
>>>> 
>>>> 
>>>>> On Mar 22, 2016, at 5:45 PM, Jacques Nadeau <jacques@dremio.com>
>> wrote:
>>>>> 
>>>>> This is great news, welcome!
>>>>> 
>>>>> What are you thinking in regards to static versus dynamic resource
>>>>> allocation? We have some conversations going regarding workload
>>>> management
>>>>> but they are still early so it seems like starting with user-controlled
>>>>> allocation makes sense initially.
>>>>> 
>>>>> Also, have you spent much time evaluating whether one of the existing
>>>> YARN
>>>>> frameworks such as Slider would be useful? Does anyone on the list have
>>>> any
>>>>> feedback on the relative merits of these technologies?
>>>>> 
>>>>> Again, glad to see someone picking this up.
>>>>> 
>>>>> Jacques
>>>>> 
>>>>> 
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>> 
>>>>> On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers <progers@maprtech.com>
>>>> wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I’m a new member of the Drill Team here at MapR. We’d like to take a look
>>>>>> at running Drill on YARN for production customers. JIRA suggests some early
>>>>>> work may have been done (DRILL-142 <https://issues.apache.org/jira/browse/DRILL-142>,
>>>>>> DRILL-1170 <https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675
>>>>>> <https://issues.apache.org/jira/browse/DRILL-3675>).
>>>>>> 
>>>>>> YARN is a complex beast and the Drill community is large and growing.
>>>> So,
>>>>>> a good place to start is to ask if anyone has already done work on
>>>>>> integrating Drill with YARN (see DRILL-142)?  Or has thought about
>> what
>>>>>> might be needed?
>>>>>> 
>>>>>> DRILL-1170 (YARN support for Drill) seems a good place to gather
>>>>>> requirements, designs and so on. I’ve posted a “starter set” of
>>>>>> requirements to spur discussion.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> - Paul

