drill-dev mailing list archives

From Paul Rogers <prog...@maprtech.com>
Subject Re: Drill on YARN
Date Sun, 27 Mar 2016 20:25:30 GMT
Hi John,

The other main topic of your discussion is memory management. Here we seem to have 6 topics:

1. Setting the limits for Drill.
2. Drill respects the limits.
3. Drill lives within its memory “budget.”
4. Drill throttles work based on available memory.
5. Drill adapts memory usage to available memory.
6. Some means to inform Drill of increases (or decreases) in memory allocation.

YARN, via container requests, solves the first problem. Someone (the network admin) has to
decide on the size of each drill-bit container, but YARN handles allocating the space, preventing
memory oversubscription, and enforcing the limit (by killing processes that exceed their allocation.)

As you pointed out, memory management is different than CPU: we can’t just expect Linux
to silently give us more or less depending on load. Instead, Drill itself has to actively
request and release memory (and know what to do in each case.)

Item 2 says that Drill must limit its memory use. The JVM enforces heap size. (As the heap
is exhausted, a Java program gets slower due to increased garbage-collection events until
finally it receives an out-of-memory error.)

At present I’m still learning the details of how Drill manages memory so, by necessity,
most of what follows is at the level of “what we could do” rather than “how it works
today.” Drill devs, please help fill in the gaps.

The docs suggest we have a variety of settings that configure Drill memory (heap size, off-heap
size, etc.). I need to ask around more to learn if Drill does, in fact, limit its off-heap
memory usage. If not, then perhaps this is a change we want to make.
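For concreteness, the settings I mean live in drill-env.sh. Something like the following (illustrative values; the exact variable names may differ by Drill version):

```shell
# drill-env.sh (illustrative values only)
# Heap used by the JVM for planning, metadata, RPC, etc.
export DRILL_HEAP="4G"
# Off-heap (direct) memory used for record batches and buffers.
export DRILL_MAX_DIRECT_MEMORY="8G"
```

The open question is whether the direct-memory value is a hard limit that Drill enforces, or merely advisory.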

Once Drill respects memory limits, we move to item 3: Drill should live within the limits.
By this I mean that query operations should work with constrained memory, perhaps by spilling
to disk — it is not sufficient to simply fail when memory is exhausted. Again, I don’t
yet know where we’re at here, but I understand we may still have a bit of work to do to
achieve this goal.
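To sketch what I mean by "live within the budget" (purely illustrative, not Drill's actual operator code): an operator buffers rows until its memory budget is exhausted, then spills the in-memory batch to disk instead of failing:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch only: a buffering "operator" that spills rows to a
// temp file once its in-memory budget is exhausted, instead of failing.
public class SpillingBuffer {
    private final long budgetBytes;
    private long usedBytes = 0;
    private final List<byte[]> inMemory = new ArrayList<>();
    private Path spillFile;           // lazily created on first spill
    private int spilledRows = 0;

    public SpillingBuffer(long budgetBytes) { this.budgetBytes = budgetBytes; }

    public void add(byte[] row) throws IOException {
        if (usedBytes + row.length > budgetBytes) {
            spill();                  // free memory rather than fail
        }
        inMemory.add(row);
        usedBytes += row.length;
    }

    private void spill() throws IOException {
        if (spillFile == null) {
            spillFile = Files.createTempFile("drill-spill", ".bin");
        }
        try (OutputStream out = new BufferedOutputStream(
                Files.newOutputStream(spillFile, StandardOpenOption.APPEND))) {
            for (byte[] row : inMemory) {
                out.write(row);
                spilledRows++;
            }
        }
        inMemory.clear();
        usedBytes = 0;
    }

    public int inMemoryRows() { return inMemory.size(); }
    public int spilledRows()  { return spilledRows; }

    public static void main(String[] args) throws IOException {
        SpillingBuffer buf = new SpillingBuffer(1024);  // 1 KB budget
        for (int i = 0; i < 10; i++) {
            buf.add(new byte[200]);   // 10 rows x 200 B exceeds the budget
        }
        System.out.println("in-memory=" + buf.inMemoryRows()
                + " spilled=" + buf.spilledRows());
    }
}
```

The real problem is of course much harder (sorts, hash tables, exchange buffers), but the principle is the same: exhausting the budget triggers spilling, not failure.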

Item 4 looks at the larger picture. Suppose a Drill-bit has 32GB of memory available to it.
We do the work needed so that any given query can succeed within this limit (perhaps slowly
if operations spill to disk.) But, what happens when the same Drill-bit now has to process
10 such queries or 100? We now have a much harder problem: having the collection of ALL queries
live within the same 32GB limit.

One solution is to simply hold queries in a queue when memory (or even CPU) becomes constrained.
That is, rather than trying to run all 100 queries at once (slowly), perhaps run 20 at a time
(quickly), allowing each a much larger share of memory.
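The admission-control idea, reduced to a toy sketch with a counting semaphore (Drill's actual queues are distributed and more involved; this only shows the shape of the idea):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch: admit at most N queries at once; the rest wait in line,
// so each admitted query gets a larger share of memory.
public class AdmissionQueue {
    private final Semaphore slots;
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    public AdmissionQueue(int maxConcurrent) {
        this.slots = new Semaphore(maxConcurrent);
    }

    public void runQuery(Runnable query) throws InterruptedException {
        slots.acquire();                       // wait if too many are running
        try {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            query.run();
        } finally {
            running.decrementAndGet();
            slots.release();
        }
    }

    public int peakConcurrency() { return peak.get(); }

    public static void main(String[] args) throws Exception {
        AdmissionQueue queue = new AdmissionQueue(2);   // admit 2 at a time
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                try {
                    queue.runQuery(() -> { /* pretend to run a query */ });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("peak concurrency = " + queue.peakConcurrency());
    }
}
```

The hard part in practice is choosing N: ideally it derives from the per-drill-bit memory budget, not a fixed constant.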

Drill already has queues, but they are off by default. We may have to look at turning them
on by default. Again, I’m not familiar with our queuing strategy, but there seems quite
a bit we could do to release queries from the queue only when we can give them adequate resources
on each drill-bit.

Item 5 says that Drill should be opportunistic. If some external system can grant a temporary
loan of more memory, Drill should be able to use it. When the loan is revoked, Drill should
relinquish the memory, perhaps by spilling the data to disk (or moving the data to other parts
of memory.) Java programs can’t release heap memory, but Drill uses off-heap, so it is at
least theoretically possible to release memory back to the OS. It sounds like Drill needs a
number of improvements before it can actually release off-heap memory.
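As a toy illustration of why off-heap memory makes this at least plausible (this is not Drill's actual allocator): direct memory is explicitly allocated and released, so an allocator can enforce a budget, and an external system could resize that budget at runtime:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Toy sketch of a budgeted off-heap allocator. Unlike heap memory, direct
// buffers are requested explicitly, so an allocator can enforce a budget
// and, in principle, hand memory back to the OS when the budget shrinks.
public class DirectBudget {
    private final AtomicLong budget;   // bytes we are allowed to hold
    private final AtomicLong used = new AtomicLong();

    public DirectBudget(long budgetBytes) {
        this.budget = new AtomicLong(budgetBytes);
    }

    public ByteBuffer allocate(int bytes) {
        long newUsed = used.addAndGet(bytes);
        if (newUsed > budget.get()) {          // over budget: refuse (or spill)
            used.addAndGet(-bytes);
            return null;
        }
        return ByteBuffer.allocateDirect(bytes);
    }

    public void release(ByteBuffer buf) {
        // Dropping the reference lets the JVM reclaim the direct buffer;
        // a real allocator would free it eagerly.
        used.addAndGet(-buf.capacity());
    }

    // Item 6: an external system (say, a resized YARN container) could
    // raise or lower the budget while the drill-bit runs.
    public void resizeBudget(long newBytes) { budget.set(newBytes); }

    public long usedBytes() { return used.get(); }

    public static void main(String[] args) {
        DirectBudget budget = new DirectBudget(1024);
        ByteBuffer buf = budget.allocate(512);
        System.out.println("allocated=" + (buf != null)
                + " used=" + budget.usedBytes());
    }
}
```
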

Finally, item 6 says we need that external system to loan Drill the extra memory. With CPU,
the process scheduler can solve the problem all on its own by looking at system load and deciding,
at any instant, which processes to run. Memory is harder.

One solution would be for YARN to resize Drill’s container. But, YARN does not yet support
resizing containers. YARN-1197: "Support changing resources of an allocated container" [1]
describes the tasks needed to get there. Once that feature is complete, YARN will let an application
ask for more memory (or release excess memory). Presumably the app or a user must decide to
request more memory. For example, the admin might dial up Drill memory during the day when
the marketing folks are running queries, but dial it back at night when mostly batch jobs run.
The ability to manually change memory is great, but the ideal would be to have some automated
way to use free memory on each node. Llama does this in an ad-hoc manner. A quick search on
YARN did not reveal anything in this vein, so we’ll have to research this idea a bit more.
I wonder, though, if Drill could actually handle fast-moving allocation changes; changes on
the order of a query's lifetime seem more achievable (that is, on the order of minutes
to hours).

In short, it seems we have quite a few tasks ahead in the area of memory management. Each
seems achievable, but each requires work. The Drill-on-YARN project is just a start: it helps
the admin allocate memory between Drill and other apps.


- Paul 

[1] https://issues.apache.org/jira/browse/YARN-1197

> On Mar 26, 2016, at 6:48 AM, John Omernik <john@omernik.com> wrote:
> Paul  -
> Great write-up.
> Your description of Llama and Yarn is both informative and troubling for a
> potential cluster administrator. Looking at this solution, it would appear
> that to use Yarn with Llama, the "citizen" in this case Drill would have to
> be an extremely good citizen and honor all requests from Llama related to
> deallocation and limits on resources while in reality there are no
> enforcement mechanisms.  Not that I don't think Drill is a great tool
> written well by great people, but I don't know if I would want to leave my
> cluster SLAs up to Drill bits doing the self regulation.  Edge cases, etc
> causing a Drillbit to start taking more resources would be very impactful
> to a cluster, and with more and more people going to highly concurrent,
> multi-tenant solutions, this becomes a HUGE challenge.
> Obviously dynamic allocation, flexing up and down to use "spare" cluster
> resources is very important to many cluster/architecture administrators,
> but if I had to guess, SLAs/Workload guarantees would rank higher.
> The Llama approach seems to be too much of a "house of cards" to me to be
> viable, and I worry that long term it may not be best for a product like
> Drill. Our goal I think should be to play nice with others, if our core
> philosophy in integration is playing nice with others, it will only help
> adoption and people giving it a try.  So back to Drill on Yarn (natively)...
> A few questions around this.  You mention that resource allocations are
> mostly a gentlemen's agreement. Can you explore that a bit more?  I do
> believe there is Cgroup support in Yarn.  (I know the Myriad project is
> looking to use Cgroups).  So is this gentlemen's agreement more about when
> Cgroups is NOT enabled?  Thus it is only the word of the process
> running in the container in Yarn?  If this is the case, then has there been
> any research on the stability of Cgroups and the implementation in Yarn?
> Basically, Poll: Are you using Yarn? If so are you using Cgroups? If not,
> why? If you are using them, any issues?   This may be helpful in what we
> are looking to do with Drill.
> "Hanifi’s work will allow us to increase or decrease the number of cores we
> consume."  Do you have any JIRAs I can follow on this, I am very interested
> in this. One of the benefits of CGroups in Mesos as it relates to CPU
> shares is a sorta built in Dynamic allocation. And it would be interesting
> to test a Yarn Cluster with Cgroups enabled (once a basic Yarn Aware Drill
> bit is enabled) to see if Yarn reacts the same way.
> Basically, when I run a drillbit on a node with Cgroup isolation enabled in
> Marathon on Mesos, lets say I have 16 total cores on the node. For me, I
> run my Mesos-Agent  with "14" available Vcores... Why? Static allocation of
> 2 vcores for MapR-FS.  Those 14 vcores are now available to tasks on the
> agent.  When I start the drillbit, let's say I allocate 8 vcores to the
> drillbit in Marathon.  Drill runs queries, and let's say the actual CPU
> usage on this node is minimal at the time, Drill, because it is not
> currently CPU aware, takes all the CPU it can. (it will use all 16 cores).
> Query finishes it goes back to 0. But what happens if MapR is heavily using
> its 2 cores? Well, Cgroups detects contention and limits Drill because
> it's only allocated 8 shares of those 14 it's aware of, this gives priority
> to the MapR operations. Even more so, if there are other Mesos tasks asking
> for CPU shares, Drill's CPU share is being scaled back, not by telling
> Drill it can't use core, but by processing what Drill is trying to do
> slower compared to the rest of the workloads.   I know I am dumbing
> this down, but that's how I understand Cgroups working. Basically, I was
> very concerned when I first started doing Drill queries in Mesos, and I
> posted to the Mesos list to which some people smarter than I took the time
> to explain things. (Vinode, you are lurking on this list, thanks again!)
> In a way, this is actually a nice side effect of Cgroup Isolation, from
> Drill's perspective it gets all the CPU, and is only scaled back on
> contention.  So, my long explanation here is to bring things back to the
> Yarn/Cgroup/Gentlemen's agreement comment. I'd really want to understand
> this. As a cluster administrator, I can guarantee a level of resources with
> Mesos, Can I get that same guarantee in Yarn? Is it only with certain
> settings?  I just want to be 100% clear that if we go the route, and make
> Drill work on Yarn, that in our documentation/instructions we are explicit
> in what we are giving the user on Yarn.  To me, a bad situation would occur
> when someone thinks all will be well when they run Drill on Yarn, and
> because they are not aware of their own settings (say not enabling Cgroups)
> They blame Drill for breaking something.
> So that falls back to Memory and scaling memory in Drill.  Memory for
> obvious reasons can't operate like CPU with Cgroups. You can't allocate all
> memory to all the things, and then scale back if contention. So being a
> complete neophyte on the inner workings of Drill memory. What options would
> exist for allocation of memory.  Could we trigger events that would
> allocate up and down what memory a given drillbit can use so it self
> limits? It currently self limits because we can see the memory settings in
> drill-env.sh.  But what about changing that at a later time?   Is it easier
> to change direct memory limits rather than Heap?
> Hypothesis 1:  If Direct memory isn't allocated (i.e. a drillbit is idle)
> then setting what it could POTENTIALLY use if a query would come in would
> be easier then actually deallocating Heap that's been used. Hypothesis 2:
> Direct Memory, if it's truly deallocated when not in use is more about the
> limit of what it could use not about allocated or deallocating memory.
> Hence a nice step one may be to allow this to change as needed by an
> Application Master in Yarn (or a Mesos Framework)
> If the work to change the limit on Direct Memory usage was easier, it may
> be a good first step, (assuming I am not completely wrong on memory
> allocation) if we have to statically allocate Heap, and it's a big change
> in code to make that dynamic, but Direct Memory is easy to change, that's a
> great first feature, without boiling the ocean. Obviously lots of
> assumptions here, but I am just thinking out loud.
> Paul - When it comes to Application in Yarn, and the containers that the
> Application Master allocates, can containers be joined?  Let's say I am an
> application master, and I allocated 4 CPU Cores and 16 GB of ram to a
> Drillbit. (8 for Heap and 8 for Direct) .  Then at a later time I can add
> more memory to the drill bit.... If my assumptions worked on Direct memory
> in Drill, could my Application master tell the drill bit, ok you can use 16
> GB of direct memory now (i.e. the AM asks the RM to allocate 8 more GB of
> ram on that node, the RM agrees, and allocates another container, can it
> just resize, or would that not work? I guess what I am describing here is
> sorta what Llama is doing... but I am actually talking about the ability to
> enforce the quotas.... This may actually be a question that fits into your
> discussion on resizing Yarn containers more than anything.
> So i just tossed out a bunch of ideas here to keep discussion running.
> Drill Devs, I would love a better understanding of the memory allocation
> mechanisms within Drill. (High level, neophyte here).  I do feel as a
> cluster admin, as I have said, that the Llama approach (now that I
> understand it better) would worry me, especially in a multi-tenant
> cluster.  And as you said Paul, it "feels" hacky.
> Thanks for this discussion, it's a great opportunity for Drill adoption as
> clusters go more and more multi-tenant/multi-use.
> John
> On Fri, Mar 25, 2016 at 5:45 PM, Paul Rogers <progers@maprtech.com> wrote:
>> Hi Jacques,
>> Llama is a very interesting approach; I read their paper [1] early on; just
>> went back and read it again. Basically, Llama (as best as I can tell) has a
>> two-part solution.
>> First, Impala is run off-YARN (that is, not in a YARN container). Llama
>> uses “dummy” containers to inform YARN of Impala’s resource usage. They can
>> grow/shrink static allocations by launching more dummy containers. Each
>> dummy container does nothing other than inform off-YARN Impala of the
>> container resources. Rather clever, actually, even if it “abuses the
>> software” a bit.
>> Secondly, Llama is able to dynamically grab spare YARN resources on each
>> node. Specifically, Llama runs a Node Manager (NM) plugin that watches
>> actual node usage. The plugin detects the free NM resources and informs
>> Impala of them. Impala then consumes the resources as needed. When the NM
>> allocates a new container, the plugin informs Impala which relinquishes the
>> resources. All this works because YARN allocations are mostly a gentleman’s
>> agreement. Again, this is pretty clever, but only one app per node can play
>> this game.
>> The Llama approach could work for Drill. The benefit is that Drill runs as
>> it does today. Hanifi’s work will allow us to increase or decrease the
>> number of cores we consume. The draw-back is that Drill is not yet ready to
>> play the memory game: it can’t release memory back to the OS when
>> requested. Plus, the approach just smells like a hack.
>> The “pure-YARN” approach would be to let YARN start/stop the Drill-bits.
>> The user can grow/shrink Drill resources by starting/stopping Drill-bits.
>> (This is simple to do if one ignores data locality and starts each
>> Drill-bit on a separate node. It is a bit more work if one wants to
>> preserve data locality by being rack-aware, or by running multiple
>> drill-bits per node.)
>> YARN has been working on the ability to resize running containers. (See
>> YARN-1197 - Support changing resources of an allocated container [2]) Once
>> that is available, we can grow/shrink existing Drill-bits (assuming that
>> Drill itself is enhanced as discussed above.) The promise of resizable
>> containers also suggests that the “pure-YARN” approach is workable.
>> Once resizable containers are available, one more piece is needed to let
>> Drill use free resources. Some cluster-wide component must detect free
>> resources and offer them to applications that want them, deciding how to
>> divvy up the resources between, say, Drill and Impala. The same piece would
>> revoke resources when paying YARN customers need them.
>> Of course, if the resizable container feature come too late, or does not
>> work well, we still have the option of going off-YARN using the Llama
>> trick. But the Llama trick does nothing to do the cluster-wide coordination
>> discussed above.
>> So, the thought is: start simple with a “stock” YARN app. Then, we can add
>> bells and whistles as we gain experience and as YARN offers more
>> capabilities.
>> The nice thing about this approach is that the same idea plays well with
>> Mesos (though the implementation is different).
>> Thanks,
>> - Paul
>> [1] http://cloudera.github.io/llama/
>> [2] https://issues.apache.org/jira/browse/YARN-1197
>>> On Mar 24, 2016, at 2:34 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>> Your proposed allocation approach makes a lot of sense. I think it will
>>> solve a large number of use cases. Thanks for giving an overview of the
>>> different frameworks. I wonder if they got too focused on the simple use
>>> case....
>>> Have you looked at LLama to see whether we could extend it for our needs?
>>> Its Apache licensed and probably has at least a start at a bunch of
>> things
>>> we're trying to do.
>>> https://github.com/cloudera/llama
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> On Tue, Mar 22, 2016 at 7:42 PM, Paul Rogers <progers@maprtech.com>
>> wrote:
>>>> Hi Jacques,
>>>> I’m thinking of “semi-static” allocation at first. Spin up a cluster
>>>> Drill-bits, after which the user can add or remove nodes while the
>> cluster
>>>> runs. (The add part is easy, the remove part is a bit tricky since we
>> don’t
>>>> yet have a way to gracefully shut down a Drill-bit.) Once we get the
>> basics
>>>> to work, we can incrementally try out dynamics. For example, someone
>> could
>>>> whip up a script to look at load and use the proposed YARN client app to
>>>> adjust resources. Later, we can fold dynamic load management into the
>>>> solution once we’re sure what folks want.
>>>> I did look at Slider, Twill, Kitten and REEF. Kitten is too basic. I had
>>>> great hope for Slider. But, it turns out that Slider and Weave have each
>>>> built an elaborate framework to isolate us from YARN. The Slider
>> framework
>>>> (written in Python) seems harder to understand than YARN itself. At
>> least,
>>>> one has to be an expert in YARN to understand what all that Python code
>>>> does. And, just looking at the class count in the Twill Javadoc was
>>>> overwhelming. Slider and Twill have to solve the general case. If we
>> build
>>>> our own Java solution, we only have to solve the Drill case, which is
>>>> likely much simpler.
>>>> A bespoke solution would seem to offer some other advantages. It lets us
>>>> do things like integrate ZK monitoring so we can learn of zombie drill
>> bits
>>>> (haven’t exited, but not sending heartbeat messages.) We can also gather
>>>> metrics and historical data about the cluster as a whole. We can try out
>>>> different cluster topologies. (Run Drill-bits on x of y nodes on a rack,
>>>> say.) And, we can eventually do the dynamic load management we discussed
>>>> earlier.
>>>> But first, I look forward to hearing what others have tried and what
>> we’ve
>>>> learned about how people want to use Drill in a production YARN cluster.
>>>> Thanks,
>>>> - Paul
>>>>> On Mar 22, 2016, at 5:45 PM, Jacques Nadeau <jacques@dremio.com>
>> wrote:
>>>>> This is great news, welcome!
>>>>> What are you thinking in regards to static versus dynamic resource
>>>>> allocation? We have some conversations going regarding workload
>>>> management
>>>>> but they are still early so it seems like starting with user-controlled
>>>>> allocation makes sense initially.
>>>>> Also, have you spent much time evaluating whether one of the existing
>>>> YARN
>>>>> frameworks such as Slider would be useful? Does anyone on the list have
>>>> any
>>>>> feedback on the relative merits of these technologies?
>>>>> Again, glad to see someone picking this up.
>>>>> Jacques
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>> On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers <progers@maprtech.com>
>>>> wrote:
>>>>>> Hi All,
>>>>>> I’m a new member of the Drill Team here at MapR. We’d like to take a look
>>>>>> at running Drill on YARN for production customers. JIRA suggests
>>>> early
>>>>>> work may have been done (DRILL-142 <
>>>>>> https://issues.apache.org/jira/browse/DRILL-142>, DRILL-1170 <
>>>>>> https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675 <
>>>>>> https://issues.apache.org/jira/browse/DRILL-3675>).
>>>>>> YARN is a complex beast and the Drill community is large and growing.
>>>> So,
>>>>>> a good place to start is to ask if anyone has already done work on
>>>>>> integrating Drill with YARN (see DRILL-142)?  Or has thought about
>> what
>>>>>> might be needed?
>>>>>> DRILL-1170 (YARN support for Drill) seems a good place to gather
>>>>>> requirements, designs and so on. I’ve posted a “starter set”
>>>>>> requirements to spur discussion.
>>>>>> Thanks,
>>>>>> - Paul
