mesos-user mailing list archives

From: Justin Ryan <jur...@ziprealty.com>
Subject: Re: Disappearing tasks
Date: Wed, 13 Apr 2016 19:29:35 GMT
Hiya, coming back to this thread after having to focus on some other things (and facing some
issues I brought up in another thread).

I reconfigured this cluster with work_dir set to /var/mesos and am logging the output of ‘mesos
ps’ (from the python mesos.cli package) in a loop to try to catch the next occurrence.
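
For reference, a loop along these lines would do the job (a minimal sketch, not necessarily the exact script on this cluster; it assumes the mesos.cli package is installed so that ‘mesos ps’ works from the shell, and the log path and 60-second interval are just placeholders):

import subprocess
import time
from datetime import datetime

LOG_PATH = "/var/log/mesos-ps-poll.log"  # placeholder path
INTERVAL_SECONDS = 60                    # placeholder interval

while True:
    # 'mesos ps' comes from the python mesos.cli package mentioned above.
    result = subprocess.run(["mesos", "ps"], capture_output=True, text=True)
    with open(LOG_PATH, "a") as log:
        log.write("=== %s ===\n" % datetime.utcnow().isoformat())
        log.write(result.stdout)
        if result.returncode != 0:
            log.write(result.stderr)
    time.sleep(INTERVAL_SECONDS)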

Still, what seems most interesting to me is that the “Running” count remembers the
lost processes.  Even now, having launched 3 new instances of flume from marathon, the
running count is 6.  The Killed count shows recently killed tasks, but it was at 0 earlier, when I
had 3 processes running which mesos had lost.


From: Greg Mann <greg@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Wednesday, April 6, 2016 at 4:24 PM
To: user <user@mesos.apache.org>
Subject: Re: Disappearing tasks

Hi Justin,
I'm sorry that you've been having difficulty with your cluster. Do you have access to master/agent
logs around the time that these tasks went missing from the Mesos UI? It would be great to
have a look at those if possible.

I would still recommend against setting the agent work_dir to '/tmp/mesos' for a long-running
cluster scenario - this location is really only suitable for local, short-term testing purposes.
We currently have a patch in flight to update our docs to clarify this point. Even though
the work_dir appeared to be intact when you checked it, it's possible that some of the agent's
checkpoint data had been deleted. Could you try changing the work_dir for your agents to see
if that helps?
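
As a quick sanity check, something along these lines could tell you whether an agent still has checkpointed state under its work_dir (a rough sketch, not an official tool; the meta/slaves/latest layout is from memory of recent agent versions and may differ on yours):

import os

WORK_DIR = "/tmp/mesos"  # whatever the agent's work_dir is set to

# The agent checkpoints its recovery state under <work_dir>/meta; if that
# has been cleaned up (e.g. by a tmp reaper), the agent cannot recover tasks.
latest = os.path.join(WORK_DIR, "meta", "slaves", "latest")

if not os.path.exists(latest):
    print("no checkpointed agent state under %s" % latest)
else:
    frameworks_dir = os.path.join(os.path.realpath(latest), "frameworks")
    if not os.path.isdir(frameworks_dir):
        print("agent state exists, but no framework checkpoints found")
    else:
        for framework_id in sorted(os.listdir(frameworks_dir)):
            print("checkpointed framework: %s" % framework_id)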

Cheers,
Greg


On Wed, Apr 6, 2016 at 11:27 AM, Justin Ryan <juryan@ziprealty.com> wrote:
Thanks Rik – interesting theory; I had considered that it might have some connection to the
removal of sandbox files.

Sooo this morning I had all of my kafka brokers disappear again, and checked this on a node
that is definitely still running kafka.  All of /tmp/mesos, including what appear to be the
sandbox and logs of the running process, is still there, and the “running” count this
time is actually higher than I’d expect.  I had 9 kafka brokers and 3 flume processes running,
so I’d expect 12, but the running count currently says 15.

From: <rik.wasmus@takeaway.com> on behalf of Rik <rik@grib.nl>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, April 5, 2016 at 3:19 PM
To: "user@mesos.apache.org" <user@mesos.apache.org>
Subject: Re: Disappearing tasks

FWIW, the only time I've seen this happen here is when someone accidentally clears the work
dir (default=/tmp/mesos), which I would personally advise putting somewhere else, where rogue
people or processes are less likely to throw things away accidentally. Could it be that? Although...
the tasks showed up as 'lost' at that point, so it differs slightly (same general outcome, not entirely
the same symptoms).

On Tue, Apr 5, 2016 at 11:35 PM, Justin Ryan <juryan@ziprealty.com> wrote:
An interesting fact I left out: the count of “Running” tasks remains intact, while absolutely
no history remains in the dashboard.



From: Justin Ryan <juryan@ziprealty.com>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, April 5, 2016 at 12:29 PM
To: "user@mesos.apache.org" <user@mesos.apache.org>
Subject: Disappearing tasks

Hiya folks!

I’ve spent the past few weeks prototyping a new data cluster with Mesos, Kafka, and Flume
delivering data to HDFS which we plan to interact with via Spark.  In the prototype environment,
I had a fairly high volume of test data flowing for some weeks with little to no major issues
except for learning about tuning Kafka and Flume.

I’m launching kafka with the github.com/mesos/kafka project, and flume is run via marathon.
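
For context, the flume side is nothing exotic; an app definition roughly like the sketch below (placeholder Marathon URL, command, and resource numbers rather than our actual config) is the shape of what Marathon gets, whether through its UI or its /v2/apps REST endpoint:

import json
import requests  # assumes python-requests is available

MARATHON_URL = "http://marathon.example.com:8080"  # placeholder address

# Placeholder app definition; the real command, resources, and instance
# count on the cluster differ.
flume_app = {
    "id": "/flume",
    "cmd": "flume-ng agent -n agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf",
    "cpus": 1.0,
    "mem": 1024,
    "instances": 3,
}

response = requests.post(
    MARATHON_URL + "/v2/apps",
    data=json.dumps(flume_app),
    headers={"Content-Type": "application/json"},
)
print(response.status_code, response.text)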

Yesterday morning, I came in and my flume jobs had disappeared from the task list in Mesos,
though I found the actual processes still running when I searched the cluster’s ‘ps’ output.
 Later in the day, the same thing happened to my kafka brokers.  In some cases, the only way
I’ve found to recover from this is to shut everything down and clear the zookeeper data,
which would be fairly drastic if it happened in production, particularly if most of our
tasks / frameworks were fine and only one or two disappeared.

I’d appreciate any help sorting through this.  I’m using the latest Mesos and CDH5, installed
via community Chef cookbooks.


