mesos-user mailing list archives

From Brian Topping <brian.topp...@gmail.com>
Subject Re: Debugging hadoop-mesos
Date Fri, 08 May 2015 08:06:24 GMT
Indeed, this was all that was left to get jobs working, thanks!

The last thing I need to do for initial setup is get rid of the thousands of these messages, arriving at about
three or four per second. I'm running against 2.6.0-mr1-cdh5.4.0, so maybe there was a change
to the API semantics.

> 2015-05-08 03:33:24,421 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:24,724 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,028 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,331 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,636 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:25,940 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:26,243 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:26,546 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:26,850 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:27,153 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 2015-05-08 03:33:27,456 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
> 
> On May 8, 2015, at 2:47 PM, haosdent <haosdent@gmail.com> wrote:
> 
> I think you could export HADOOP_LOG_DIR=/tmp to point the logs at a temporary directory, then try again.
> 
> On Fri, May 8, 2015 at 3:43 PM, Brian Topping <brian.topping@gmail.com> wrote:
> Mesos runs as root; Hadoop runs as a separate user.
> 
>> On May 8, 2015, at 2:41 PM, haosdent <haosdent@gmail.com> wrote:
>> 
>> Do you run everything as root?
>> 
>> On Fri, May 8, 2015 at 3:38 PM, haosdent <haosdent@gmail.com> wrote:
>> It seems you don't have permission for this directory:
>> 
>> java.io.IOException: Could not create job user log directory: file:/usr/lib/hadoop/logs/userlogs/job_201505080220_0001
>> 
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
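A minimal sketch of how one might confirm the permission theory from the Hadoop user's account. The class and method names (`LogDirCheck`, `nearestExisting`, `canCreate`) are hypothetical and not part of Hadoop or hadoop-mesos. The default path is taken from the error message above. The check only looks at the nearest existing ancestor, so it is a rough indicator rather than a guarantee.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LogDirCheck {
    // Walk up from the target until we reach a path that actually exists.
    static Path nearestExisting(Path dir) {
        Path p = dir.toAbsolutePath().normalize();
        while (p != null && !Files.exists(p)) {
            p = p.getParent();
        }
        return p;
    }

    // The current user can plausibly create the target if the nearest existing
    // ancestor is a directory that is both writable and traversable.
    static boolean canCreate(Path dir) {
        Path base = nearestExisting(dir);
        return base != null
                && Files.isDirectory(base)
                && Files.isWritable(base)
                && Files.isExecutable(base);
    }

    public static void main(String[] args) {
        // Default taken from the exception text; pass another path as the first argument.
        Path target = Paths.get(args.length > 0 ? args[0] : "/usr/lib/hadoop/logs/userlogs");
        System.out.println("user:             " + System.getProperty("user.name"));
        System.out.println("target:           " + target);
        System.out.println("nearest existing: " + nearestExisting(target));
        System.out.println("can create:       " + canCreate(target));
    }
}
```

Run it as the same user that the TaskTracker runs as; running it as root would report success even when the Hadoop user cannot write there.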
>> 
>> 
>> On Fri, May 8, 2015 at 3:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>> Thanks Haosdent, I've updated https://gist.github.com/briantopping/311960f8e5454dbe9aab with the output logs of what I am currently seeing. I've edited them for size; the message "INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060" appeared a few thousand times in the logs. The configuration I have is probably still broken; 50060 is a Jetty port that returns a Cloudera string when telnetting to it.
>> 
>> The errors I saw below were apparently the result of building against the older version of CDH. When I updated the hadoop-mesos POM to match my deployment version, the incorrectly calculated "slots" problem in my previous message was resolved.
>> 
>> My current problem is a Hadoop logging problem and has nothing to do with Mesos, so I didn't post. I changed hadoop.log.dir=/var/log/hadoop in /etc/hadoop/conf.pseudo.mr1/log4j.properties, but it didn't make any difference. Just getting back into it now.
>> 
>>> On May 8, 2015, at 1:56 PM, haosdent <haosdent@gmail.com> wrote:
>>> 
>>> Could you post the logs from the executors that run the jobtracker and tasktrackers? It would be helpful for finding the cause of this problem.
>>> 
>>> On Fri, May 8, 2015 at 3:05 AM, Brian Topping <brian.topping@gmail.com> wrote:
>>> I think there's something weird here:
>>>>   cpus: offered 2.0 needed at least 1.0
>>>>   mem : offered 1724.0 needed at least 1024.0
>>>>   disk: offered 44124.0 needed at least 1024.0
>>>>   ports:  at least 2 (sufficient)
>>> 
>>> Am I misreading this? All of the requirements seem to be met.
>>> 
>>> Presumably it's this code from o.a.h.mapred.ResourcePolicyVariable:
>>> 
>>>> int slots = mapSlotsMax + reduceSlotsMax;
>>>> slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
>>>> slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
>>>> slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
>>>> 
>>>> // Is this offer too small for even the minimum slots?
>>>> if (slots < 1) {
>>>>   return false;
>>>> }
>>> 
>>> Not exactly sure what this is doing.
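One way to read the quoted arithmetic: the slot count starts at the configured maximum and is capped, resource by resource, by how many per-slot units fit in the offer after subtracting the container overhead. The sketch below reuses that arithmetic in a standalone class. The offer values (2.0 cpus, 1724.0 mem, 44124.0 disk) come from the log above; the class name `SlotEstimate`, the maximum of 5 slots, and every container and per-slot value are assumptions chosen purely for illustration, not the real hadoop-mesos configuration.

```java
class SlotEstimate {
    // Same arithmetic as the quoted snippet: start from the configured maximum
    // and cap it by what each resource can hold after the container overhead.
    static int slots(int maxSlots,
                     double cpus, double mem, double disk,
                     double containerCpus, double containerMem, double containerDisk,
                     double slotCpus, double slotMem, double slotDisk) {
        int slots = maxSlots;
        slots = (int) Math.min(slots, (cpus - containerCpus) / slotCpus);
        slots = (int) Math.min(slots, (mem - containerMem) / slotMem);
        slots = (int) Math.min(slots, (disk - containerDisk) / slotDisk);
        return slots;
    }

    public static void main(String[] args) {
        // Offer values are from the log; the overhead and per-slot sizes are hypothetical.
        int n = slots(5,
                2.0, 1724.0, 44124.0,   // offered cpus, mem, disk
                1.0, 1024.0, 1024.0,    // assumed container overhead
                1.0, 1024.0, 1024.0);   // assumed per-slot requirement
        // cpus: (2.0 - 1.0) / 1.0 = 1.0      -> 1
        // mem:  (1724 - 1024) / 1024 ~= 0.68 -> truncates to 0
        // disk: min(0, ~42.09)               -> 0
        System.out.println("slots = " + n + (n < 1 ? " (offer declined)" : ""));
    }
}
```

Under these assumed numbers the memory term truncates to zero, so an offer can exceed each "needed at least" figure and still be declined. This only illustrates the arithmetic; a later message in the thread attributes the actual miscalculation to building against an older CDH version.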
>>> 
>>> Sorry for the noise.
>>> 
>>>> 
>>>> On May 7, 2015, at 6:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>> 
>>>> Presumably https://gist.github.com/briantopping/311960f8e5454dbe9aab has some more information necessary at this point... sorry for the omission..
>>>> 
>>>>> On May 7, 2015, at 6:05 PM, Tom Arnfeld <tom@duedil.com> wrote:
>>>>> 
>>>>> Hi Brian,
>>>>> 
>>>>> At this point you should see the TT attempting to be launched via Mesos. The "launched but not heartbeat yet" count tells us that the framework has accepted resources for 4 slots but the TT hasn't actually come up yet.
>>>>> 
>>>>> Do you see the task in your Mesos cluster UI, and is there anything interesting in the task logs?
>>>>> 
>>>>> --
>>>>> 
>>>>> Tom Arnfeld
>>>>> Developer // DueDil
>>>>> 
>>>>> (+44) 7525940046
>>>>> 25 Christopher Street, London, EC2A 2BS
>>>>> 
>>>>> 
>>>>> On Thu, May 7, 2015 at 12:01 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>>> 
>>>>> Thanks guys, this was helpful. I started the job tracker as a service, but apparently I never started the task tracker (or it failed to start and I didn't notice). I started it after Haosdent's message, but wasn't able to see any difference and I kept poking around.
>>>>> 
>>>>> After I made some changes and the VM wouldn't boot, my OCD got the better of me and I reinstalled everything from scratch. There are just too many moving parts to hassle you guys with an imperfect install on my end.
>>>>> 
>>>>> This time through, I felt a lot more confident to use the Mesosphere RPMs, but I couldn't find the best way to get things launched. https://docs.mesosphere.com/reference/packages/ has a Last-Modified of Fri, 01 May 2015 18:46:10 GMT (one week ago), but the RHEL 6 RPMs don't have any init.d service descriptions as the packages page would indicate. For now, I just launched them manually, but would like to get the machine to completely load on boot as services.
>>>>> 
>>>>> At this point, I have tested Mesos with:
>>>>> 
>>>>> 	mesos-execute --master="localhost:5050" --name="test-exec" --command="sleep 10"
>>>>> 
>>>>> The only problem there is it seems that "localhost" isn't good enough for my install, it needs to be the FQDN, but it works and the job flows through the UI.
>>>>> 
>>>>> Now, back to a hadoop job. When I try the job now, the logs show the following stream of repeated messages:
>>>>> 
>>>>>> 2015-05-07 17:52:53,124 INFO org.apache.hadoop.mapred.ResourcePolicy: Satisfied map and reduce slots needed.
>>>>>> 2015-05-07 17:52:53,340 INFO org.apache.hadoop.mapred.MesosScheduler: Unknown/exited TaskTracker: http://10.211.55.16:50060.
>>>>>> [Repeated a few times a second for five seconds]
>>>>>> 2015-05-07 17:49:08,914 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
>>>>>>       Pending Map Tasks: 4
>>>>>>    Pending Reduce Tasks: 1
>>>>>>       Running Map Tasks: 0
>>>>>>    Running Reduce Tasks: 0
>>>>>>          Idle Map Slots: 0
>>>>>>       Idle Reduce Slots: 0
>>>>>>      Inactive Map Slots: 4 (launched but no hearbeat yet)
>>>>>>   Inactive Reduce Slots: 1 (launched but no hearbeat yet)
>>>>>>        Needed Map Slots: 0
>>>>>>     Needed Reduce Slots: 0
>>>>>>      Unhealthy Trackers: 0
>>>>> 
>>>>> This looks close.
>>>>> 
>>>>> What's the best way to get a JDWP port set up to break in this code (i.e. learning to fish...)?
>>>>> 
>>>>> best, Brian
>>>>> 
>>>>> 
>>>>>> On May 7, 2015, at 12:11 PM, Adam Bordelon <adam@mesosphere.io> wrote:
>>>>>> 
>>>>>> From the mesos-master log and the JT log, it doesn't look like the MesosScheduler ever registered with Mesos, which should mean that it wouldn't start any TTs or map/reduce tasks. However, your `ps` output does seem to show a tasktracker running. Did you start that yourself (or automatically as a system service)?
>>>>>> 
>>>>>> On Wed, May 6, 2015 at 9:32 AM, haosdent <haosdent@gmail.com> wrote:
>>>>>> Did you start the tasktracker successfully?
>>>>>> 
>>>>>> On Wed, May 6, 2015 at 11:32 PM, Brian Topping <brian.topping@gmail.com> wrote:
>>>>>> Hi all, I'm happy to report that I'm very close to getting 2.6.0-cdh5.4.0 integrated against Mesos 0.22.1 with the hadoop-mesos 0.10 code on Github. Hoping someone might have a few minutes to parse what I've got here and suggest something to try.
>>>>>> 
>>>>>> https://gist.github.com/briantopping/0dfd0777ff4ce5a81219 hopefully has all the data necessary between the console output of the client run, the mesos master and slave console, the XML configuration of the JT and the output that was generated by it. Please let me know if I've left something out.
>>>>>> 
>>>>>> I iterated a few times getting all the errors from missing paths or libraries sorted out, but the example client ultimately just sits waiting forever at "map 0% reduce 0%".
>>>>>> 
>>>>>> Any input kindly appreciated!
>>>>>> 
>>>>>> Brian
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best Regards,
>>>>>> Haosdent Huang
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Best Regards,
>>> Haosdent Huang
>> 
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
>> 
>> 
>> 
>> --
>> Best Regards,
>> Haosdent Huang
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang

