hadoop-hdfs-user mailing list archives

From Krzysztof Zarzycki <k.zarzy...@gmail.com>
Subject Re: YARN timelineserver process taking 600% CPU
Date Tue, 08 Dec 2015 07:10:05 GMT
Hi, thanks for joining in. I was working with the defaults, which in HDP
2.3 means:
yarn.timeline-service.ttl-ms: 2678400000  (31 days)
yarn.timeline-service.client.retry-interval-ms: 300000
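
(In yarn-site.xml form, these are the same values as above, roughly:)

<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>2678400000</value>
</property>
<property>
  <name>yarn.timeline-service.client.retry-interval-ms</name>
  <value>300000</value>
</property>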

But I worked around the problem!
I suspected that Hortonworks had decided to include this yet-unfinished
patch, https://issues.apache.org/jira/browse/SPARK-1537, in their HDP 2.3
distribution. My suspicion came from this thread:
http://markmail.org/message/w2z2foygzizlvnm4, as well as from this setting
in Spark in my HDP distribution:
spark.yarn.services: org.apache.spark.deploy.yarn.history.YarnHistoryService

I decided to disable the service. I didn't see a way to disable it cleanly,
so I just pointed it at some non-existent class and ignore the warning on
Spark job start. Then I restarted my Spark jobs, and the load is gone! Well,
the Spark history in YARN is gone too :/ But it seems this feature is just
not production-ready yet, or at least badly configured or something.
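
For anyone who wants to reproduce the workaround, this is roughly the
override I put in spark-defaults.conf (the class name is only a dummy
placeholder, any class that does not exist will do; adjust to wherever
your distribution sets this property):

spark.yarn.services  some.nonexistent.DisabledYarnHistoryService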

Anyway, this is the workaround I found, for anyone who runs into similar
problems.

Thanks to all of you who tried to help me. If you have any opinions about it,
please share.
Cheers,
Krzysztof







2015-12-06 8:14 GMT+01:00 郭士伟 <guoshiwei@gmail.com>:

> It seems that the large leveldb size is what causes the problem. What is
> the value of the 'yarn.timeline-service.ttl-ms' config? Maybe it's not short
> enough, so there are too many entities in the timeline store.
> And by the way, it takes a long time (hours) when the ATS discards
> old entities, and it also blocks the other operations. The
> patch https://issues.apache.org/jira/browse/YARN-3448 is a great
> performance improvement. We just backported it and it works well.
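>
> If you want to shrink the TTL, a change like this in yarn-site.xml should
> do it (604800000 ms = 7 days, just an example value; pick whatever fits
> your retention needs):
>
> <property>
>   <name>yarn.timeline-service.ttl-ms</name>
>   <value>604800000</value>
> </property>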
>
> 2015-11-06 13:07 GMT+08:00 Naganarasimha G R (Naga) <
> garlanaganarasimha@huawei.com>:
>
>> Hi Krzysiek,
>>
>>
>>
>> *There are currently 8 Spark Streaming jobs constantly running, 3 of them
>> with a 1-second batch interval and 5 with a 10-second one. I believe these
>> are the jobs that publish to ATS. How I could check precisely what is
>> doing what, or how to get some logs about it, I don't know...*
>>
>> I am not sure about the applications being run, and since you have already
>> tried disabling the Spark History Server as the source of the PUTs to ATS,
>> it is hard to say what exactly is sending them. AFAIK the Spark history
>> server has not been integrated with ATS (SPARK-1537), so most probably it's
>> the applications themselves that are pumping in the data. I think you need
>> to check on the Spark side.
>>
>>
>> *2. Are 8 concurrent Spark Streaming jobs really that high a load for the
>> Timelineserver? I have just a small cluster; how are other, larger
>> companies handling much larger loads?*
>>
>> It's not been used at large scale by us, but according to YARN-2556 (ATS
>> Performance Test Tool), "On a 36 node cluster, this results in ~830
>> concurrent containers (e.g. maps), each firing 10KB of payload, 20 times."
>> The only thing different in your case is that the data in your system is
>> already overloaded, hence the cost of querying (which currently happens
>> during each insertion) is very high.
>>
>> Maybe folks from other companies who have used or supported ATSv1 can
>> speak to the ATSv1 scale better!
>>
>>
>> Regards,
>>
>> + Naga
>> ------------------------------
>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>> *Sent:* Thursday, November 05, 2015 19:51
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: YARN timelineserver process taking 600% CPU
>>
>> Thanks, Naga, for your input (I'm sorry for the late response, I was away
>> for some time).
>>
>> So you believe that Spark is actually doing the PUTs? There are currently
>> 8 Spark Streaming jobs constantly running, 3 of them with a 1-second batch
>> interval and 5 with a 10-second one. I believe these are the jobs that
>> publish to ATS. How I could check precisely what is doing what, or how to
>> get some logs about it, I don't know...
>> I thought maybe it was the Spark History Server doing the PUTs, but it
>> seems it is not, as I disabled it and the load hasn't gone down. So it
>> seems it is indeed the jobs themselves.
>>
>> Now I have the following problems:
>> 1. The most important: how can I at least *work around* this issue? Maybe
>> I can somehow disable Spark's use of the YARN timeline server? What are
>> the consequences? Is it only the history of finished Spark jobs not being
>> saved? If so, that doesn't hurt that much. Probably this is a question for
>> the Spark group...
>> 2. Are 8 concurrent Spark Streaming jobs really that high a load for the
>> Timelineserver? I have just a small cluster; how are other, larger
>> companies handling much larger loads?
>>
>> Thanks for helping me with this!
>> Krzysiek
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <
>> naganarasimha.gr@gmail.com>:
>>
>>> Hi Krzysiek,
>>> Oops, my mistake, 3 GB does seem to be on the higher side.
>>> From the jstack it seems there was no major activity other than puts;
>>> around 16 concurrent puts were happening, each of which tries to get the
>>> timeline entity and hence hits the native call.
>>>
>>> From the logs it seems a lot of ACL validations are happening, and from
>>> the URL it seems they are for PUT of entities.
>>> Approximately from 09:30:16 to 09:44:26 about 9213 checks happened (that
>>> is about 850 seconds, so 9213 / 850 is roughly 10.8), and if all of these
>>> are for puts then roughly 10 put calls/s are coming from the *spark*
>>> side. This, I feel, is not the right usage of ATS; can you check what is
>>> being published from Spark to ATS at this high rate?
>>>
>>> Besides, some improvements regarding the timeline metrics are available
>>> in trunk as part of YARN-3360, which could have been useful in analyzing
>>> your issue.
>>>
>>> + Naga
>>>
>>>
>>> On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k.zarzycki@gmail.com
>>> > wrote:
>>>
>>>> Hi Naga,
>>>> Sorry, but it's not 3 MB, but 3 GB in leveldb-timeline-store (du shows
>>>> numbers in kB). Does that seem reasonable as well?
>>>> There are now 26850 files in the leveldb-timeline-store directory; new
>>>> .sst files are generated every minute, and some are also being deleted.
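>>>> (I counted the files with something along the lines of:
>>>> $ ls /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb | wc -l )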
>>>>
>>>> I started the timeline server today to gather logs and jstack; it was
>>>> running for ~20 minutes. I attach a tar.bz2 archive with those logs.
>>>>
>>>> Thank you for helping me debug this.
>>>> Krzysiek
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <
>>>> naganarasimha.gr@gmail.com>:
>>>>
>>>>> Hi Krzysiek,
>>>>> It seems the size is around 3 MB, which looks fine.
>>>>> Could you try enabling debug logging and share the ATS/AHS logs, and
>>>>> also, if possible, the jstack output of the AHS process?
>>>>>
>>>>> + Naga
>>>>>
>>>>> On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <
>>>>> k.zarzycki@gmail.com> wrote:
>>>>>
>>>>>> Hi Naga,
>>>>>> I see the following size:
>>>>>> $ sudo du --max=1 /var/lib/hadoop/yarn/timeline
>>>>>> 36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
>>>>>> 3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
>>>>>> 3307812 /var/lib/hadoop/yarn/timeline
>>>>>>
>>>>>> The timeline service has been restarted multiple times while I was
>>>>>> looking into the issue with it, but it was installed about 2 months
>>>>>> ago. Just a few applications (1? 2?) have been started since its last
>>>>>> restart. The ResourceManager interface shows 261 entries.
>>>>>>
>>>>>> As shown in the yarn-site.xml that I attached, the variable you're
>>>>>> asking about has the following value:
>>>>>>
>>>>>> <property>
>>>>>>   <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
>>>>>>   <value>300000</value>
>>>>>> </property>
>>>>>>
>>>>>>
>>>>>> Ah, one more thing: when I looked with jstack to see what the process
>>>>>> is doing, I saw threads spending time in NATIVE code in the leveldbjni
>>>>>> library. So I *think* it is related to the leveldb store.
>>>>>>
>>>>>> Please ask if any more information is needed.
>>>>>> Any help is appreciated! Thanks
>>>>>> Krzysiek
>>>>>>
>>>>>> 2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <
>>>>>> garlanaganarasimha@huawei.com>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What's the size of the store files?
>>>>>>> Since when has it been running? How many applications have been run
>>>>>>> since it was started?
>>>>>>> What's the value of
>>>>>>> "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?
>>>>>>>
>>>>>>> + Naga
>>>>>>> ------------------------------
>>>>>>> *From:* Krzysztof Zarzycki [k.zarzycki@gmail.com]
>>>>>>> *Sent:* Wednesday, September 30, 2015 19:20
>>>>>>> *To:* user@hadoop.apache.org
>>>>>>> *Subject:* YARN timelineserver process taking 600% CPU
>>>>>>>
>>>>>>> Hi there Hadoopers,
>>>>>>> I have a serious issue with my installation of Hadoop & YARN in
>>>>>>> version 2.7.1 (HDP 2.3).
>>>>>>> The timelineserver process (more precisely, the
>>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>> class) takes over 600% of CPU, generating an enormous load on my
>>>>>>> master node. I can't figure out why it happens.
>>>>>>>
>>>>>>> First, I was running the timelineserver with Java 8 and thought that
>>>>>>> might be the issue. But no, I have now started the timelineserver
>>>>>>> with Java 7 and the problem is still the same.
>>>>>>>
>>>>>>> My cluster is tiny- it consists of:
>>>>>>> - 2 HDFS nodes
>>>>>>> - 2 HBase RegionServers
>>>>>>> - 2 Kafkas
>>>>>>> - 2 Spark nodes
>>>>>>> - 8 Spark Streaming jobs, processing around 100 messages/second
>>>>>>> TOTAL.
>>>>>>>
>>>>>>> I'll be very grateful for your help here. If you need any more
>>>>>>> info, please write.
>>>>>>> I also attach my yarn-site.xml, grepped down to the options related
>>>>>>> to the timeline server.
>>>>>>>
>>>>>>> And here is the timeline server command line as seen from ps:
>>>>>>> /usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m
>>>>>>> -Dhdp.version=2.3.0.0-2557 -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir=
>>>>>>> -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,EWMA,RFA
>>>>>>> -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -Dyarn.policy.file=hadoop-policy.xml
>>>>>>> -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
>>>>>>> -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
>>>>>>> -Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver
>>>>>>> -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
>>>>>>> -Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA
>>>>>>> -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
>>>>>>> -classpath
>>>>>>> /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
>>>>>>> org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Krzysztof
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
