hadoop-user mailing list archives

From "Naganarasimha G R (Naga)" <garlanaganarasi...@huawei.com>
Subject RE: YARN timelineserver process taking 600% CPU
Date Fri, 06 Nov 2015 05:07:28 GMT
Hi Krzysiek,



There are currently 8 Spark Streaming jobs constantly running, 3 of them with a 1-second batch interval and 5 with a 10-second interval. I believe these are the jobs that publish to ATS. How could I check what precisely is doing what, or how to get some logs about it? I don't know...

I'm not sure which applications are being run, but since you already tried disabling the Spark History Server and the puts to ATS continued, the History Server is not the source. AFAIK the Spark History Server has not been integrated with ATS (SPARK-1537). So most probably it is the applications themselves that are pumping in the data. I think you need to check on their side.
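One simple way to check from the application side is to pull the logs of one of the streaming jobs and search for timeline-client activity. This is only a rough sketch; what (if anything) the client logs depends on the integration, so the grep pattern is a guess:

# list apps, then search the aggregated logs of a finished one for timeline
# traffic (yarn logs needs the app to have completed and log aggregation on)
yarn application -list -appStates RUNNING,FINISHED
yarn logs -applicationId <application_id> | grep -iE "timeline|ATS"

For a still-running job you would have to look at the container logs on the NodeManager instead.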


2. Are 8 concurrent Spark Streaming jobs really that high a load for the timeline server? I have just a small cluster; how are larger companies handling much bigger loads?

We have not used it at large scale ourselves, but according to YARN-2556 (ATS Performance Test Tool): "On a 36 node cluster, this results in ~830 concurrent containers (e.g maps), each firing 10KB of payload, 20 times." The difference in your case is that your store is already overloaded with data, so the cost of querying (which currently happens during each insertion) is very high.

Maybe folks from other companies who have used or supported ATS v1 can speak to its scale better!


Regards,

+ Naga

________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Thursday, November 05, 2015 19:51
To: user@hadoop.apache.org
Subject: Re: YARN timelineserver process taking 600% CPU

Thanks, Naga, for your input (I'm sorry for the late response, I was out for some time).

So you believe that Spark is actually doing the PUTs? There are currently 8 Spark Streaming jobs constantly running, 3 of them with a 1-second batch interval and 5 with a 10-second interval. I believe these are the jobs that publish to ATS. How could I check what precisely is doing what, or how to get some logs about it? I don't know...
I thought maybe it was the Spark History Server doing the puts, but it seems it is not, as I disabled it and the load hasn't gone down. So it seems it is indeed the jobs themselves.

Now I have the following problems:
1. Most important: how can I at least work around this issue? Maybe I can somehow disable Spark's usage of the YARN timeline server (see the spark-defaults sketch below)? What are the consequences? Is it only the history of finished Spark jobs not being saved? If so, that doesn't hurt that much. This is probably a question for the Spark group...
2. Are 8 concurrent Spark Streaming jobs really that high a load for the timeline server? I have just a small cluster; how are larger companies handling much bigger loads?
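(My working guess, based on the HDP Spark/ATS integration work around SPARK-1537, is that the publisher is registered in spark-defaults.conf. The property name below comes from that integration and is an assumption on my part, not something I have confirmed:)

# spark-defaults.conf -- hypothetical sketch: if the HDP ATS integration is
# what registers the timeline publisher, commenting this line out should
# stop the per-job puts
# spark.yarn.services    org.apache.spark.deploy.yarn.history.YarnHistoryService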

Thanks for helping me with this!
Krzysiek

2015-10-05 20:45 GMT+02:00 Naganarasimha Garla <naganarasimha.gr@gmail.com>:
Hi Krzysiek,
Oops, my mistake: 3 GB does seem to be on the higher side.
From the jstack it looks like there was no major activity other than puts; around 16 concurrent puts were happening, each of which tries to get the existing timeline entity and hence hits the native LevelDB call.

From the logs it seems a lot of ACL validations are happening, and from the URL it seems they are for putEntities.
From approximately 09:30:16 to 09:44:26 (~850 seconds), about 9213 checks happened; if all of these are for puts, that is roughly 10-11 put calls/s coming from the Spark side. This, I feel, is not the right usage of ATS. Can you check what is being published from Spark to ATS at this high rate?
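If you want to chart the rate yourself, here is a rough one-liner over the timeline server log (assuming the standard log4j "yyyy-MM-dd HH:mm:ss" timestamp prefix, and that the request URL shows up in the logged lines, as it appears to in yours):

# count logged PUT lines per minute; the first 16 chars of each line are the
# timestamp truncated to the minute
grep "PUT" /var/log/hadoop-yarn/yarn/yarn-yarn-timelineserver-hd-master-a01.log | cut -c1-16 | sort | uniq -c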

Besides, some improvements to the timeline metrics are available in trunk as part of YARN-3360, which could be useful in analyzing your issue.

+ Naga


On Mon, Oct 5, 2015 at 1:19 PM, Krzysztof Zarzycki <k.zarzycki@gmail.com> wrote:
Hi Naga,
Sorry, but it's not 3 MB, it's 3 GB in leveldb-timeline-store (du shows numbers in kB). Does that seem reasonable as well?
There are now 26850 files in the leveldb-timeline-store directory; new .sst files are generated every minute, and some are also being deleted.
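(From what I understand, .sst files being created and deleted continuously is normal LevelDB compaction behavior, so the churn itself may be expected. A trivial way to put a number on it, using the path from my earlier du output:)

# snapshot the file count once a minute to watch the create/delete churn
watch -n 60 'ls /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb | wc -l'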

I started the timeline server today to gather logs and a jstack dump; it ran for ~20 minutes.
I attach a tar.bz2 archive with those logs.

Thank you for helping me debug this.
Krzysiek

2015-09-30 21:00 GMT+02:00 Naganarasimha Garla <naganarasimha.gr@gmail.com>:
Hi Krzysiek,
The size seems to be around 3 MB, which seems fine.
Could you try enabling debug logging and share the ATS/AHS logs, and also, if possible, the jstack output of the AHS process?
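Roughly like this; the environment variable and script names assume the stock Hadoop start scripts, so adjust to however your installation launches the AHS:

# restart the AHS with debug logging (assumption: the start script honors
# YARN_ROOT_LOGGER), then grab stack dumps while the CPU is high
export YARN_ROOT_LOGGER=DEBUG,RFA
yarn-daemon.sh stop timelineserver && yarn-daemon.sh start timelineserver
jstack $(pgrep -f ApplicationHistoryServer) > ahs-jstack-$(date +%s).txt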

+ Naga

On Wed, Sep 30, 2015 at 10:27 PM, Krzysztof Zarzycki <k.zarzycki@gmail.com> wrote:
Hi Naga,
I see the following size:
$ sudo du --max=1 /var/lib/hadoop/yarn/timeline
36      /var/lib/hadoop/yarn/timeline/timeline-state-store.ldb
3307772 /var/lib/hadoop/yarn/timeline/leveldb-timeline-store.ldb
3307812 /var/lib/hadoop/yarn/timeline

The timeline service has been restarted multiple times while I was looking into the issue, but it was installed about 2 months ago. Just a few applications (1? 2?) have been started since its last restart. The ResourceManager interface shows 261 entries.

As in the yarn-site.xml that I attached, the property you're asking about has the following value:

<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value>
</property>
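(If I understand the properties correctly, the interval above only controls how often TTL eviction runs; how long entities actually live is a separate property with a 7-day default:)

<!-- for reference: entity lifetime, default 604800000 ms = 7 days; the
     ttl-interval-ms above is just how often eviction runs -->
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>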

Ah, one more thing: when I looked with jstack to see what the process is doing, I saw threads spending their time in NATIVE code in the leveldbjni library. So I *think* it is related to the leveldb store.

Please ask if any more information is needed.
Any help is appreciated! Thanks
Krzysiek

2015-09-30 16:23 GMT+02:00 Naganarasimha G R (Naga) <garlanaganarasimha@huawei.com>:
Hi,

What's the size of the store files?
Since when has it been running? How many applications have been run since it was started?
What's the value of "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms"?

+ Naga
________________________________
From: Krzysztof Zarzycki [k.zarzycki@gmail.com]
Sent: Wednesday, September 30, 2015 19:20
To: user@hadoop.apache.org
Subject: YARN timelineserver process taking 600% CPU

Hi there Hadoopers,
I have a serious issue with my installation of Hadoop & YARN version 2.7.1 (HDP 2.3).
The timelineserver process (more precisely, the org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer class) takes over 600% CPU, generating an enormous load on my master node. I can't figure out why this happens.

At first I was running the timelineserver on Java 8 and thought that was the issue. But no: I have now started the timelineserver on Java 7 and the problem is still the same.

My cluster is tiny; it consists of:
- 2 HDFS nodes
- 2 HBase RegionServers
- 2 Kafkas
- 2 Spark nodes
- 8 Spark Streaming jobs, processing around 100 messages/second TOTAL.

I'll be very grateful for your help here. If you need any more info, please write.
I also attach my yarn-site.xml, grepped down to the options related to the timeline server.

And here is the timelineserver command line as I see it in ps:
/usr/java/jdk1.7.0_79/bin/java -Dproc_timelineserver -Xmx1024m -Dhdp.version=2.3.0.0-2557
-Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn -Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log
-Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.home.dir= -Dyarn.id.str=yarn
-Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
-Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/var/log/hadoop-yarn/yarn -Dyarn.log.dir=/var/log/hadoop-yarn/yarn
-Dhadoop.log.file=yarn-yarn-timelineserver-hd-master-a01.log -Dyarn.log.file=yarn-yarn-timelineserver-hd-master-a01.log
-Dyarn.home.dir=/usr/hdp/current/hadoop-yarn-timelineserver -Dhadoop.home.dir=/usr/hdp/2.3.0.0-2557/hadoop
-Dhadoop.root.logger=INFO,EWMA,RFA -Dyarn.root.logger=INFO,EWMA,RFA -Djava.library.path=:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native:/usr/hdp/2.3.0.0-2557/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2557/hadoop/lib/native
-classpath /usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hadoop-client/conf:/usr/hdp/2.3.0.0-2557/hadoop/lib/*:/usr/hdp/2.3.0.0-2557/hadoop/.//*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/./:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-hdfs/.//*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-yarn/.//*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/lib/*:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/.//*:::/usr/share/java/mysql-connector-java.jar::/usr/share/java/mysql-connector-java.jar:/usr/hdp/current/hadoop-yarn-timelineserver/.//*:/usr/hdp/current/hadoop-yarn-timelineserver/lib/*:/usr/hdp/current/hadoop-client/conf/timelineserver-config/log4j.properties
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer


Thanks!
Krzysztof
