atlas-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <>
Subject [jira] [Updated] (ATLAS-616) Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large scale.
Date Thu, 14 Apr 2016 10:51:25 GMT


Hemanth Yamijala updated ATLAS-616:
    Attachment: heap.png

An update:

As described above, all indications as to the cause of the problem pointed towards the weak
references that were holding on to the GremlinGroovy script bindings. From what I could see in
the code, there are no knobs to adjust or tune this behavior in the version of the library we
are using.

As a next step, I tried to see whether GC settings could be tuned to accomplish this, and
came across a link which pointed to a GC config {{-XX:SoftRefLRUPolicyMSPerMB=<value>}}.
Likewise, the Sun JDK documentation says:

bq. -XX:SoftRefLRUPolicyMSPerMB=0 This flag enables aggressive processing of software references.
Use this flag if the software reference count has an impact on the Java HotSpot VM garbage collector.
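
As a rough illustration of why this flag matters on a large heap: HotSpot is commonly described as keeping a softly reachable object alive for about (free heap in MB) x (SoftRefLRUPolicyMSPerMB) milliseconds after its last access, with a default of 1000 ms per MB. The sketch below only does that arithmetic for the default and for the values 0 and 100. The helper name {{softref_retention_ms}} and the 8 GB free-heap figure are assumptions made up for illustration, and the collector's real accounting is more involved than a single multiplication.

```shell
# Hypothetical helper (not part of Atlas or the JDK): approximate how long
# an unused soft reference survives before the collector may clear it.
#   $1 = free heap in MB (an assumed figure, not a measurement)
#   $2 = value of -XX:SoftRefLRUPolicyMSPerMB
softref_retention_ms() {
    echo $(( $1 * $2 ))
}

# Assumption: roughly 8 GB of the 10 GB heap is free after a collection.
free_mb=8192

default_ms=$(softref_retention_ms "$free_mb" 1000)  # JDK default of 1000 ms per MB
tuned_ms=$(softref_retention_ms "$free_mb" 100)     # setting of 100
zero_ms=$(softref_retention_ms "$free_mb" 0)        # setting of 0

echo "default: $(( default_ms / 60000 )) minutes"
echo "100:     $(( tuned_ms / 60000 )) minutes"
echo "0:       ${zero_ms} ms"
```

Under these assumed numbers the default retention is on the order of hours, so soft references created by a short, heavy query run would tend to accumulate until the heap is nearly full, whereas 0 makes them eligible for clearing at the next collection.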

Given the above hints, I ran a test with this setting at 0 and again at 100. In both cases,
GC performance improved dramatically, and I was able to increase the number of tests while
still getting linear performance. [~ssainath] helped me run these tests in a server environment
(still with JDK 7) and got similar results. The attached graph is from a server environment
running a total of 3600 queries. We even tested up to 7200 queries. Each run scaled linearly
with time, and the logs showed no concurrency issues. The GC patterns are stable, as can be
seen in the attached graph.

We are going to test on OpenJDK 8 as well to see what the impact is, and if things go fine,
I can put up a patch that just suggests the settings to enable on the server for such loads.

For reference, the GC settings I use are:
export ATLAS_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:MaxNewSize=3072m \
  -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
  -XX:MaxPermSize=512m -XX:PermSize=100M -Xmx10240m -Xms10240m \
  -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=dumps/atlas_server.hprof \
  -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m \
  -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps \
  -Dlog4j.configuration=atlas-log4j.xml"
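
A long option string like this is easy to mistype, so one possible approach is to list the flags one per line and join them, then check that the soft reference setting actually made it into the result. The sketch below uses only a subset of the flags above to stay short. The here-document layout and the check are illustrative assumptions, not something the Atlas scripts provide.

```shell
# Build ATLAS_OPTS from a list with one flag per line (subset of the
# settings above, for brevity). Blank lines in the list are skipped.
ATLAS_OPTS=""
while read -r flag; do
    if [ -n "$flag" ]; then
        # Append with a single space separator, except for the first flag.
        ATLAS_OPTS="${ATLAS_OPTS:+$ATLAS_OPTS }$flag"
    fi
done <<'EOF'
-server
-XX:SoftRefLRUPolicyMSPerMB=0
-XX:+UseConcMarkSweepGC
-Xmx10240m
-Xms10240m
EOF
export ATLAS_OPTS

# Sanity check: warn if the soft reference policy flag is absent.
case " $ATLAS_OPTS " in
    *" -XX:SoftRefLRUPolicyMSPerMB="*) echo "soft reference policy is set" ;;
    *) echo "soft reference policy is missing" >&2 ;;
esac
```

Keeping one flag per line also makes it simpler to drop the PermGen options later when moving to a JDK that no longer accepts them.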

In addition to this effort, I also plan to write to the TinkerPop mailing list to see if they
have any suggestions for tuning this or fixing it in code.

> Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large scale.
> -------------------------------------------------------------------------------------
>                 Key: ATLAS-616
>                 URL:
>             Project: Atlas
>          Issue Type: Bug
>         Environment: Atlas with External kafka / HBase / Solr
> The test is run on cluster setup.
> Machine 1 - Atlas , Solr
> Machine 2 - Kafka , HBase
> Machine 3 - Hive , client
>            Reporter: Sharmadha Sainath
>            Assignee: Hemanth Yamijala
>         Attachments: baseline-1000-3600-10g-heap.png, heap.png, no-dsl-1000-14400-10g-heap.png,
> The test plan is to simulate 'n' users firing 'm' queries at Atlas simultaneously.
> This is accomplished with the help of Apache JMeter.
> Atlas is populated with 10,000 tables. 
> • 6000 small sized tables (10 columns)
> • 3000 medium sized tables (50 columns)
> • 1000 large sized tables (100 columns)
> The test plan consists of 30 users firing a set of 3 queries continuously, 20 times in a loop.
> Added -Xmx10240m -XX:MaxPermSize=512m to ATLAS_OPTS. Zookeeper throws exceptions
> when the test plan is run and JMeter starts firing queries.

This message was sent by Atlassian JIRA
