Date: Thu, 14 Apr 2016 10:51:25 +0000 (UTC)
From: "Hemanth Yamijala (JIRA)"
To: dev@atlas.incubator.apache.org
Subject: [jira] [Updated] (ATLAS-616) Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large scale.

     [ https://issues.apache.org/jira/browse/ATLAS-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala updated ATLAS-616:
-----------------------------------
    Attachment: heap.png

An update:

As described above, all indications of the cause of the problem pointed towards the weak references holding on to the GremlinGroovy script bindings. From what I could see in the code, there are no knobs to adjust or tune this value in the version of the library we are using.

As a next step, I tried to see whether GC settings could be tuned to accomplish this, and ran across this link: http://stackoverflow.com/a/604395 which pointed to a GC config {{-XX:SoftRefLRUPolicyMSPerMB=}}. Likewise, the Sun JDK documentation (http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/java.html) says:

bq. -XX:SoftRefLRUPolicyMSPerMB=0 This flag enables aggressive processing of software references. Use this flag if the software reference count has an impact on the Java HotSpot VM garbage collector.

Given the above hints, I ran a test with this setting, set to 0 and also to 100. In both cases, GC performance improved dramatically, and I was able to increase the number of tests and still see the running time scale linearly.

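To make the effect of this flag concrete, here is a rough sketch of the retention arithmetic, based on how HotSpot's soft reference policy is generally documented: a softly reachable object that has not been accessed is kept for about SoftRefLRUPolicyMSPerMB milliseconds per MB of free heap, with a default of 1000. The helper name retention_ms and the 4096 MB of free heap are assumptions for illustration only, not figures measured on the Atlas server.

```shell
#!/bin/sh
# Estimate how long an unused soft reference may survive:
#   retention_ms = SoftRefLRUPolicyMSPerMB * free_heap_mb
# The free heap figure used below is hypothetical.

retention_ms() {
    ms_per_mb=$1      # value passed to -XX:SoftRefLRUPolicyMSPerMB
    free_heap_mb=$2   # free heap in MB (assumed, not measured)
    echo $(( ms_per_mb * free_heap_mb ))
}

# Compare the HotSpot default (1000) with the two values tried in the test.
for policy in 1000 100 0; do
    echo "SoftRefLRUPolicyMSPerMB=$policy -> $(retention_ms "$policy" 4096) ms"
done
```

Under these assumptions, a value of 0 gives a retention of 0 ms, so unused soft references become eligible for clearing at the next collection, which is consistent with the "aggressive processing" described in the JDK documentation quoted above.
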
[~ssainath] helped me run these tests in a server environment (still with JDK 7) and got similar results. The attached graph is from a server environment running a total of 3600 queries. We even tested up to 7200 queries. Each run scaled linearly with time, and the logs showed no concurrency issues. The GC patterns are stable, as the graph shows.

We are going to test on OpenJDK 8 as well to see what the impact is, and if things go fine, I can put up a patch that just suggests the settings to enable on the server for such loads.

For reference, the GC settings I use are:

{code}
export ATLAS_OPTS="-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:MaxNewSize=3072m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:MaxPermSize=512m -Djava.net.preferIPv4Stack=true -Xmx10240m -Xms10240m -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -XX:PermSize=100M -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dlog4j.configuration=atlas-log4j.xml"
{code}

In addition to this effort, I also plan to write to the Tinkerpop mailing list to see if they have any suggestions for tuning this or fixing it in code.

> Zookeeper throws exceptions when trying to fire DSL queries at Atlas at large scale.
> -------------------------------------------------------------------------------------
>
>                 Key: ATLAS-616
>                 URL: https://issues.apache.org/jira/browse/ATLAS-616
>             Project: Atlas
>          Issue Type: Bug
>         Environment: Atlas with External kafka / HBase / Solr
> The test is run on cluster setup.
> Machine 1 - Atlas, Solr
> Machine 2 - Kafka, HBase
> Machine 3 - Hive, client
>            Reporter: Sharmadha Sainath
>            Assignee: Hemanth Yamijala
>         Attachments: baseline-1000-3600-10g-heap.png, heap.png, no-dsl-1000-14400-10g-heap.png, zk-exception-stacktrace.rtf
>
>
> The test plan is to simulate 'n' users firing 'm' queries at Atlas simultaneously. This is accomplished with the help of Apache JMeter.
> Atlas is populated with 10,000 tables.
> • 6000 small tables (10 columns)
> • 3000 medium tables (50 columns)
> • 1000 large tables (100 columns)
> The test plan consists of 30 users firing a set of 3 queries 20 times in a loop. Added -Xmx10240m -XX:MaxPermSize=512m to ATLAS_OPTS. Zookeeper throws exceptions when the test plan is run and JMeter starts firing queries.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)