gora-commits mailing list archives

From build...@apache.org
Subject svn commit: r923971 - in /websites/staging/gora/trunk/content: ./ current/index.html
Date Mon, 29 Sep 2014 00:47:12 GMT
Author: buildbot
Date: Mon Sep 29 00:47:11 2014
New Revision: 923971

Staging update by buildbot for gora

    websites/staging/gora/trunk/content/   (props changed)

Propchange: websites/staging/gora/trunk/content/
--- cms:source-revision (original)
+++ cms:source-revision Mon Sep 29 00:47:11 2014
@@ -1 +1 @@

Modified: websites/staging/gora/trunk/content/current/index.html
--- websites/staging/gora/trunk/content/current/index.html (original)
+++ websites/staging/gora/trunk/content/current/index.html Mon Sep 29 00:47:11 2014
@@ -163,10 +163,20 @@ under the License. 
 <li><a href="#gora-modules">Gora Modules</a></li>
 <li><a href="#gora-testing">Gora Testing</a><ul>
 <li><a href="#junit-tests">JUnit Tests</a></li>
-<li><a href="#goraci-integration-testsing-suite">GoraCI Integration Testsing
+<li><a href="#goraci-integration-testsing-suite">GoraCI Integration Testsing
+<li><a href="#background">Background</a></li>
+<li><a href="#the-anatomy-of-goraci-tests">The Anatomy of GoraCI tests</a></li>
+<li><a href="#building-goraci">Building GoraCI</a></li>
+<li><a href="#java-class-description">Java Class Description</a></li>
+<li><a href="#gora-and-hadoop">GORA AND HADOOP</a></li>
+<li><a href="#goraci-and-hbase">GORACI AND HBASE</a></li>
+<li><a href="#concurrency">CONCURRENCY</a></li>
+<li><a href="#conclusions">CONCLUSIONS</a></li>
 <p>This is the main entry point for Gora documentation. Here are some pointers for
further info:</p>
@@ -217,6 +227,204 @@ modules contain a <code>/src/examples/</
 classes can be found. Specifically, there are some classes that are used for tests 
 under <a href="https://github.com/apache/gora/tree/master/gora-core/src/examples">gora-core/src/examples/</a>.</p>
 <h3 id="goraci-integration-testsing-suite">GoraCI Integration Testing Suite</h3>
+<h4 id="background">Background</h4>
+<p>Since Gora 0.5, the GoraCI suite has been part of the mainstream Gora codebase.</p>
+<p>Credit for GoraCI goes to Keith Turner (Gora PMC member), who had the foresight
+to develop it. We have since extended it from gora-accumulo to the entire suite
+of Gora modules.</p>
+<p><a href="http://accumulo.apache.org">Apache Accumulo</a> has a test suite that verifies that data is not lost
+at scale.  This test suite is called 
+<a href="http://svn.apache.org/viewvc/accumulo/tags/1.4.0/test/system/continuous/ScaleTest.odp?view=co">continuous ingest</a>.<br />
+Essentially the test runs many ingest clients that continually create linked lists containing <strong>25 million</strong>
+nodes. At some point the clients are stopped and a map reduce job is run to
+ensure no linked list has a hole. A hole indicates data was lost.    </p>
+<p>The nodes in the linked list are random.  This causes each linked list to
+spread across the table.  Therefore if one part of a table loses data, then it
+will be detected by references in another part of the table.</p>
+<p>GoraCI is a version of this test suite written using Apache Gora [1].
+It has been tested against Accumulo and HBase.  </p>
+<h4 id="the-anatomy-of-goraci-tests">The Anatomy of GoraCI tests</h4>
+<p>Below is a rough sketch of how data is written.  For specific details, look at the
+<a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java">Generator
+<li>Write out 1 million nodes </li>
+<li>Flush the client </li>
+<li>Write out 1 million nodes that reference the previous million </li>
+<li>If this is the 25th set of 1 million nodes, then update the 1st set of 1 million
+    to point to the last </li>
+<li>Go to step 1</li>
+<p>The key is that nodes only reference flushed nodes.  Therefore a node should
+never reference a missing node, even if the ingest client is killed at any
+point in time.</p>
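+<p>The write pattern above can be illustrated with a small, scaled-down sketch. This is not the actual Generator code: <code>BufferedStore</code>, <code>generate_loop</code>, <code>WIDTH</code> and <code>WRAP</code> are hypothetical names, the store is an in-memory stand-in for a buffering datastore client, and the batch size and batch count are shrunk from 1 million and 25 so that it runs instantly. It only aims to show that every reference targets a node that has already been flushed.</p>

```python
import random

WIDTH = 4  # nodes per batch (1 million in the text; scaled down here)
WRAP = 5   # batches per loop (25 in the text; scaled down here)


class BufferedStore:
    """Hypothetical stand-in for a datastore client that buffers writes."""

    def __init__(self):
        self.flushed = {}  # key -> referenced key (or None); durable after flush()
        self.pending = {}  # writes that have not been flushed yet

    def put(self, key, ref):
        self.pending[key] = ref

    def flush(self):
        self.flushed.update(self.pending)
        self.pending.clear()


def generate_loop(store, rng):
    """Write WRAP batches of WIDTH nodes, then close the lists into loops."""
    first = [rng.getrandbits(64) for _ in range(WIDTH)]
    for key in first:
        store.put(key, None)  # first batch references nothing yet
    store.flush()
    prev = first
    for _ in range(WRAP - 1):
        current = [rng.getrandbits(64) for _ in range(WIDTH)]
        for key, ref in zip(current, prev):
            # a new node may only reference a node that is already flushed
            assert ref in store.flushed
            store.put(key, ref)
        store.flush()
        prev = current
    for key, ref in zip(first, prev):
        store.put(key, ref)  # last step: first batch now points to the last batch
    store.flush()
    return first


store = BufferedStore()
first_batch = generate_loop(store, random.Random(0))
```

+<p>Because a flush happens before any node is referenced, killing this loop between any two statements would leave no flushed node pointing at a missing one. Following the references from a node in the first batch returns to that node after <code>WRAP</code> steps.</p>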
+<p>When running this test suite with Accumulo, a script called the Agitator runs in
+parallel and randomly and continuously kills server processes.<br />
+Doing this uncovered many data loss bugs in Accumulo. 
+This test suite can also help find bugs that impact uptime and stability when 
+run for days or weeks.  </p>
+<p>This test suite consists of the following: </p>
+<li>a few Java programs </li>
+<li>a little helper script to run the Java programs</li>
+<li>a Maven script to build it.  </li>
+<p>When generating data, it's best to have each map task generate a multiple of 25
+million nodes.  The reason for this is that circular linked lists are generated every
+25M.  Not generating a multiple of 25M will result in some nodes in the linked
+list not having references.  The loss of an unreferenced node cannot be
+<h4 id="building-goraci">Building GoraCI</h4>
+<p>As GoraCI is packaged with the Gora master branch source, it is automatically 
+built every time you execute</p>
+<p><code>mvn install</code></p>
+<p>The Maven pom file has some profiles that attempt to make it easier to run
+GoraCI against different Gora backends by copying the jars you need into <code>lib</code>.
+Before packaging, it's important to edit <code>gora.properties</code> and set it
+for your datastore.  To run against Accumulo, do the following.</p>
+  vim src/main/resources/gora.properties  # set Accumulo properties
+  mvn package -Paccumulo-1.4
+<p>To run against HBase, do the following.</p>
+  vim src/main/resources/gora.properties  # set HBase properties
+  mvn package -Phbase-0.92
+<p>To run against Cassandra, do the following.</p>
+  vim src/main/resources/gora.properties  # set Cassandra properties
+  mvn package -Pcassandra-1.1.2
+<p>For other datastores mentioned in <code>gora.properties</code>, you will need to copy the
+appropriate dependencies into <code>lib</code>.  Feel free to update the pom with other profiles,
+<a href="https://issues.apache.org/jira/browse/GORA/">open
+a ticket</a> or just <a href="https://github.com/apache/gora/">send us a pull
+<h4 id="java-class-description">Java Class Description</h4>
+<p>Below is a description of the Java programs</p>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java">org.apache.gora.goraci.Generator</a>
- A map-only job that generates data.  As stated previously, 
+                       it's best to generate data in multiples of 25M.</li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Verify.java">org.apache.gora.goraci.Verify</a>
   - A map reduce job that looks for holes.  Look at the
+                       counts after running.  REFERENCED and UNREFERENCED are 
+                       OK; any UNDEFINED counts are bad. Do not run at the 
+                       same time as the Generator.</li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Walker.java">org.apache.gora.goraci.Walker</a>
   - A standalone program that starts following a linked list 
+                       and emits timing info.  </li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Print.java">org.apache.gora.goraci.Print</a>
    - A standalone program that prints nodes in the linked list</li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Delete.java">org.apache.gora.goraci.Delete</a>
   - A standalone program that deletes a single node</li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Loop.java">org.apache.gora.goraci.Loop</a>
     - Runs generation and verify in a loop</li>
+<p>org.apache.gora.goraci.sh is a helper script that you can use to run the above programs.
+It assumes all needed jars are in the lib dir.  It does not need the package name.
+You can just run "./org.apache.gora.goraci.sh Generator"; below is an example.</p>
+<p>$ ./org.apache.gora.goraci.sh Generator
+  Usage : Generator <num mappers> <num nodes></p>
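+<p>As a quick sanity check on the node count, the arithmetic behind the "multiple of 25 million" advice can be sketched as follows. The helper name <code>loop_breakdown</code> is hypothetical, and the sketch assumes that the node count passed to Generator is the number of nodes each map task writes.</p>

```python
LOOP_SIZE = 25_000_000  # nodes per circular linked list, as described in the text


def loop_breakdown(num_nodes, loop_size=LOOP_SIZE):
    """Return (complete loops, nodes left over outside any closed loop)."""
    return divmod(num_nodes, loop_size)


complete, leftover = loop_breakdown(100_000_000)
```

+<p>A count of 100000000 per map task gives four complete loops and no leftover nodes. A count that is not a multiple of 25 million leaves a non-zero remainder, that is, nodes written after the last closed loop.</p>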
+<p>For Gora to work, it needs a gora.properties file and a mapping file on the
+classpath.  The contents of both are datastore specific;
+more details can be found here [2]. You can edit the ones in src/main/resources
+and build the org.apache.gora.goraci-${version}-SNAPSHOT.jar with those. Alternatively, remove
+those and put them on the classpath through some other means.</p>
+<h2 id="gora-and-hadoop">GORA AND HADOOP</h2>
+<p>Gora uses Avro, which uses a JSON library that Hadoop ships an old version of.
+The two libraries jackson-core and jackson-mapper need to be updated in
+<HADOOP_HOME>/lib and <HADOOP_HOME>/share/hadoop/lib/.  I updated these to
+jackson-core-asl-1.4.2.jar and jackson-mapper-asl-1.4.2.jar.  For details see
+HADOOP-6945 [3]. </p>
+<h2 id="goraci-and-hbase">GORACI AND HBASE</h2>
+<p>To improve the performance of read jobs such as the Verify step, enable
+scanner caching on the command line.  For example:</p>
+<div class="codehilite"><pre>$ <span class="o">./</span><span
class="n">gorachi</span><span class="p">.</span><span class="n">sh</span>
<span class="n">Verify</span> <span class="o">-</span><span class="n">Dhbase</span><span
class="p">.</span><span class="n">client</span><span class="p">.</span><span
class="n">scanner</span><span class="p">.</span><span class="n">caching</span><span
class="p">=</span>1000 <span class="o">\</span>
+     <span class="o">-</span><span class="n">Dmapred</span><span
class="p">.</span><span class="n">map</span><span class="p">.</span><span
class="n">tasks</span><span class="p">.</span><span class="n">speculative</span><span
class="p">.</span><span class="n">execution</span><span class="p">=</span><span
class="n">false</span> <span class="n">verify_dir</span> 1000
+<p>Depending on how Hadoop and HBase are deployed, you may need to
+adjust the gorachi.sh script.  Here is one suggestion that may help
+when your Hadoop and HBase configuration lives somewhere other than under the
+Hadoop and HBase home directories.</p>
+<p>diff --git a/org.apache.gora.goraci.sh b/org.apache.gora.goraci.sh
+  index db1562a..31c3c94 100755
+  --- a/org.apache.gora.goraci.sh
+  +++ b/org.apache.gora.goraci.sh
+  @@ -95,6 +95,4 @@ done
+   #run it
+   LIBJARS=`echo $HADOOP_CLASSPATH | tr : ,`
+  -hadoop jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -libjars "$LIBJARS" "$@"
+  -
+  -
+  +CLASSPATH="${HBASE_CONF_DIR}" hadoop --config "${HADOOP_CONF_DIR}" jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -files "${HBASE_CONF_DIR}/hbase-site.xml" -libjars "$LIBJARS" "$@"</p>
+<p>You will need to define HBASE_CONF_DIR and HADOOP_CONF_DIR before you run your
+org.apache.gora.goraci jobs.  For example:</p>
+<p>$ export HADOOP_CONF_DIR=/home/you/hadoop-conf
+  $ export HBASE_CONF_DIR=/home/you/hbase-conf
+  $ PATH=/home/you/hadoop-1.0.2/bin:$PATH ./org.apache.gora.goraci.sh Generator 1000 1000000</p>
+<h2 id="concurrency">CONCURRENCY</h2>
+<p>It's possible to run verification at the same time as generation.  To do this,
+supply the -c option to Generator and Verify.  This will cause Generator to
+create a secondary table which holds information about what verification can
+safely verify.  Running Verify with the -c option will make it run slower
+because more information must be brought back to the client side for filtering
+purposes.  The Loop program also supports the -c option, which will cause it to
+run verification concurrently with generation.</p>
+<p>If verification is run at the same time as generation without the -c option,
+then it will inevitably fail.  This is because verification mappers read
+different parts of the table at different times, which gives an inconsistent view
+of the table.  One mapper may read a part of the table before a node is
+written; when that node is later referenced, it will appear to be missing.  The
+-c option basically filters out newer information using data written to the
+secondary table.</p>
+<h2 id="conclusions">CONCLUSIONS</h2>
+<p>This test suite does not do everything that the Accumulo test suite does;
+mainly, it does not collect statistics and generate reports.  The reports
+are useful for assessing performance.</p>
+<p>The following shows a test of the test: ingest one linked list, delete a node
+in it, and ensure that the verification map reduce job notices that the node is missing.
+Not all output is shown, just the important parts.</p>
+<p>$ ./org.apache.gora.goraci.sh Generator  1 25000000
+  $ ./org.apache.gora.goraci.sh Print -s 2000000000000000 -l 1
+  2000001f65dbd238:30350f9ae6f6e8f7:000004265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
+  $ ./org.apache.gora.goraci.sh Print -s 30350f9ae6f6e8f7 -l 1
+  30350f9ae6f6e8f7:4867fe03de6ea6c8:000003265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
+  $ ./org.apache.gora.goraci.sh Delete 30350f9ae6f6e8f7
+  Delete returned true
+  $ ./org.apache.gora.goraci.sh Verify gci_verify_1 2 
+  11/12/20 17:12:31 INFO mapred.JobClient:   org.apache.gora.goraci.Verify$Counts
+  11/12/20 17:12:31 INFO mapred.JobClient:     UNDEFINED=1
+  11/12/20 17:12:31 INFO mapred.JobClient:     REFERENCED=24999998
+  11/12/20 17:12:31 INFO mapred.JobClient:     UNREFERENCED=1
+  $ hadoop fs -cat gci_verify_1/part*
+  30350f9ae6f6e8f7  2000001f65dbd238</p>
+<p>The map reduce job found the one undefined node and gave the node that
+referenced it.</p>
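+<p>The three counters can be reproduced on a toy table. The sketch below is an in-memory illustration only, not the Verify map reduce job itself; the function name <code>verify</code> and the dictionary layout (node key mapped to the key it references) are assumptions made for the example.</p>

```python
from collections import Counter


def verify(table):
    """Classify every key that is defined or referenced in the table.

    table maps a node key to the key of the node it references (or None).
    """
    defined = set(table)
    referenced = {ref for ref in table.values() if ref is not None}
    counts = Counter()
    for key in defined | referenced:
        if key not in defined:
            counts["UNDEFINED"] += 1      # referenced but missing: a hole
        elif key in referenced:
            counts["REFERENCED"] += 1
        else:
            counts["UNREFERENCED"] += 1   # present, but nothing points at it
    return counts


# A circular list of 10 nodes in which node i references node i - 1.
table = {i: (i - 1) % 10 for i in range(10)}
del table[3]  # simulate the Delete step
counts = verify(table)  # UNDEFINED=1, UNREFERENCED=1, REFERENCED=8
```

+<p>Deleting one node from a closed loop yields one UNDEFINED key (the deleted node, still referenced by its neighbour) and one UNREFERENCED key (the node that only the deleted node pointed at), which mirrors the 1 / 24999998 / 1 split in the output above.</p>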
+<p>Below are some timing statistics for running org.apache.gora.goraci on a 10-node cluster. </p>
+<p>Store           | Task                   | Time    | Undef  | Unref | Ref        <br />
+  ----------------+------------------------+---------+--------+-------+------------<br />
+  accumulo-1.4.0  | Generator 10 100000000 | 40m 16s |    N/A |   N/A |        N/A<br />
+  accumulo-1.4.0  | Verify /tmp/goraci1 40 |  6m  7s |      0 |     0 | 1000000000<br />
+  hbase-0.92.1    | Generator 10 100000000 |  2h 44m |    N/A |   N/A |        N/A<br />
+  hbase-0.92.1    | Verify /tmp/goraci2 40 |  6m 34s |      0 |     0 | 1000000000</p>
+<p>HBase and Accumulo are configured differently out-of-the-box.  We used the Accumulo
+3G, native configuration examples in the conf/examples directory.</p>
+<p>To provide a comparable memory footprint, we increased the HBase JVM heap to "-Xmx4000m"
+and turned on compression for the ci table:</p>
+<p>create 'ci', {NAME=&gt;'meta', COMPRESSION=&gt;'GZ'}</p>
+<p>We also turned down the replication of write-ahead logs to be comparable to Accumulo:</p>
+<p><property>
+    <name>hbase.regionserver.hlog.replication</name>
+    <value>2</value>
+  </property></p>
+<p>For the Accumulo run, we set the split threshold to 512M:</p>
+<p>shell&gt; config -t ci -s table.split.threshold=512M</p>
+<p>This was done so that Accumulo would end up with 64 tablets, which is the
+number of regions HBase had.   The number of tablets/regions determines how
+much parallelism there is in the map phase of the verify step.</p>
+<p>Sometimes data is lost when this test suite is run against HBase.  This issue
+is being tracked under HBASE-5754 [4].</p>
+<p>[0] http://accumulo.apache.org
+[1] http://gora.apache.org
+[2] http://gora.apache.org/docs/current/gora-conf.html
+[3] https://issues.apache.org/jira/browse/HADOOP-6945
+[4] https://issues.apache.org/jira/browse/HBASE-5754</p>
   </div> <!-- /container (main block) -->
