gora-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lewi...@apache.org
Subject svn commit: r1628108 - /gora/site/trunk/content/current/index.md
Date Mon, 29 Sep 2014 00:47:00 GMT
Author: lewismc
Date: Mon Sep 29 00:47:00 2014
New Revision: 1628108

URL: http://svn.apache.org/r1628108
Log:
Update GoraCI documentation

Modified:
    gora/site/trunk/content/current/index.md

Modified: gora/site/trunk/content/current/index.md
URL: http://svn.apache.org/viewvc/gora/site/trunk/content/current/index.md?rev=1628108&r1=1628107&r2=1628108&view=diff
==============================================================================
--- gora/site/trunk/content/current/index.md (original)
+++ gora/site/trunk/content/current/index.md Mon Sep 29 00:47:00 2014
@@ -58,4 +58,255 @@ modules contain a <code>/src/examples/</
 classes can be found. Specifically, there are some classes that are used for tests 
 under [gora-core/src/examples/](https://github.com/apache/gora/tree/master/gora-core/src/examples).
 
-###GoraCI Integration Testsing Suite
\ No newline at end of file
+###GoraCI Integration Testsing Suite
+
+####Background
+Since Gora 0.5, the GoraCI suite has been part of the mainstream Gora codebase.
+
+Credit for GoraCI can be handed to Keith Turner (Gora PMC member) for his foresight
+in developing GoraCI which we have now extended from gora-accumulo to the entire suite
+of Gora modules.
+
+[Apache Accumulo](http://accumulo.apache.org) has a test suite that verifies that data is
not lost
+at scale.  This test suite is called 
+[continuous ingest](http://svn.apache.org/viewvc/accumulo/tags/1.4.0/test/system/continuous/ScaleTest.odp?view=co).
 
+Essentially the test runs many ingest clients that continually create linked lists containing
**25 million**
+nodes. At some point the clients are stopped and a map reduce job is run to
+ensure no linked list has a hole. A hole indicates data was lost.    
+
+The nodes in the linked list are random.  This causes each linked list to
+spread across the table.  Therefore if one part of a table loses data, then it
+will be detected by references in another part of the table.
+
+This project is a version of the test suite written using Apache Gora [1].
+Goraci has been tested against Accumulo and HBase.  
+
+####The Anatomy of GoraCI tests
+
+Below is rough sketch of how data is written.  For specific details look at the
+[Generator code](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java)
+
+  1. Write out 1 million nodes 
+  2. Flush the client 
+  3. Write out 1 million that reference previous million 
+  4. If this is the 25th set of 1 million nodes, then update 1st set of million
+    to point to last 
+  5. goto 1
+
+The key is that nodes only reference flushed nodes.  Therefore a node should
+never reference a missing node, even if the ingest client is killed at any
+point in time.
+
+When running this test suite w/ Accumulo there is a script running in parallel
+called the Aggitator that randomly and continuously kills server processes.  
+The outcome was that many data loss bugs were found in Accumulo by doing this. 
+This test suite can also help find bugs that impact uptime and stability when 
+run for days or weeks.  
+
+This test suite consists the following 
+
+ * a few Java programs 
+ * a little helper script to run the java programs
+ * a maven script to build it.  
+
+When generating data, its best to have each map task generate a multiple of 25
+million.  The reason for this is that circular linked list are generated every
+25M.  Not generating a multiple in 25M will result in some nodes in the linked
+list not having references.  The loss of an unreferenced node can not be
+detected.
+
+####Building GoraCI
+
+As GoraCI is packaged with the Gora master branch source it is automatically 
+built every time you execute
+
+<code>mvn install</code>
+
+The maven pom file has some profiles that attempt to make it easier to run
+GoraCI against different Gora backends by copying the jars you need into <code>lib</code>.
+Before packaging its important to edit <code>gora.properties</code> and set it
correctly
+for your datastore.  To run against Accumulo do the following.
+
+<code>
+  vim src/main/resources/gora.properties //set Accumulo properties
+  mvn package -Paccumulo-1.4
+</code>
+
+To run against HBase, do the following.
+
+<code>
+  vim src/main/resources/gora.properties //set HBase properties
+  mvn package -Phbase-0.92
+</code>
+
+To run against Cassandra, do the following.
+
+<code>
+  vim src/main/resources/gora.properties //set Cassandra properties
+  mvn package -Pcassandra-1.1.2
+</code>
+
+For other datastores mentioned in <code>gora.properties</code>, you will need
to copy the
+appropriate deps into <code>lib</code>.  Feel free to update the pom with other
profiles, [open
+a ticket](https://issues.apache.org/jira/browse/GORA/) or just [send us a pull request](https://github.com/apache/gora/).
+
+####Java Class Description
+
+Below is a description of the Java programs
+
+  * [org.apache.gora.goraci.Generator](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java)
- A map only job that generates data.  As stated previously, 
+                       its best to generate data in multiples of 25M.
+  * [org.apache.gora.goraci.Verify](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Verify.java)
   - A map reduce job that looks for holes.  Look at the
+                       counts after running.  REFERENCED and UNREFERENCED are 
+                       ok, any UNDEFINED counts are bad. Do not run at the 
+                       same time as the Generator.
+  * [org.apache.gora.goraci.Walker](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Walker.java)
   - A standalong program that start following a linked list 
+                       and emits timing info.  
+  * [org.apache.gora.goraci.Print](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Print.java)
    - A standalone program that prints nodes in the linked list
+  * [org.apache.gora.goraci.Delete](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Delete.java)
   - A standalone program that deletes a single node
+  * [org.apache.gora.goraci.Loop](https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Loop.java)
     - Runs generation and verify in a loop
+
+org.apache.gora.goraci.sh is a helper script that you can use to run the above programs.
 It
+assumes all needed jars are in the lib dir.  It does not need the package name.
+You can just run "./org.apache.gora.goraci.sh Generator", below is an example.
+
+  $ ./org.apache.gora.goraci.sh Generator
+  Usage : Generator <num mappers> <num nodes>
+
+For Gora to work, it needs a gora.properties file on the classpath and a
+mapping file on the classpath, the contents of both are datastore specific,
+more details can be found here [2]. You can edit the ones in src/main/resources
+and build the org.apache.gora.goraci-${version}-SNAPSHOT.jar with those. Alternatively remove
+those and put them on the classpath through some other means.
+
+GORA AND HADOOP
+-----------------
+
+Gora uses Avro which uses a Json library that Hadoop has an old version of.
+The two libraries  jackson-core and jackson-mapper need to be updated in
+<HADOOP_HOME>/lib and <HADOOP_HOME>/share/hadoop/lib/.  I updated these to
+jackson-core-asl-1.4.2.jar and jackson-mapper-asl-1.4.2.jar.  For details see
+HADOOP-6945 [3]. 
+
+GORACI AND HBASE
+-----------------
+
+To improve performance running read jobs such as the Verify step, enable
+scanner caching on the command line.  For example:
+
+    $ ./gorachi.sh Verify -Dhbase.client.scanner.caching=1000 \
+         -Dmapred.map.tasks.speculative.execution=false verify_dir 1000
+
+Dependent on how you have your hadoop and hbase deployed, you may need to
+change the gorachi.sh script around some.  Here is one suggestion that may help
+in the case where your hadoop and hbase configuration are other than under the
+hadoop and hbase home directories.
+
+  diff --git a/org.apache.gora.goraci.sh b/org.apache.gora.goraci.sh
+  index db1562a..31c3c94 100755
+  --- a/org.apache.gora.goraci.sh
+  +++ b/org.apache.gora.goraci.sh
+  @@ -95,6 +95,4 @@ done
+   #run it
+   export HADOOP_CLASSPATH="$CLASSPATH"
+   LIBJARS=`echo $HADOOP_CLASSPATH | tr : ,`
+  -hadoop jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -libjars
"$LIBJARS" "$@"
+  -
+  -
+  +CLASSPATH="${HBASE_CONF_DIR}" hadoop --config "${HADOOP_CONF_DIR} jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar"
$CLASS -files "${HBASE_CONF_DIR}/hbase-site.xml" -libjars "$LIBJARS" "$@"
+
+You will need to define HBASE_CONF_DIR and HADOOP_CONF_DIR before you run your
+org.apache.gora.goraci jobs.  For example:
+
+  $ export HADOOP_CONF_DIR=/home/you/hadoop-conf
+  $ export HBASE_CONF_DIR=/home/you/hbase-conf
+  $ PATH=/home/you/hadoop-1.0.2/bin:$PATH ./org.apache.gora.goraci.sh Generator 1000 1000000
+
+CONCURRENCY
+------------
+
+Its possible to run verification at the same time as generation.  To do this
+supply the -c option to Generator and Verify.  This will cause Genertor to
+create a secondary table which holds information about what verification can
+safely verify.  Running Verify with the -c option will make it run slower
+because more information must be brought back to the client side for filtering
+purposes.  The Loop program also supports the -c option, which will cause it to
+run verification concurrently with generation.
+
+If verification is run at the same time as generation without the -c option,
+then it will inevitably fail.  This is because verification mappers read
+different parts of the table at different times and giving an inconsistent view
+of the table.  So one mapper may read a part of a table before a node is
+written, when the node is later referenced it will appear to be missing.  The
+-c option basically filters out newer information using data written to the
+secondary table.
+
+CONCLUSIONS
+------------
+
+This test suite does not do everything that the Accumulo test suite does,
+mainly it does not collect statistics and generate reports.  The reports
+are useful for assesing performance.
+
+Below shows running a test of the test.  Ingest one linked list, deleted a node
+in it, ensure the verifaction map reduce job notices that the node is missing.
+Not all output is shown, just the important parts.
+
+  $ ./org.apache.gora.goraci.sh Generator  1 25000000
+  $ ./org.apache.gora.goraci.sh Print -s 2000000000000000 -l 1
+  2000001f65dbd238:30350f9ae6f6e8f7:000004265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
+  $ ./org.apache.gora.goraci.sh Print -s 30350f9ae6f6e8f7 -l 1
+  30350f9ae6f6e8f7:4867fe03de6ea6c8:000003265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
+  $ ./org.apache.gora.goraci.sh Delete 30350f9ae6f6e8f7
+  Delete returned true
+  $ ./org.apache.gora.goraci.sh Verify gci_verify_1 2 
+  11/12/20 17:12:31 INFO mapred.JobClient:   org.apache.gora.goraci.Verify$Counts
+  11/12/20 17:12:31 INFO mapred.JobClient:     UNDEFINED=1
+  11/12/20 17:12:31 INFO mapred.JobClient:     REFERENCED=24999998
+  11/12/20 17:12:31 INFO mapred.JobClient:     UNREFERENCED=1
+  $ hadoop fs -cat gci_verify_1/part\*
+  30350f9ae6f6e8f7	2000001f65dbd238
+
+The map reduce job found the one undefined node and gave the node that
+referenced it.
+
+Below are some timing statistics for running org.apache.gora.goraci on a 10 node cluster.

+
+  Store           | Task                   | Time    | Undef  | Unref | Ref        
+  ----------------+------------------------+---------+--------+-------+------------
+  accumulo-1.4.0  | Generator 10 100000000 | 40m 16s |    N/A |   N/A |        N/A     
+  accumulo-1.4.0  | Verify /tmp/goraci1 40 |  6m  7s |      0 |     0 | 1000000000  
+  hbase-0.92.1    | Generator 10 100000000 |  2h 44m |    N/A |   N/A |        N/A     
+  hbase-0.92.1    | Verify /tmp/goraci2 40 |  6m 34s |      0 |     0 | 1000000000
+
+Hbase and Accumulo are configured differently out-of-the-box.  We used the Accumulo 
+3G, native configuration examples in the conf/examples directory.
+
+To provide a comparable memory footprint, we increased the HBase jvm to "-Xmx4000m", 
+and turned on compression for the ci table:
+
+create 'ci', {NAME=>'meta', COMPRESSION=>'GZ'}
+
+We also turned down the replication of write-ahead logs to be comparable to Accumulo:
+
+  <property>
+    <name>hbase.regionserver.hlog.replication</name>
+    <value>2</value>
+  </property>
+
+For the accumulo run, we set the split threshold to 512M:
+
+ shell> config -t ci -s table.split.threshold=512M
+
+This was done so that Accumulo would end up with 64 tablets, which is the
+number of regions hbase had.   The number of tablets/regions determines how
+much parallelism there is in the map phase of the verify step.
+
+Sometimes when this test suite is run against HBase data is lost.  This issue
+is being tracked under HBASE-5754 [4].
+
+[0] http://accumulo.apache.org
+[1] http://gora.apache.org
+[2] http://gora.apache.org/docs/current/gora-conf.html
+[3] https://issues.apache.org/jira/browse/HADOOP-6945
+[4] https://issues.apache.org/jira/browse/HBASE-5754



Mime
View raw message