    Apache Accumulo User Manual: Table Design

    Next: High-Speed Ingest  Up: Apache Accumulo User Manual Version 1.4  Previous: Table Configuration  Contents

    Subsections

    • Basic Table
    • RowID Design
    • Indexing
    • Entity-Attribute and Graph Tables
    • Document-Partitioned Indexing

    Basic Table

    Consider a table of user records, each identified by a userid and carrying fields such as age, address, and account balance. We might choose to store this data using the userid as the rowID and the rest of the data in column families:

    Mutation m = new Mutation(new Text(userid));
    // Mutation.put takes a column family, a column qualifier, and a Value;
    // here an empty qualifier is used and each field (assumed a String) is stored as bytes
    m.put(new Text("age"), new Text(""), new Value(age.getBytes()));
    m.put(new Text("address"), new Text(""), new Value(address.getBytes()));
    m.put(new Text("balance"), new Text(""), new Value(account_balance.getBytes()));

    writer.addMutation(m);
       
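    The writer above is an Accumulo BatchWriter. A minimal sketch of obtaining one from a Connector, assuming the 1.4 API (the buffer size, latency, and thread count below are illustrative assumptions):

    // hypothetical settings: 1MB buffer, 60s max latency, 2 write threads
    BatchWriter writer = conn.createBatchWriter("userdata", 1000000L, 60000L, 2);
    // ... writer.addMutation(m) calls ...
    writer.close();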

      We could then retrieve any of the columns for a specific userid by specifying the userid as the range of a scanner and fetching specific columns:

    Range r = new Range(userid, userid); // single row
    Scanner s = conn.createScanner("userdata", auths);
    s.setRange(r);
    s.fetchColumnFamily(new Text("age"));

    for(Entry<Key,Value> entry : s)
        System.out.println(entry.getValue().toString());

      RowID Design

      Often it is necessary to transform the rowID in order to have rows ordered in a way that is optimal for anticipated access patterns. A good example of this is reversing the order of components of internet domain names in order to group rows of the same parent domain together:

    com.google.code
    com.google.labs
    com.google.mail
    com.yahoo.mail
    com.yahoo.research
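
    Such reversed rowIDs are typically generated at ingest time. A minimal sketch (the helper name is an assumption, not part of the Accumulo API):

    // reverse the components of a domain name: "code.google.com" -> "com.google.code"
    static String reverseDomain(String domain) {
        String[] parts = domain.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }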

Some data may result in the creation of very large rows - rows with many columns. In this case the table designer may wish to split up these rows for better load balancing while keeping them sorted together for scanning purposes. This can be done by appending a random substring to the end of the rowID:

    com.google.code_00
    com.google.code_01
    com.google.code_02
    com.google.labs_00
    com.google.mail_00
    com.google.mail_01

It could also be done by appending a string representation of some period of time, such as the date to the nearest week or month:

    com.google.code_201003
    com.google.code_201004
    com.google.code_201005
    com.google.labs_201003
    com.google.mail_201003
    com.google.mail_201004
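
    A sketch of both suffixing schemes (helper names and the shard count are illustrative assumptions):

    // append a random two-digit shard suffix: "com.google.code" -> "com.google.code_07"
    static String shardRowID(String rowID, Random rand, int numShards) {
        return String.format("%s_%02d", rowID, rand.nextInt(numShards));
    }

    // append a month suffix: "com.google.code" -> "com.google.code_201003"
    static String monthRowID(String rowID, int year, int month) {
        return String.format("%s_%04d%02d", rowID, year, month);
    }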

    Indexing

    To support lookups by value rather than by rowID, a secondary index table can be built in which each value of interest is stored as the rowID and the rowIDs of the matching records in the main table are stored in the column qualifiers.

Note: We store rowIDs in the column qualifier rather than the Value so that we can have more than one rowID associated with a particular term within the index. If we stored rowIDs in the Value, we would only see one of the rows in which the term appears, since Accumulo is configured by default to return only the most recent value associated with a key.
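
    For example, a minimal sketch of writing a single index entry under this scheme (the column family name "field", the empty Value, and the indexWriter are assumptions):

    // index entry: the term is the rowID; the main table's rowID goes in the column qualifier
    Mutation indexEntry = new Mutation(new Text("mySearchTerm"));
    indexEntry.put(new Text("field"), new Text(mainRowID), new Value(new byte[0]));
    indexWriter.addMutation(indexEntry);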

Lookups can then be done by scanning the index table first for occurrences of the desired values in the columns specified, which returns a list of rowIDs from the main table. These can then be used to retrieve each matching record, in its entirety or as a subset of its columns, from the main table.

To support efficient lookups of multiple rowIDs from the same table, the Accumulo client library provides a BatchScanner. Users specify a set of Ranges to the BatchScanner, which performs the lookups in multiple threads to multiple servers and returns an Iterator over all the rows retrieved. The rows returned are NOT in sorted order, unlike with the basic Scanner interface.

    // first we scan the index for the IDs of rows matching our query
    Text term = new Text("mySearchTerm");

    HashSet<Range> matchingRows = new HashSet<Range>();

    Scanner indexScanner = conn.createScanner("index", auths);
    indexScanner.setRange(new Range(term, term));

    // we retrieve the matching rowIDs and convert them into a set of ranges
    for(Entry<Key,Value> entry : indexScanner)
        matchingRows.add(new Range(entry.getKey().getColumnQualifier()));

    // now we pass the set of ranges to the batch scanner to retrieve the rows
    BatchScanner bscan = conn.createBatchScanner("table", auths, 10);

    bscan.setRanges(matchingRows);
    bscan.fetchColumnFamily(new Text("attributes"));

    for(Entry<Key,Value> entry : bscan)
        System.out.println(entry.getValue());
       

    Entity-Attribute and Graph Tables

      The physical schema for an entity-attribute or graph table is as follows:

    [converted table]

    For example, to keep track of employees, managers, and products, the following entity-attribute table could be used. Note that the weights are not always necessary and are set to 0 when not used.

    [converted table]

To allow efficient updating of edge weights, an aggregating iterator can be configured to sum the values of all mutations applied with the same key. These types of tables can easily be created from raw events by simply extracting the entities, attributes, and relationships from individual events and inserting the keys into Accumulo, each with a count of 1. The aggregating iterator will take care of maintaining the edge weights.
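
    A sketch of such a configuration using the SummingCombiner that ships with Accumulo (the table name, column family, and iterator priority are assumptions):

    // sum the values of all cells in the (hypothetical) "count" column family
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    SummingCombiner.setColumns(setting,
        Collections.singletonList(new IteratorSetting.Column("count")));
    SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
    conn.tableOperations().attachIterator("graphTable", setting);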

      Document-Partitioned Indexing

Using a simple index as described above works well when looking for records that match one of a set of given criteria. When looking for records that match more than one criterion simultaneously, such as when looking for documents that contain all of the words `the', `white', and `house', there are several issues.

      One solution is to partition documents into bins and build a separate index within each bin; the physical schema for such a document-partitioned index is as follows:

    [converted table]

      Documents or records are mapped into bins by a user-defined ingest application. By storing the BinID as the RowID we ensure that all the information for a particular bin is contained in a single tablet and hosted on a single TabletServer since Accumulo never splits rows across tablets. Storing the Terms as column families serves to enable fast lookups of all the documents within this bin that contain the given term.
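
    A sketch of writing a single entry into such a partitioned index (the bin count, the zero-padded bin format, the empty Value, and the variable names are assumptions):

    // assign the document to one of numBins bins
    int binID = (docID.hashCode() & 0x7fffffff) % numBins;
    Mutation m = new Mutation(new Text(String.format("%04d", binID)));
    // term as the column family, docID as the column qualifier
    m.put(new Text(term), new Text(docID), new Value(new byte[0]));
    indexWriter.addMutation(m);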

      Finally, we perform set intersection operations on the TabletServer via a special iterator called the Intersecting Iterator. Since documents are partitioned into many bins, a search of all documents must search every bin. We can use the BatchScanner to scan all bins in parallel. The Intersecting Iterator should be enabled on a BatchScanner within user query code as follows:

    Text[] terms = {new Text("the"), new Text("white"), new Text("house")};

    BatchScanner bs = conn.createBatchScanner(table, auths, 20);
    IteratorSetting iter = new IteratorSetting(20, "ii", IntersectingIterator.class);
    IntersectingIterator.setColumnFamilies(iter, terms);
    bs.addScanIterator(iter);
    bs.setRanges(Collections.singleton(new Range()));

    for(Entry<Key,Value> entry : bs) {
        System.out.println(" " + entry.getKey().getColumnQualifier());
    }
       