incubator-accumulo-commits mailing list archives

From e..@apache.org
Subject svn commit: r1241136 - /incubator/accumulo/site/trunk/content/accumulo/example/wikisearch.mdtext
Date Mon, 06 Feb 2012 19:59:04 GMT
Author: ecn
Date: Mon Feb  6 19:59:04 2012
New Revision: 1241136

URL: http://svn.apache.org/viewvc?rev=1241136&view=rev
Log:
region server -> tablet server, add cardinality of "the"

Modified:
    incubator/accumulo/site/trunk/content/accumulo/example/wikisearch.mdtext

Modified: incubator/accumulo/site/trunk/content/accumulo/example/wikisearch.mdtext
URL: http://svn.apache.org/viewvc/incubator/accumulo/site/trunk/content/accumulo/example/wikisearch.mdtext?rev=1241136&r1=1241135&r2=1241136&view=diff
==============================================================================
--- incubator/accumulo/site/trunk/content/accumulo/example/wikisearch.mdtext (original)
+++ incubator/accumulo/site/trunk/content/accumulo/example/wikisearch.mdtext Mon Feb  6 19:59:04 2012
@@ -16,6 +16,7 @@ Notice:    Licensed to the Apache Softwa
            specific language governing permissions and limitations
            under the License.
 
+
 Accumulo Query Performance
 --------------------------
 
@@ -26,9 +27,9 @@ Starting with release 1.4, Accumulo incl
 
 The example uses an indexing technique helpful for doing multiple logical tests against content.
 In this case, we can perform a word search on Wikipedia articles.   The sample application takes advantage of 3 unique capabilities of Accumulo:
 
- 1. Extensible iterators that operate within the distributed region servers of the key-value store
+ 1. Extensible iterators that operate within the distributed tablet servers of the key-value store
  2. Custom aggregators which can efficiently condense information during the various life-cycles of the log-structured merge tree 
- 3. Custom load balancing, which ensures that a table is evenly distributed on all region servers
+ 3. Custom load balancing, which ensures that a table is evenly distributed on all tablet servers
 
 In the example, Accumulo tracks the cardinality of all terms as elements are ingested.  If the cardinality is small enough, it will track the set of documents by term directly.  For example:
 
@@ -66,7 +67,7 @@ td {
 </tr>
 </table>
 
-Searches can be optimized to focus on low-cardinality terms.  To create these counts, the example installs “aggregators” which are used to combine inserted values.  The ingester just writes simple  “(Octopus, 1, Document 57)” tuples.  The region servers then used the installed aggregators to merge the cells as the data is re-written, or queried.  This reduces the in-memory locking required to update high-cardinality terms, and defers aggregation to a later time, where it can be done more efficiently.
+Searches can be optimized to focus on low-cardinality terms.  To create these counts, the example installs “aggregators” which are used to combine inserted values.  The ingester just writes simple  “(Octopus, 1, Document 57)” tuples.  The tablet servers then used the installed aggregators to merge the cells as the data is re-written, or queried.  This reduces the in-memory locking required to update high-cardinality terms, and defers aggregation to a later time, where it can be done more efficiently.
 
 The example also creates a reverse word index to map each word to the document in which it appears. But it does this by choosing an arbitrary partition for the document.  The article, and the word index for the article are grouped together into the same partition.  For example:
 
@@ -115,7 +116,7 @@ The example also creates a reverse word 
 
 Of course, there would be large numbers of documents in each partition, and the elements of those documents would be interlaced according to their sort order.
 
-By dividing the index space into partitions, the multi-word searches can be performed in parallel across all the nodes.  Also, by grouping the document together with its index, a document can be retrieved without a second request from the client.  The query “octopus” and “big” will be performed on all the servers, but only those partitions for which the low-cardinality term “octopus” can be found by using the aggregated reverse index information.  The query for a document is performed by extensions provided in the example.  These extensions become part of the region servers iterator stack.  By cloning the underlying iterators, the query extensions can seek to specific words within the index, and when it finds a matching document, it can then seek to the document location and retrieve the contents.
+By dividing the index space into partitions, the multi-word searches can be performed in parallel across all the nodes.  Also, by grouping the document together with its index, a document can be retrieved without a second request from the client.  The query “octopus” and “big” will be performed on all the servers, but only those partitions for which the low-cardinality term “octopus” can be found by using the aggregated reverse index information.  The query for a document is performed by extensions provided in the example.  These extensions become part of the tablet server's iterator stack.  By cloning the underlying iterators, the query extensions can seek to specific words within the index, and when it finds a matching document, it can then seek to the document location and retrieve the contents.
 
 We loaded the example on a  cluster of 10 servers, each with 12 cores, and 32G RAM, 6 500G drives.  Accumulo tablet servers were allowed a maximum of 3G of working memory, of which 2G was dedicated to caching file data.
 
@@ -218,9 +219,9 @@ We performed the following queries, and 
 <td>481,531
 </table>
 
-Because the terms are tested together within the region server, even fairly high-cardinality terms such as “old,” “man,” and “sea” can be tested efficiently, without needing to return to the client, or make distributed calls between servers to perform the intersection between terms.
+Because the terms are tested together within the tablet server, even fairly high-cardinality terms such as “old,” “man,” and “sea” can be tested efficiently, without needing to return to the client, or make distributed calls between servers to perform the intersection between terms.
 
-For reference, here are the cardinalities for all the terms in the query (remember, this is across all languages loaded:
+For reference, here are the cardinalities for all the terms in the query (remember, this is across all languages loaded):
 
 <table>
 <tr> <th>Term <th> Cardinality
@@ -240,6 +241,7 @@ For reference, here are the cardinalitie
 <tr> <td> sea <td> 247,231
 <tr> <td> slashdot <td> 2,343
 <tr> <td> spring <td> 125,605
+<tr> <td> the <td> 3509498
 <tr> <td> three <td> 718,810
 </table>
 


