incubator-accumulo-commits mailing list archives

From build...@apache.org
Subject svn commit: r804019 - /websites/staging/accumulo/trunk/content/accumulo/example/wikisearch.html
Date Mon, 06 Feb 2012 19:59:13 GMT
Author: buildbot
Date: Mon Feb  6 19:59:13 2012
New Revision: 804019

Log:
Staging update by buildbot for accumulo

Modified:
    websites/staging/accumulo/trunk/content/accumulo/example/wikisearch.html

Modified: websites/staging/accumulo/trunk/content/accumulo/example/wikisearch.html
==============================================================================
--- websites/staging/accumulo/trunk/content/accumulo/example/wikisearch.html (original)
+++ websites/staging/accumulo/trunk/content/accumulo/example/wikisearch.html Mon Feb  6 19:59:13 2012
@@ -97,9 +97,9 @@
 <p>Starting with release 1.4, Accumulo includes an example web application that provides
a flexible,  scalable search over the articles of Wikipedia, a widely available medium-sized
corpus.</p>
 <p>The example uses an indexing technique helpful for doing multiple logical tests
against content.  In this case, we can perform a word search on Wikipedia articles.   The
sample application takes advantage of 3 unique capabilities of Accumulo:</p>
 <ol>
-<li>Extensible iterators that operate within the distributed region servers of the
key-value store</li>
+<li>Extensible iterators that operate within the distributed tablet servers of the
key-value store</li>
 <li>Custom aggregators which can efficiently condense information during the various
life-cycles of the log-structured merge tree </li>
-<li>Custom load balancing, which ensures that a table is evenly distributed on all
region servers</li>
+<li>Custom load balancing, which ensures that a table is evenly distributed on all
tablet servers</li>
 </ol>
 <p>In the example, Accumulo tracks the cardinality of all terms as elements are ingested.
 If the cardinality is small enough, it will track the set of documents by term directly.
 For example:</p>
 <style type="text/css">
@@ -136,7 +136,7 @@ td {
 </tr>
 </table>
 
-<p>Searches can be optimized to focus on low-cardinality terms.  To create these counts,
the example installs “aggregators” which are used to combine inserted values.  The
ingester just writes simple  “(Octopus, 1, Document 57)” tuples.  The region servers
then used the installed aggregators to merge the cells as the data is re-written, or queried.
 This reduces the in-memory locking required to update high-cardinality terms, and defers
aggregation to a later time, where it can be done more efficiently.</p>
+<p>Searches can be optimized to focus on low-cardinality terms.  To create these counts,
the example installs “aggregators” which are used to combine inserted values.  The
ingester just writes simple  “(Octopus, 1, Document 57)” tuples.  The tablet servers
then used the installed aggregators to merge the cells as the data is re-written, or queried.
 This reduces the in-memory locking required to update high-cardinality terms, and defers
aggregation to a later time, where it can be done more efficiently.</p>
 <p>The example also creates a reverse word index to map each word to the document in
which it appears. But it does this by choosing an arbitrary partition for the document.  The
article, and the word index for the article are grouped together into the same partition.
 For example:</p>
 <table>
 <tr>
@@ -182,7 +182,7 @@ td {
 </table>
 
 <p>Of course, there would be large numbers of documents in each partition, and the
elements of those documents would be interlaced according to their sort order.</p>
-<p>By dividing the index space into partitions, the multi-word searches can be performed
in parallel across all the nodes.  Also, by grouping the document together with its index,
a document can be retrieved without a second request from the client.  The query “octopus”
and “big” will be performed on all the servers, but only those partitions for which
the low-cardinality term “octopus” can be found by using the aggregated reverse
index information.  The query for a document is performed by extensions provided in the example.
 These extensions become part of the region servers iterator stack.  By cloning the underlying
iterators, the query extensions can seek to specific words within the index, and when it finds
a matching document, it can then seek to the document location and retrieve the contents.</p>
+<p>By dividing the index space into partitions, the multi-word searches can be performed
in parallel across all the nodes.  Also, by grouping the document together with its index,
a document can be retrieved without a second request from the client.  The query “octopus”
and “big” will be performed on all the servers, but only those partitions for which
the low-cardinality term “octopus” can be found by using the aggregated reverse
index information.  The query for a document is performed by extensions provided in the example.
 These extensions become part of the tablet server's iterator stack.  By cloning the underlying
iterators, the query extensions can seek to specific words within the index, and when it finds
a matching document, it can then seek to the document location and retrieve the contents.</p>
 <p>We loaded the example on a  cluster of 10 servers, each with 12 cores, and 32G RAM,
6 500G drives.  Accumulo tablet servers were allowed a maximum of 3G of working memory, of
which 2G was dedicated to caching file data.</p>
 <p>Following the instructions in the example, the Wikipedia XML data for articles was
loaded for English, Spanish and German languages into 10 partitions.  The data is not partitioned
by language: multiple languages were used to get a larger set of test data.  The data load
took around 8 hours, and has not been optimized for scale.  Once the data was loaded, the
content was compacted which took about 35 minutes.</p>
 <p>The example uses the language-specific tokenizers available from the Apache Lucene
project for Wikipedia data.</p>
@@ -279,8 +279,8 @@ td {
 <td>481,531
 </table>
 
-<p>Because the terms are tested together within the region server, even fairly high-cardinality
terms such as “old,” “man,” and “sea” can be tested efficiently,
without needing to return to the client, or make distributed calls between servers to perform
the intersection between terms.</p>
-<p>For reference, here are the cardinalities for all the terms in the query (remember,
this is across all languages loaded:</p>
+<p>Because the terms are tested together within the tablet server, even fairly high-cardinality
terms such as “old,” “man,” and “sea” can be tested efficiently,
without needing to return to the client, or make distributed calls between servers to perform
the intersection between terms.</p>
+<p>For reference, here are the cardinalities for all the terms in the query (remember,
this is across all languages loaded):</p>
 <table>
 <tr> <th>Term <th> Cardinality
 <tr> <td> ducky <td> 795
@@ -299,6 +299,7 @@ td {
 <tr> <td> sea <td> 247,231
 <tr> <td> slashdot <td> 2,343
 <tr> <td> spring <td> 125,605
+<tr> <td> the <td> 3509498
 <tr> <td> three <td> 718,810
 </table>
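
For readers following the page text in the diff above, here is a minimal sketch of the two ideas it describes: the ingester writes plain (term, 1) cells that are summed later, and each partition holds both the word index and the documents so a multi-word query can be intersected locally. This is not the wikisearch code and does not use the Accumulo API. All names (ingest, aggregate, search), the in-memory dictionaries, and the modulo partitioning are assumptions made for illustration only.

```python
from collections import defaultdict


def ingest(documents, num_partitions):
    """Index doc_id -> text. Returns (count_cells, partitions).

    count_cells holds un-merged (term, 1) cells, as a simple ingester
    might write them. Each partition keeps a word index and the
    documents themselves side by side.
    """
    count_cells = []
    partitions = [{"index": defaultdict(set), "docs": {}}
                  for _ in range(num_partitions)]
    for doc_id, text in documents.items():
        # Hypothetical partition choice; the page only says "arbitrary".
        part = partitions[doc_id % num_partitions]
        part["docs"][doc_id] = text
        for term in set(text.lower().split()):
            part["index"][term].add(doc_id)
            count_cells.append((term, 1))
    return count_cells, partitions


def aggregate(count_cells):
    """Sum the cells per term; stands in for the installed aggregator."""
    totals = defaultdict(int)
    for term, count in count_cells:
        totals[term] += count
    return dict(totals)


def search(partitions, cardinality, terms):
    """Intersect terms inside each partition, rarest term first."""
    ordered = sorted(terms, key=lambda t: cardinality.get(t, 0))
    hits = {}
    if not ordered:
        return hits
    for part in partitions:
        # Start from the lowest-cardinality term to keep the set small.
        candidates = set(part["index"].get(ordered[0], ()))
        for term in ordered[1:]:
            candidates &= part["index"].get(term, set())
        # The document lives in the same partition, so no second lookup
        # elsewhere is needed to return its contents.
        for doc_id in candidates:
            hits[doc_id] = part["docs"][doc_id]
    return hits


if __name__ == "__main__":
    docs = {57: "big octopus", 58: "big dog", 59: "octopus garden big"}
    cells, parts = ingest(docs, num_partitions=3)
    counts = aggregate(cells)
    print(sorted(search(parts, counts, ["octopus", "big"])))
```

In the sketch the summing happens in one pass after ingest; in the page's description the tablet servers merge the cells as data is re-written or queried, which this toy version does not model.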
 


