accumulo-user mailing list archives

From Christopher
Subject Re: scanner question in regards to columns loaded
Date Sat, 25 Jan 2014 04:34:45 GMT
It's not quite clear what you mean by "load", but I think you mean
"iterate over"?

A simplified explanation is this:

When you scan an Accumulo table, you are streaming each entry
(Key/Value pair), one at a time, through your client code. Entries are
held in memory only if your client code keeps them. A row in Accumulo
is the set of entries that share a particular value of the Row portion
of the Key. They are logically grouped, but are not grouped in memory
unless your code does that.
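
To make the streaming point concrete, here is a rough sketch in plain
Java. It is a toy model, not the Accumulo client API: a sorted map
stands in for the table, its iterator stands in for the scanner, and
the "row, family, qualifier" key layout and all names (scanRow,
forEachPage, and so on) are made up for illustration. The only thing
it is meant to show is that the consumer decides how many entries are
held at once, here at most N qualifiers per page.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Toy model only: a sorted in-memory map stands in for a table and its
// iterator stands in for a scanner. None of this is the Accumulo API.
class StreamingScanSketch {

    // Hypothetical key layout "row\0family\0qualifier". Sorted order keeps
    // all entries of one row next to each other.
    static String key(String row, String family, String qualifier) {
        return row + '\0' + family + '\0' + qualifier;
    }

    // Stand-in for scanning one row: a lazy iterator over that row's entries.
    static Iterator<Map.Entry<String, String>> scanRow(TreeMap<String, String> table, String row) {
        return table.subMap(row + '\0', row + '\u0001').entrySet().iterator();
    }

    // Consume the stream one entry at a time, buffering at most pageSize
    // qualifiers before handing them to the caller.
    static void forEachPage(Iterator<Map.Entry<String, String>> scan, int pageSize,
                            Consumer<List<String>> handler) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be at least 1");
        }
        List<String> page = new ArrayList<>(pageSize);
        while (scan.hasNext()) {
            String k = scan.next().getKey();
            page.add(k.substring(k.lastIndexOf('\0') + 1)); // keep only the qualifier
            if (page.size() == pageSize) {
                handler.accept(page);
                page = new ArrayList<>(pageSize); // start a fresh buffer
            }
        }
        if (!page.isEmpty()) {
            handler.accept(page); // final, possibly short, page
        }
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put(key("apple", "docs", "d" + i), "");
        }
        table.put(key("banana", "docs", "d9"), "");

        // Only two qualifiers of the "apple" row are buffered at any time.
        forEachPage(scanRow(table, "apple"), 2, page -> System.out.println(page));
    }
}
```

With the real client the shape is similar: a scanner is iterated entry
by entry, and anything beyond the current entry stays in memory only
if your loop stores it.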

One additional note regarding your index schema of a row being a
search term and columns being documents: you will likely have issues
with this strategy as the number of documents for high-frequency
terms grows, because tablets do not split in the middle of a row. With
your schema, a row could get too large to manage on a single tablet
server. A slight variation, like concatenating the search term with a
document identifier in the row (term=doc1, term=doc2, ...), would
allow the high-frequency terms to split into multiple tablets if they
get too large. There are better strategies, but that's just one simple
example.
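
As a rough illustration of the term=doc row layout, here is a small
plain-Java sketch. It does not use the Accumulo API: a sorted set of
row IDs stands in for the table's row ordering, and the names (rowId,
rowsForTerm, docIdOf) are hypothetical. It assumes the term itself
never contains the '=' separator. The point is that each document
becomes its own row, so a split can fall between any two of them,
while all rows for one term still sit in one contiguous sorted range.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: a sorted set of row IDs stands in for the sorted row
// space of a table. None of this is the Accumulo API.
class TermDocRowSketch {

    // Separator taken from the "term=doc1" example. Assumption: terms
    // never contain this character.
    static final char SEP = '=';

    // One row per (term, document) pair instead of one row per term.
    static String rowId(String term, String docId) {
        return term + SEP + docId;
    }

    // All rows for a term form the contiguous range ["term=", "term>"),
    // because '>' is the character right after '='.
    static SortedSet<String> rowsForTerm(TreeSet<String> rows, String term) {
        return rows.subSet(term + SEP, term + (char) (SEP + 1));
    }

    // Recover the document ID from a row ID (everything after the first SEP).
    static String docIdOf(String rowId) {
        return rowId.substring(rowId.indexOf(SEP) + 1);
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>();
        rows.add(rowId("the", "doc1"));
        rows.add(rowId("the", "doc2"));
        rows.add(rowId("the", "doc3"));
        rows.add(rowId("theory", "doc9")); // different term sharing a prefix

        // Three separate rows for "the"; "theory=doc9" is outside the range.
        for (String row : rowsForTerm(rows, "the")) {
            System.out.println(row + " -> " + docIdOf(row));
        }
    }
}
```

The trade-off is that looking up a term becomes a scan over a range of
rows rather than a read of a single row.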

Christopher L Tubbs II

On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson wrote:
> If I have a row whose key is a particular term, and a set of columns
> that stores the documents the term appears in, then if I load the row, are
> the contents of all of the columns also loaded?  Is there a way to page over
> the columns such that only N columns are in memory at any point?  In this
> particular case the documents are all in a particular column family (say
> docs) and the column qualifier is created dynamically; for argument's sake we
> can say they are UUIDs.
