accumulo-user mailing list archives

From Christopher <>
Subject Re: scanner question in regards to columns loaded
Date Sun, 26 Jan 2014 06:32:46 GMT
If you use the Range constructor that takes two arguments, then yes,
you'd get two entries. However, "count" would come before "doc_id",
because the qualifier is part of the Key, and therefore part of the
sort order. There's also a Range constructor that allows you to
specify whether you want the startKey and endKey to be inclusive or
exclusive.
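Because the qualifier participates in the Key's sort order, entries within a row come back qualifier-sorted. Here is a minimal, Accumulo-free sketch of that ordering, using only the JDK (a TreeMap stands in for Accumulo's sorted key space; the row, family, and qualifier names are taken from the example table below):

```java
import java.util.Map;
import java.util.TreeMap;

public class QualifierSortDemo {
    public static void main(String[] args) {
        // Simulate Accumulo's sorted key order with a TreeMap keyed on
        // row + "\0" + family + "\0" + qualifier. Both entries share the
        // row "term" and the family "Occurrence~1".
        TreeMap<String, String> table = new TreeMap<>();
        table.put("term\0Occurrence~1\0doc_id", "1");
        table.put("term\0Occurrence~1\0count", "10");

        // Iteration yields "count" before "doc_id": the qualifier is part
        // of the key, so it participates in the lexicographic sort.
        for (Map.Entry<String, String> e : table.entrySet()) {
            String qualifier =
                e.getKey().substring(e.getKey().lastIndexOf('\0') + 1);
            System.out.println(qualifier + " -> " + e.getValue());
        }
        // prints:
        // count -> 10
        // doc_id -> 1
    }
}
```

In a real scanner, the same ordering applies: the Key/Value stream within a single row is sorted by family, then qualifier.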

I don't know of a specific document that outlines various strategies
that I can link to. Perhaps I'll put one together, when I get some
spare time, if nobody else does. I think most people do a lot of
experimentation to figure out which strategies work best.

I'm not entirely sure what you mean about "getting an iterator over
all terms without duplicates". I'm assuming you don't mean duplicate
versions of a single entry; that is handled by the VersioningIterator,
which should be on new tables by default, set to retain only the most
recent version, to support updates. With the scheme I suggested, your
table would look something like the following:

RowID               ColumnFamily    Column Qualifier    Value
<term1>=<doc_id1>   index           count               10
<term1>=<doc_id2>   index           count               5
<term2>=<doc_id3>   index           count               3
<term3>=<doc_id1>   index           count               12

With this scheme, you'd have only a single entry (a count) for each
row, and a single row for each term/document combination, so you
wouldn't have any duplicate counts for any given term/document. If
that's what you mean by duplicates...
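To enumerate all distinct terms under this scheme, it is enough to split each row ID on '=' and skip repeats: because rows are sorted, all entries for one term are contiguous. A JDK-only sketch of that idea (a TreeMap stands in for the Scanner's sorted stream; in real client code you would iterate the Scanner's entries instead):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeMap;

public class DistinctTermsDemo {
    public static void main(String[] args) {
        // Simulated table rows using the term=doc_id scheme; an Accumulo
        // Scanner would stream these entries in the same sorted order.
        TreeMap<String, Integer> rows = new TreeMap<>();
        rows.put("term1=doc_id1", 10);
        rows.put("term1=doc_id2", 5);
        rows.put("term2=doc_id3", 3);
        rows.put("term3=doc_id1", 12);

        // Rows sort by term first, so each term appears as a contiguous
        // run; splitting on '=' and collecting into a set yields every
        // distinct term exactly once, in sorted order.
        Set<String> terms = new LinkedHashSet<>();
        for (String row : rows.keySet()) {
            terms.add(row.substring(0, row.indexOf('=')));
        }
        System.out.println(terms); // [term1, term2, term3]
    }
}
```

For large tables, one could also avoid streaming every row to the client by re-seeking the scanner just past the current term after each new term is found; the sketch above just shows the basic split-and-dedupe logic.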

Christopher L Tubbs II

On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <> wrote:
> Thanks for the reply Chris.  Say I had the following
> RowID     ColumnFamily     Column Qualifier     Value
> term      Occurrence~1     doc_id               1
> term      Occurrence~1     count                10
> term2     Occurrence~2     doc_id               2
> term2     Occurrence~2     count                1
> creating a scanner with start key new Key(new Text("term"), new
> Text("Occurrence~1")) and end key new Key(new Text("term"), new
> Text("Occurrence~1")) I would get an iterator with two entries, the first
> key would be doc_id and the second would be count.  Is that accurate?
> In regards to the other strategies, is there anywhere that some of these are
> captured?  Also, in your example, how would you go about getting an
> iterator over all terms without duplicates?  Again, thanks.
> On Fri, Jan 24, 2014 at 11:34 PM, Christopher <> wrote:
>> It's not quite clear what you mean by "load", but I think you mean
>> "iterate over"?
>> A simplified explanation is this:
>> When you scan an Accumulo table, you are streaming each entry
>> (Key/Value pair), one at a time, through your client code. They are
>> only held in memory if you do that yourself in your client code. A row
>> in Accumulo is the set of entries that share a particular value of the
>> Row portion of the Key. They are logically grouped, but are not
>> grouped in memory unless you do that.
>> One additional note is regarding your index schema of a row being a
>> search term and columns being documents. You will likely have issues
>> with this strategy, as the number of documents for high frequency
>> terms grows, because tablets do not split in the middle of a row. With
>> your schema, a row could get too large to manage on a single tablet
>> server. A slight variation, like concatenating the search term with a
>> document identifier in the row (term=doc1, term=doc2, ....) would
>> allow the high frequency terms to split into multiple tablets if they
>> get too large. There are better strategies, but that's just one simple
>> option.
>> --
>> Christopher L Tubbs II
>> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <> wrote:
>> > If I have a row whose key is a particular term and a set of columns
>> > that stores the documents that the term appears in, and I load the
>> > row, are the contents of all of the columns also loaded?  Is there a
>> > way to page over the columns such that only N columns are in memory
>> > at any point?  In this particular case the documents are all in a
>> > particular column family (say docs) and the column qualifier is
>> > created dynamically; for argument's sake we can say they are UUIDs.
