accumulo-user mailing list archives

From Jamie Johnson <jej2...@gmail.com>
Subject Re: scanner question in regards to columns loaded
Date Sun, 26 Jan 2014 12:55:51 GMT
I mean, if a user asked for all terms that start with "term", is there a
way to get term1 and term2 just once while scanning, or would I get each
twice, once for each doc_id, and need to filter client side?
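
For illustration only, here is a minimal sketch of the client-side filtering this question describes. It does not use the Accumulo API: a sorted set of strings stands in for the sorted row IDs a scanner would return, and the class name, method name, and the "<term>=<doc_id>" row layout with "=" as the separator are assumptions taken from the scheme suggested in the quoted reply. Because rows arrive in sorted order, all rows for one term are adjacent, so remembering only the previous term is enough to drop repeats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class DistinctTerms {

    // Collapse sorted "<term>=<doc_id>" row IDs into the distinct terms that
    // start with the given prefix. Assumes terms never contain '='.
    static List<String> distinctTerms(SortedSet<String> rowIds, String prefix) {
        List<String> terms = new ArrayList<>();
        String last = null;
        // tailSet(prefix) starts at the first row >= prefix, much like seeking
        // a scanner to the start of a range.
        for (String rowId : rowIds.tailSet(prefix)) {
            if (!rowId.startsWith(prefix)) {
                break; // sorted input: nothing after this can match the prefix
            }
            int sep = rowId.indexOf('=');
            String term = sep < 0 ? rowId : rowId.substring(0, sep);
            if (!term.equals(last)) { // rows for one term are adjacent
                terms.add(term);
                last = term;
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical row IDs, modeled on the table in the quoted reply.
        SortedSet<String> rows = new TreeSet<>(Arrays.asList(
                "term1=doc_id1", "term1=doc_id2", "term2=doc_id3",
                "term3=doc_id1", "other=doc_id9"));
        System.out.println(distinctTerms(rows, "term")); // prints [term1, term2, term3]
    }
}
```

With a real scanner the same loop would read the row from each returned key instead of from a string set; only one previous term is held in memory at a time.
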
On Jan 26, 2014 1:33 AM, "Christopher" <ctubbsii@apache.org> wrote:

> If you use the Range constructor that takes two arguments, then yes,
> you'd get two entries. However, "count" would come before "doc_id",
> because the qualifier is part of the Key and, therefore, part
> of the sort order. There's also a Range constructor that allows you to
> specify whether you want the startKey and endKey to be inclusive or
> exclusive.
>
> I don't know of a specific document that outlines various strategies
> that I can link to. Perhaps I'll put one together, when I get some
> spare time, if nobody else does. I think most people do a lot of
> experimentation to figure out which strategies work best.
>
> I'm not entirely sure what you mean about "getting an iterator over
> all terms without duplicates". I'm assuming you don't mean duplicate
> versions of a single entry, which is handled by the
> VersioningIterator, which should be on new tables by default, and set
> to retain only the most recent version, to support updates. With the scheme I
> suggested, your table would look something like the following,
> instead:
>
> RowID                ColumnFamily     Column Qualifier     Value
> <term1>=<doc_id1>    index            count                10
> <term1>=<doc_id2>    index            count                5
> <term2>=<doc_id3>    index            count                3
> <term3>=<doc_id1>    index            count                12
>
> With this scheme, you'd have only a single entry (a count) for each
> row, and a single row for each term/document combination, so you
> wouldn't have any duplicate counts for any given term/document. If
> that's what you mean by duplicates...
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <jej2003@gmail.com> wrote:
> > Thanks for the reply Chris.  Say I had the following
> >
> > RowID     ColumnFamily     Column Qualifier     Value
> > term      Occurrence~1     doc_id               1
> > term      Occurrence~1     count                10
> > term2     Occurrence~2     doc_id               2
> > term2     Occurrence~2     count                1
> >
> > creating a scanner with start key new Key(new Text("term"), new
> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
> > Text("Occurrence~1")) I would get an iterator with two entries, the first
> > key would be doc_id and the second would be count.  Is that accurate?
> >
> > In regards to the other strategies, is there anywhere that some of these
> > are captured?  Also, in your example, how would you go about getting an
> > iterator over all terms without duplicates?  Again, thanks.
> >
> >
> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ctubbsii@apache.org> wrote:
> >>
> >> It's not quite clear what you mean by "load", but I think you mean
> >> "iterate over"?
> >>
> >> A simplified explanation is this:
> >>
> >> When you scan an Accumulo table, you are streaming each entry
> >> (Key/Value pair), one at a time, through your client code. They are
> >> only held in memory if you do that yourself in your client code. A row
> >> in Accumulo is the set of entries that share a particular value of the
> >> Row portion of the Key. They are logically grouped, but are not
> >> grouped in memory unless you do that.
> >>
> >> One additional note is regarding your index schema of a row being a
> >> search term and columns being documents. You will likely have issues
> >> with this strategy, as the number of documents for high frequency
> >> terms grows, because tablets do not split in the middle of a row. With
> >> your schema, a row could get too large to manage on a single tablet
> >> server. A slight variation, like concatenating the search term with a
> >> document identifier in the row (term=doc1, term=doc2, ....) would
> >> allow the high frequency terms to split into multiple tablets if they
> >> get too large. There are better strategies, but that's just one simple
> >> option.
> >>
> >>
> >> --
> >> Christopher L Tubbs II
> >> http://gravatar.com/ctubbsii
> >>
> >>
> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >> > If I have a row whose key is a particular term, and a set of columns
> >> > that stores the documents that the term appears in, then if I load
> >> > the row, are the contents of all of the columns also loaded?  Is
> >> > there a way to page over the columns such that only N columns are in
> >> > memory at any point?  In this particular case the documents are all
> >> > in a particular column family (say docs) and the column qualifier is
> >> > created dynamically; for argument's sake we can say they are UUIDs.
> >
> >
>
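
For illustration only, the ordering point in Christopher's quoted reply (the qualifier is part of the Key, so "count" is returned before "doc_id") can be sketched without the Accumulo API. The SimpleKey class, the compare method, and the sample data below are hypothetical stand-ins: the sketch models only the row, column family, and column qualifier portions of a key, and it compares Java strings, which gives the same order as byte-wise comparison for the ASCII values used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class KeyOrderSketch {

    // Hypothetical stand-in for the row/family/qualifier part of a key.
    static final class SimpleKey {
        final String row;
        final String family;
        final String qualifier;

        SimpleKey(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Sort by row, then column family, then column qualifier.
    static int compare(SimpleKey a, SimpleKey b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) {
            return c;
        }
        c = a.family.compareTo(b.family);
        if (c != 0) {
            return c;
        }
        return a.qualifier.compareTo(b.qualifier);
    }

    // Sample data modeled on the table in the quoted message; entries are
    // inserted with doc_id first to show that insertion order is irrelevant.
    static TreeMap<SimpleKey, String> sampleTable() {
        TreeMap<SimpleKey, String> table = new TreeMap<>(KeyOrderSketch::compare);
        table.put(new SimpleKey("term", "Occurrence~1", "doc_id"), "1");
        table.put(new SimpleKey("term", "Occurrence~1", "count"), "10");
        table.put(new SimpleKey("term2", "Occurrence~2", "doc_id"), "2");
        table.put(new SimpleKey("term2", "Occurrence~2", "count"), "1");
        return table;
    }

    // Return the qualifiers for one row and family, in sorted key order.
    static List<String> qualifiersFor(TreeMap<SimpleKey, String> table,
                                      String row, String family) {
        List<String> out = new ArrayList<>();
        for (SimpleKey k : table.keySet()) {
            if (k.row.equals(row) && k.family.equals(family)) {
                out.add(k.qualifier);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(qualifiersFor(sampleTable(), "term", "Occurrence~1"));
        // prints [count, doc_id]
    }
}
```

The sketch holds the whole table in memory only to keep the example small; as the thread notes, a real scan streams entries one at a time.
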
