accumulo-user mailing list archives

From: William Slacum <wilhelm.von.cl...@accumulo.net>
Subject: Re: scanner question in regards to columns loaded
Date: Mon, 27 Jan 2014 02:57:42 GMT
Filters (and more generally, iterators) are executed on the server. There
is an option to run them client side. See
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/client/ClientSideIteratorScanner.html
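
For example, here is a rough, untested sketch of wrapping a scanner so that
added iterators run in the client JVM instead of on the tablet servers
(assumes an existing Connector named "conn" and a table named "mytable"):

    import org.apache.accumulo.core.client.ClientSideIteratorScanner;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.iterators.user.RegExFilter;
    import org.apache.accumulo.core.security.Authorizations;

    Scanner serverSide = conn.createScanner("mytable", new Authorizations());
    // Wrap the scanner; iterators added to the wrapper run client side.
    ClientSideIteratorScanner clientSide = new ClientSideIteratorScanner(serverSide);
    IteratorSetting regex = new IteratorSetting(30, "cfFilter", RegExFilter.class);
    // Keep only keys whose column family matches "family1".
    RegExFilter.setRegexs(regex, null, "family1", null, null, false);
    clientSide.addScanIterator(regex);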

Using fetchColumnFamily will return only the keys that have the specified
column family, not whole rows.

If I have a few keys in a table:

row1 family1: qualifier1
row1 family2: qualifier2
row2 family1: qualifier1

Let's say I call `scanner.fetchColumnFamily("family1")`. My scanner will
return:

row1 family1: qualifier1
row2 family1: qualifier1

Now let's say I want to do a scan, but call
`scanner.fetchColumnFamily("family2")`. My scanner will return:

row1 family2: qualifier2
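
In the Java API, fetchColumnFamily takes a Text rather than a String, so the
call above would look roughly like this (untested sketch; assumes a Connector
named "conn"):

    import java.util.Map.Entry;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    Scanner scanner = conn.createScanner("mytable", new Authorizations());
    scanner.fetchColumnFamily(new Text("family1"));  // only family1 keys come back
    for (Entry<Key, Value> entry : scanner) {
        System.out.println(entry.getKey());
    }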

If you want whole rows that contain specific column families, then I
believe you'd have to write a custom iterator using the RowFilter
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/core/iterators/user/RowFilter.html
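
A rough, untested sketch of what such a subclass might look like (the class
name and the hard-coded family are made up for illustration; a real version
would read the family from iterator options):

    import java.io.IOException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
    import org.apache.accumulo.core.iterators.user.RowFilter;
    import org.apache.hadoop.io.Text;

    public class RequireFamilyRowFilter extends RowFilter {
      private static final Text REQUIRED = new Text("family2");

      @Override
      public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
          throws IOException {
        // Keep the whole row only if at least one key in it has the family.
        while (rowIterator.hasTop()) {
          if (rowIterator.getTopKey().getColumnFamily().equals(REQUIRED))
            return true;
          rowIterator.next();
        }
        return false;
      }
    }

You would then attach it to the scan with scanner.addScanIterator(new
IteratorSetting(...)) once the class is on the tablet servers' classpath.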


On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <jej2003@gmail.com> wrote:

> After a little reading...if I use fetchColumnFamily does that skip any
> rows that do not have the column family?
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <jej2003@gmail.com> wrote:
>
>> Thanks for the ideas.  Filters are client side right?
>>
>> I need to read the documentation more as I don't know how to just query a
>> column family.  Would it be possible to get all terms that start with a
>> particular value?  I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>> On Jan 26, 2014 5:44 PM, "Christopher" <ctubbsii@apache.org> wrote:
>>
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than a per-term/document pair). The design of
>>> this kind of table, though, seems focused on finding documents which
>>> contain the given terms, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve it by putting a special entry in its own row for
>>> each term, in addition to the per-term/document entries, as in:
>>>
>>> RowID                 ColumnFamily   Column Qualifier   Value
>>> <term1>               term           -                  -
>>> <term1>=<doc_id2>     index          count              5
>>>
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
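>>>
>>> A rough sketch of that setup (untested; "conn", the table name
>>> "termindex", and the group names are placeholders): separate locality
>>> groups for the two families, plus a SummingCombiner so the special
>>> per-term entry becomes an aggregated corpus-wide count:
>>>
>>>     import java.util.*;
>>>     import org.apache.accumulo.core.client.IteratorSetting;
>>>     import org.apache.accumulo.core.iterators.LongCombiner;
>>>     import org.apache.accumulo.core.iterators.user.SummingCombiner;
>>>     import org.apache.hadoop.io.Text;
>>>
>>>     Map<String, Set<Text>> groups = new HashMap<String, Set<Text>>();
>>>     groups.put("terms", Collections.singleton(new Text("term")));
>>>     groups.put("postings", Collections.singleton(new Text("index")));
>>>     conn.tableOperations().setLocalityGroups("termindex", groups);
>>>
>>>     IteratorSetting sum = new IteratorSetting(10, "sumTermCounts", SummingCombiner.class);
>>>     // Sum everything written under the "term" family, so repeated writes
>>>     // (one per document) collapse into a single corpus-wide count.
>>>     SummingCombiner.setColumns(sum,
>>>         Collections.singletonList(new IteratorSetting.Column("term")));
>>>     SummingCombiner.setEncodingType(sum, LongCombiner.Type.STRING);
>>>     conn.tableOperations().attachIterator("termindex", sum);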
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <jej2003@gmail.com>
>>> wrote:
>>> > I mean if a user asked for all terms that started with "term", is there
>>> > a way to get term1 and term2 just once while scanning, or would I get
>>> > each twice, once for each docid, and need to filter client side?
>>> >
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <ctubbsii@apache.org> wrote:
>>> >>
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> because the qualifier is part of the Key, and therefore part of the
>>> >> sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
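>>> >>
>>> >> For example, a rough sketch of both forms (untested; "scanner" is
>>> >> assumed to exist, and the end key here is just an illustrative upper
>>> >> bound, not something from your table):
>>> >>
>>> >>     import org.apache.accumulo.core.data.Key;
>>> >>     import org.apache.accumulo.core.data.Range;
>>> >>     import org.apache.hadoop.io.Text;
>>> >>
>>> >>     Key start = new Key(new Text("term"), new Text("Occurrence~1"));
>>> >>     Key end   = new Key(new Text("term"), new Text("Occurrence~2"));
>>> >>     // Two-argument form: both endpoints are inclusive.
>>> >>     Range both = new Range(start, end);
>>> >>     // Four-argument form: choose inclusivity (start inclusive, end exclusive).
>>> >>     Range half = new Range(start, true, end, false);
>>> >>     scanner.setRange(both);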
>>> >>
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of
>>> >> experimentation to figure out which strategies work best.
>>> >>
>>> >> I'm not entirely sure what you mean about "getting an iterator over
>>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>>> >> versions of a single entry, which is handled by the
>>> >> VersioningIterator, which should be on new tables by default, and set
>>> >> to retain only the most recent version, to support updates. With the
>>> >> scheme I suggested, your table would look something like the following,
>>> >> instead:
>>> >>
>>> >> RowID                 ColumnFamily   Column Qualifier   Value
>>> >> <term1>=<doc_id1>     index          count              10
>>> >> <term1>=<doc_id2>     index          count              5
>>> >> <term2>=<doc_id3>     index          count              3
>>> >> <term3>=<doc_id1>     index          count              12
>>> >>
>>> >> With this scheme, you'd have only a single entry (a count) for each
>>> >> row, and a single row for each term/document combination, so you
>>> >> wouldn't have any duplicate counts for any given term/document. If
>>> >> that's what you mean by duplicates...
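>>> >>
>>> >> For example, to pull back all of the counts for term1 under that
>>> >> layout you could scan a row prefix, roughly like this (untested;
>>> >> assumes a Connector "conn" and a table named "termindex"):
>>> >>
>>> >>     import java.util.Map.Entry;
>>> >>     import org.apache.accumulo.core.client.Scanner;
>>> >>     import org.apache.accumulo.core.data.*;
>>> >>     import org.apache.accumulo.core.security.Authorizations;
>>> >>     import org.apache.hadoop.io.Text;
>>> >>
>>> >>     Scanner s = conn.createScanner("termindex", new Authorizations());
>>> >>     // Rows from "term1=" up to a crude upper bound; assumes doc ids
>>> >>     // don't sort above '~'.
>>> >>     s.setRange(new Range("term1=", "term1=~"));
>>> >>     s.fetchColumnFamily(new Text("index"));
>>> >>     for (Entry<Key, Value> e : s) {
>>> >>         String row = e.getKey().getRow().toString();
>>> >>         String docId = row.substring(row.indexOf('=') + 1);
>>> >>         System.out.println(docId + " -> " + e.getValue());
>>> >>     }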
>>> >>
>>> >>
>>> >> --
>>> >> Christopher L Tubbs II
>>> >> http://gravatar.com/ctubbsii
>>> >>
>>> >>
>>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <jej2003@gmail.com>
>>> wrote:
>>> >> > Thanks for the reply Chris.  Say I had the following
>>> >> >
>>> >> > RowID     ColumnFamily     Column Qualifier     Value
>>> >> > term      Occurrence~1     doc_id               1
>>> >> > term      Occurrence~1     count                10
>>> >> > term2     Occurrence~2     doc_id               2
>>> >> > term2     Occurrence~2     count                1
>>> >> >
>>> >> > creating a scanner with start key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) I would get an iterator with two entries, the
>>> >> > first key would be doc_id and the second would be count.  Is that
>>> >> > accurate?
>>> >> >
>>> >> > In regards to the other strategies, is there anywhere that some of
>>> >> > these are captured?  Also, in your example, how would you go about
>>> >> > getting an iterator over all terms without duplicates?  Again, thanks
>>> >> >
>>> >> >
>>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <ctubbsii@apache.org>
>>> >> > wrote:
>>> >> >>
>>> >> >> It's not quite clear what you mean by "load", but I think you mean
>>> >> >> "iterate over"?
>>> >> >>
>>> >> >> A simplified explanation is this:
>>> >> >>
>>> >> >> When you scan an Accumulo table, you are streaming each entry
>>> >> >> (Key/Value pair), one at a time, through your client code. They are
>>> >> >> only held in memory if you do that yourself in your client code. A row
>>> >> >> in Accumulo is the set of entries that share a particular value of the
>>> >> >> Row portion of the Key. They are logically grouped, but are not
>>> >> >> grouped in memory unless you do that.
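>>> >> >>
>>> >> >> Concretely, the client-side loop is just something like this
>>> >> >> (untested sketch; "conn" and handleEntry() are placeholders):
>>> >> >>
>>> >> >>     Scanner scanner = conn.createScanner("mytable", new Authorizations());
>>> >> >>     for (Map.Entry<Key, Value> entry : scanner) {
>>> >> >>         // Entries stream through one at a time; nothing accumulates
>>> >> >>         // in memory unless you store it yourself.
>>> >> >>         handleEntry(entry.getKey(), entry.getValue());
>>> >> >>     }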
>>> >> >>
>>> >> >> One additional note is regarding your index schema of a row being a
>>> >> >> search term and columns being documents. You will likely have issues
>>> >> >> with this strategy, as the number of documents for high frequency
>>> >> >> terms grows, because tablets do not split in the middle of a row. With
>>> >> >> your schema, a row could get too large to manage on a single tablet
>>> >> >> server. A slight variation, like concatenating the search term with a
>>> >> >> document identifier in the row (term=doc1, term=doc2, ....) would
>>> >> >> allow the high frequency terms to split into multiple tablets if they
>>> >> >> get too large. There are better strategies, but that's just one simple
>>> >> >> option.
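>>> >> >>
>>> >> >> For instance, ingest with the concatenated row might look roughly
>>> >> >> like this (untested; "conn", the table name, and the values are
>>> >> >> made up):
>>> >> >>
>>> >> >>     BatchWriter writer = conn.createBatchWriter("termindex", 1000000L, 60000L, 2);
>>> >> >>     // Term and document id joined in the row, so big terms can split.
>>> >> >>     Mutation m = new Mutation(new Text("term=doc1"));
>>> >> >>     m.put(new Text("index"), new Text("count"), new Value("10".getBytes()));
>>> >> >>     writer.addMutation(m);
>>> >> >>     writer.close();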
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Christopher L Tubbs II
>>> >> >> http://gravatar.com/ctubbsii
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <jej2003@gmail.com>
>>> >> >> wrote:
>>> >> >> > If I have a row where the key is a particular term and a set of
>>> >> >> > columns that stores the documents that the term appears in, and I
>>> >> >> > load the row, are the contents of all of the columns also loaded?
>>> >> >> > Is there a way to page over the columns such that only N columns
>>> >> >> > are in memory at any point?  In this particular case the documents
>>> >> >> > are all in a particular column family (say docs) and the column
>>> >> >> > qualifier is created dynamically; for argument's sake we can say
>>> >> >> > they are UUIDs.
>>> >> >
>>> >> >
>>>
>>
