accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@cs.washington.edu>
Subject Re: Making a RowCounterIterator
Date Sat, 16 Jul 2016 00:17:36 GMT
Hi Mario,

As you gain more experience with Accumulo, feel free to write or modify
Accumulo's documentation in the places you find it lacking and send a PR.
If you find a topic confusing, probably many others do too.

Cheers, Dylan

On Fri, Jul 15, 2016 at 4:04 PM, Christopher <ctubbsii@apache.org> wrote:

> Ah, I thought you were doing WholeRowIterator -> RowCounterIterator.
> I now understand you're doing WholeRowIterator -> SomeCustomFilter (column
> predicate) -> RowCounterIterator.
>
> That's okay to do, but it may be better to have an iterator that creates a
> clone of its source at the beginning of each row, advances to do the
> filtering, and then informs the spawning iterator to either accept or
> reject. This is, admittedly, far more complicated than WholeRowIterator,
> but it can be safer if you have really big rows which don't fit in memory.
>
> To your question about WholeRowIterator, yes, it's fine. The iterator will
> always see sorted data (unless it's sitting on top of another iterator
> which breaks this... which is possible, but not recommended at all), even
> though the client may not. And yes, rows are never split (but if the query
> range doesn't include the full row, it may return early). WholeRowIterator
> and BatchScanner are orthogonal, and can be used together or not.
>
> On Fri, Jul 15, 2016 at 6:35 PM Mario Pastorelli <
> mario.pastorelli@teralytics.ch> wrote:
>
>> The WholeRowIterator is for filtering: I need all the columns that the
>> filter requires so that the filter can see whether the row matches the
>> query. That's the only proper way I found to implement logical operators
>> on predicates over columns of the same row.
>>
>> Actually, I do have a question about WholeRowIterator while we are
>> talking about it. Does it make sense when used with a BatchScanner? My
>> guess is yes because, while the BatchScanner can return unsorted data to
>> the client, the data is sorted when it is scanning a single tablet. Because
>> the data of the same rowId is never split (right?), there is no problem
>> in using a WholeRowIterator with a BatchScanner. Is this correct? I really
>> can't find much documentation for Accumulo and the book doesn't help enough.
>>
>> On Sat, Jul 16, 2016 at 12:29 AM, Christopher <ctubbsii@apache.org>
>> wrote:
>>
>>> It'd be more efficient to use the FirstEntryInRowIterator to just grab
>>> one each, rather than the WholeRowIterator which could use up a lot of
>>> memory unnecessarily.
>>>
>>> On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
>>> mario.pastorelli@teralytics.ch> wrote:
>>>
>>>> I'm actually using this after a WholeRowIterator, which is used to
>>>> filter rows with the same rowId.
>>>>
>>>> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum <wslacum@gmail.com>
>>>> wrote:
>>>>
>>>>> The iterator in the gist also counts cells/entries/KV pairs, not
>>>>> unique rows. You'll want to have some way to skip to the next row value
>>>>> if you want the count to reflect the number of rows being read.
>>>>>
>>>>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker <
>>>>> accumulo@shawn-walker.net> wrote:
>>>>>
>>>>>> My read is that you're mistaking the sequence of calls Accumulo will
>>>>>> be making to your iterator.  The sequence isn't quite the same as a
>>>>>> Java iterator (initially positioned "before" the first element), and
>>>>>> is more like a C++ iterator:
>>>>>>
>>>>>> 0. Accumulo calls seek(...)
>>>>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>>>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue() to
>>>>>> retrieve the data. You return a key indicating 0 columns seen (since
>>>>>> next() hasn't yet been called)
>>>>>> 3. First datum done, Accumulo calls next()
>>>>>> ...
>>>>>>
>>>>>> I imagine that if you pull the second item out of your scan result,
>>>>>> it'll have the number you expect.  Alternately, you might consider
>>>>>> performing the count computation during an override of the seek(...)
>>>>>> method, instead of in the next(...) method.
>>>>>>
>>>>>> --
>>>>>> Shawn Walker
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>>
>>>>>>> I'm trying to create a RowCounterIterator that counts all the rows
>>>>>>> and returns only one key-value with the counter inside. The problem
>>>>>>> is that I can't get it to work. The Scala code is available in the gist
>>>>>>> <https://gist.github.com/melrief/5f2ca248f1a980ddead2f2eeb19e6389>
>>>>>>> together with some pseudo-code of a test. The problem is that if I
>>>>>>> add an entry to my table, this iterator will return 0 instead of 1,
>>>>>>> and apparently the reason is that super.hasTop() is always false.
>>>>>>> I've tried without the iterator and the scanner returns 1 element.
>>>>>>> Any idea of what I'm doing wrong here? Is WrappingIterator the right
>>>>>>> class to extend for this kind of behaviour?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mario
>>>>>>>
>>>>>>> --
>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>
>>>>>>> *software engineer*
>>>>>>>
>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>> phone: +41794381682
>>>>>>> email: mario.pastorelli@teralytics.ch
>>>>>>> www.teralytics.net
>>>>>>>
>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>> Canton Zurich
>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>> Schmitz, Yann de Vries
>>>>>>>
>>>>>>> This e-mail message contains confidential information which is for
>>>>>>> the sole attention and use of the intended recipient. Please notify
>>>>>>> us at once if you think that it may not be intended for you and
>>>>>>> delete it immediately.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
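
The call order Shawn describes earlier in the thread (seek, then hasTop, getTopKey/getTopValue, next) and William's point about counting rows rather than entries can be illustrated with a small stand-alone Java sketch. It does not use the Accumulo API: Source, ListSource, RowCounter, and scan are hypothetical names invented here to model the protocol with row ids only, and emitting nothing for an empty source is an assumption of this sketch, not something stated in the thread. The count is computed in seek(), so the top entry is already valid the first time hasTop() is called.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone model of the seek/hasTop/getTop/next protocol; NOT the Accumulo API. */
final class RowCountSketch {

    /** Hypothetical stand-in for a sorted key-value source (row ids only). */
    interface Source {
        void seek();       // position on the first entry, if any
        boolean hasTop();  // is there a current entry?
        String topRow();   // row id of the current entry
        void next();       // advance to the following entry
    }

    /** In-memory source over row ids that are assumed to be already sorted. */
    static final class ListSource implements Source {
        private final List<String> rows;
        private int pos = -1; // -1 means seek() has not been called yet

        ListSource(List<String> sortedRows) { this.rows = new ArrayList<>(sortedRows); }

        public void seek() { pos = 0; }
        public boolean hasTop() { return pos >= 0 && pos < rows.size(); }
        public String topRow() { return rows.get(pos); }
        public void next() { pos++; }
    }

    /** Emits at most one entry: the number of distinct rows seen in the source. */
    static final class RowCounter {
        private final Source source;
        private long count;
        private boolean hasTop;

        RowCounter(Source source) { this.source = source; }

        /** All counting happens here, before the caller's first hasTop(). */
        void seek() {
            source.seek();
            count = 0;
            String previousRow = null;
            while (source.hasTop()) {
                String row = source.topRow();
                // Sorted input means a new row id marks a new row.
                if (!row.equals(previousRow)) {
                    count++;
                    previousRow = row;
                }
                source.next();
            }
            hasTop = count > 0; // assumption: emit nothing when no rows were seen
        }

        boolean hasTop() { return hasTop; }
        long topValue() { return count; }
        void next() { hasTop = false; } // the single entry has been consumed
    }

    /** Drives the counter the way a scan would: seek, then hasTop/top/next. */
    static List<Long> scan(List<String> sortedRows) {
        RowCounter it = new RowCounter(new ListSource(sortedRows));
        List<Long> out = new ArrayList<>();
        it.seek();
        while (it.hasTop()) {
            out.add(it.topValue());
            it.next();
        }
        return out;
    }
}
```

In this model, scanning the entries r1, r1, r2 yields a single value, 2, and a table with one entry yields 1. If the counting were done in next() instead, the first value read after seek() would still be 0, which matches the symptom reported in the original question.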
