accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Making a RowCounterIterator
Date Fri, 15 Jul 2016 23:04:25 GMT
Ah, I thought you were doing WholeRowIterator -> RowCounterIterator.
I now understand you're doing WholeRowIterator -> SomeCustomFilter (column
predicate) -> RowCounterIterator.

That's okay to do, but it may be better to have an iterator that creates a
clone of its source at the beginning of each row, advances to do the
filtering, and then informs the spawning iterator to either accept or
reject. This is, admittedly, far more complicated than WholeRowIterator,
but it can be safer if you have really big rows which don't fit in memory.
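
A rough sketch of that clone-and-probe idea, as a toy model in plain Java rather than against the Accumulo API: a sorted list stands in for the source iterator, and a second index stands in for the cloned source. The names RowFilterSketch, Entry, rowMatches and filterRows are made up for illustration, and the row condition (every required predicate is satisfied by some entry in the row) is only an assumed example of a column predicate. The result is collected into a list only to keep the sketch small; a real iterator would hand each entry downstream instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model only: a sorted list stands in for an Accumulo source iterator. */
class RowFilterSketch {

    /** One key-value pair, reduced to row ID, column, and value. */
    static final class Entry {
        final String row;
        final String column;
        final String value;

        Entry(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    /**
     * Plays the role of the cloned source: scans ahead from rowStart to the end
     * of that row and reports whether every required predicate was satisfied by
     * at least one entry. Only one boolean per predicate is kept in memory.
     */
    static boolean rowMatches(List<Entry> sorted, int rowStart,
                              List<Predicate<Entry>> required) {
        String row = sorted.get(rowStart).row;
        boolean[] satisfied = new boolean[required.size()];
        for (int probe = rowStart;
             probe < sorted.size() && sorted.get(probe).row.equals(row);
             probe++) {
            for (int i = 0; i < satisfied.length; i++) {
                if (required.get(i).test(sorted.get(probe))) {
                    satisfied[i] = true;
                }
            }
        }
        for (boolean s : satisfied) {
            if (!s) {
                return false;
            }
        }
        return true;
    }

    /** The "spawning" cursor: emits a row's entries only if the probe accepted it. */
    static List<Entry> filterRows(List<Entry> sorted, List<Predicate<Entry>> required) {
        List<Entry> accepted = new ArrayList<>();
        int pos = 0;
        while (pos < sorted.size()) {
            String row = sorted.get(pos).row;
            boolean accept = rowMatches(sorted, pos, required);
            // Walk the main cursor over the row, emitting or skipping each entry.
            while (pos < sorted.size() && sorted.get(pos).row.equals(row)) {
                if (accept) {
                    accepted.add(sorted.get(pos));
                }
                pos++;
            }
        }
        return accepted;
    }
}
```

The trade-off is that every row is read twice (once by the probe, once by the main cursor) in exchange for never holding a whole row in memory.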

To your question about WholeRowIterator, yes, it's fine. The iterator will
always see sorted data (unless it's sitting on top of another iterator
which breaks this... which is possible, but not recommended at all), even
though the client may not. And yes, rows are never split (but if the query
range doesn't include the full row, it may return early). The BatchScanner
and the WholeRowIterator are orthogonal, and can be used together or not.
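
For reference, the call order Shawn describes further down this thread (seek first, then hasTop, then reading the top, then next), combined with William's point about counting rows rather than entries, can be sketched roughly as below. This is a toy model in plain Java, not Accumulo's real interfaces: ListSource, RowCounter and scanOnce are made-up names, the simplified seek() takes no range or column families, and distinct rows are detected by comparing adjacent row IDs instead of seeking to the next row. Returning a count of 0 for an empty source is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the seek/hasTop/getTop/next call sequence; not Accumulo's real interfaces. */
class RowCounterSketch {

    /** Source positioned ON its current entry (C++ style), over sorted row IDs. */
    static final class ListSource {
        private final List<String> rowIds; // one row ID per entry, already sorted
        private int pos;

        ListSource(List<String> rowIds) { this.rowIds = rowIds; }

        void seek() { pos = 0; }
        boolean hasTop() { return pos < rowIds.size(); }
        String topRow() { return rowIds.get(pos); }
        void next() { pos++; }
    }

    /** Counts distinct rows during seek() and then exposes exactly one result. */
    static final class RowCounter {
        private final ListSource source;
        private long count;
        private boolean hasTop;

        RowCounter(ListSource source) { this.source = source; }

        void seek() {
            source.seek();
            count = 0;
            String previous = null;
            // Drain the source here, so the count is ready before the first hasTop().
            while (source.hasTop()) {
                String row = source.topRow();
                if (!row.equals(previous)) { // sorted input: a new row ID means a new row
                    count++;
                    previous = row;
                }
                source.next();
            }
            hasTop = true; // one result to hand out, even when the count is 0
        }

        boolean hasTop() { return hasTop; }
        long topValue() { return count; }
        void next() { hasTop = false; } // the single result has been consumed
    }

    /** Drives the iterator in the order described: seek, hasTop, read the top, next. */
    static long scanOnce(RowCounter it) {
        it.seek();
        long result = -1;
        while (it.hasTop()) {
            result = it.topValue();
            it.next();
        }
        return result;
    }

    public static void main(String[] args) {
        ListSource source = new ListSource(Arrays.asList("r1", "r1", "r2", "r3", "r3"));
        System.out.println(scanOnce(new RowCounter(source))); // prints 3
    }
}
```

Because the count is computed in seek(), the very first top the caller reads already holds the final number, which avoids the "0 on the first read" symptom from the original question.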

On Fri, Jul 15, 2016 at 6:35 PM Mario Pastorelli <
mario.pastorelli@teralytics.ch> wrote:

> The WholeRowIterator is for filtering: I need all the columns that the
> filter requires so that the filter can see whether the row matches the
> query or not. That's the only proper way I found to implement logical
> operators on predicates over columns of the same row.
>
> Actually I do have a question about WholeRowIterator, while we are talking
> about them. Do they make sense when used with a BatchScanner? My guess is
> yes because while the BatchScanner can return data non-sorted to the
> client, when it is scanning a single tablet the data is sorted. Because the
> data of the same rowId is never split (right?) then there is no problem in
> using a WholeRowIterator with a BatchScanner. Is this correct? I really
> can't find much documentation for Accumulo and the book doesn't help enough.
>
> On Sat, Jul 16, 2016 at 12:29 AM, Christopher <ctubbsii@apache.org> wrote:
>
>> It'd be more efficient to use the FirstEntryInRowIterator to just grab
>> one each, rather than the WholeRowIterator which could use up a lot of
>> memory unnecessarily.
>>
>> On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
>> mario.pastorelli@teralytics.ch> wrote:
>>
>>> I'm actually using this after a WholeRowIterator, which is used to
>>> filter rows with the same rowId.
>>>
>>> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum <wslacum@gmail.com>
>>> wrote:
>>>
>>>> The iterator in the gist also counts cells/entries/KV pairs, not unique
>>>> rows. You'll want to have some way to skip to the next row value if you
>>>> want the count to be reflective of the number of rows being read.
>>>>
>>>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker <
>>>> accumulo@shawn-walker.net> wrote:
>>>>
>>>>> My read is that you're mistaking the sequence of calls Accumulo will
>>>>> be making to your iterator.  The sequence isn't quite the same as a Java
>>>>> iterator (initially positioned "before" the first element), and is more
>>>>> like a C++ iterator:
>>>>>
>>>>> 0. Accumulo calls seek(...)
>>>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue() to
>>>>> retrieve the data. You return a key indicating 0 columns seen (since
>>>>> next() hasn't yet been called)
>>>>> 3. First datum done, Accumulo calls next()
>>>>> ...
>>>>>
>>>>> I imagine that if you pull the second item out of your scan result,
>>>>> it'll have the number you expect.  Alternately, you might consider
>>>>> performing the count computation during an override of the seek(...)
>>>>> method, instead of in the next(...) method.
>>>>>
>>>>> --
>>>>> Shawn Walker
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>
>>>>>> I'm trying to create a RowCounterIterator that counts all the rows
>>>>>> and returns only one key-value with the counter inside. The problem is
>>>>>> that I can't get it to work. The Scala code is available in the gist
>>>>>> <https://gist.github.com/melrief/5f2ca248f1a980ddead2f2eeb19e6389>
>>>>>> together with some pseudo-code of a test. The problem is that if I add
>>>>>> an entry to my table, this iterator will return 0 instead of 1 and
>>>>>> apparently the reason is that super.hasTop() is always false. I've tried
>>>>>> without the iterator and the scanner returns 1 element. Any idea of what
>>>>>> I'm doing wrong here? Is WrappingIterator the right class to extend for
>>>>>> this kind of behaviour?
>>>>>>
>>>>>> Thanks,
>>>>>> Mario
>>>>>>
>>>>>> --
>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>
>>>>>> *software engineer*
>>>>>>
>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>> phone: +41794381682
>>>>>> email: mario.pastorelli@teralytics.ch
>>>>>> www.teralytics.net
>>>>>>
>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>> Canton Zurich
>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>>>>>> Yann de Vries
>>>>>>
>>>>>> This e-mail message contains confidential information which is for
>>>>>> the sole attention and use of the intended recipient. Please notify us
>>>>>> at once if you think that it may not be intended for you and delete it
>>>>>> immediately.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
