accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Making a RowCounterIterator
Date Sat, 16 Jul 2016 01:07:32 GMT
+1 and we'll add you to the contributors list for doing so, if you want and
aren't already on it.

On Fri, Jul 15, 2016, 20:18 Dylan Hutchison <dhutchis@cs.washington.edu>
wrote:

> Hi Mario,
>
> As you gain more experience with Accumulo, feel free to write or modify
> Accumulo's documentation in the places you find it lacking and send a PR.
> If you find a topic confusing, probably many others do too.
>
> Cheers, Dylan
>
> On Fri, Jul 15, 2016 at 4:04 PM, Christopher <ctubbsii@apache.org> wrote:
>
>> Ah, I thought you were doing WholeRowIterator -> RowCounterIterator
>> I now understand you're doing WholeRowIterator -> SomeCustomFilter
>> (column predicate) -> RowCounterIterator
>>
>> That's okay to do, but it may be better to have an iterator that creates
>> a clone of its source at the beginning of each row, advances to do the
>> filtering, and then informs the spawning iterator to either accept or
>> reject. This is, admittedly, far more complicated than WholeRowIterator,
>> but it can be safer if you have really big rows which don't fit in memory.
>>
>> To your question about WholeRowIterator, yes, it's fine. The iterator
>> will always see sorted data (unless it's sitting on top of another iterator
>> which breaks this... which is possible, but not recommended at all), even
>> though the client may not. And yes, rows are never split (but if the query
>> range doesn't include the full row, it may return early). WholeRowIterator
>> and BatchScanner are orthogonal, and can be used together or not.
>>
>> On Fri, Jul 15, 2016 at 6:35 PM Mario Pastorelli <
>> mario.pastorelli@teralytics.ch> wrote:
>>
>>> The WholeRowIterator is for filtering: I need all the columns that the
>>> filter requires so that the filter can see whether the row matches the
>>> query. That's the only proper way I found to implement logical operators on
>>> predicates over columns of the same row.
>>>
>>> Actually I do have a question about WholeRowIterator, while we are
>>> talking about them. Do they make sense when used with a BatchScanner? My
>>> guess is yes because while the BatchScanner can return data non-sorted to
>>> the client, when it is scanning a single tablet the data is sorted. Because
>>> the data of the same rowId is never split (right?) then there is no problem
>>> in using a WholeRowIterator with a BatchScanner. Is this correct? I really
>>> can't find much documentation for Accumulo and the book doesn't help enough.
>>>
>>> On Sat, Jul 16, 2016 at 12:29 AM, Christopher <ctubbsii@apache.org>
>>> wrote:
>>>
>>>> It'd be more efficient to use the FirstEntryInRowIterator to just grab
>>>> one each, rather than the WholeRowIterator which could use up a lot of
>>>> memory unnecessarily.
>>>>
>>>> On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>
>>>>> I'm actually using this after a WholeRowIterator, which is used to
>>>>> filter rows with the same rowId.
>>>>>
>>>>> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum <wslacum@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The iterator in the gist also counts cells/entries/KV pairs, not
>>>>>> unique rows. You'll want to have some way to skip to the next row value
>>>>>> if you want the count to be reflective of the number of rows being read.
>>>>>>
>>>>>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker <
>>>>>> accumulo@shawn-walker.net> wrote:
>>>>>>
>>>>>>> My read is that you're mistaking the sequence of calls Accumulo will
>>>>>>> be making to your iterator.  The sequence isn't quite the same as a Java
>>>>>>> iterator (initially positioned "before" the first element), and is more
>>>>>>> like a C++ iterator:
>>>>>>>
>>>>>>> 0. Accumulo calls seek(...)
>>>>>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>>>>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue()
>>>>>>> to retrieve the data. You return a key indicating 0 columns seen (since
>>>>>>> next() hasn't yet been called)
>>>>>>> 3. First datum done, Accumulo calls next()
>>>>>>> ...
>>>>>>>
>>>>>>> I imagine that if you pull the second item out of your scan result,
>>>>>>> it'll have the number you expect.  Alternately, you might consider
>>>>>>> performing the count computation during an override of the seek(...)
>>>>>>> method, instead of in the next(...) method.
>>>>>>>
>>>>>>> --
>>>>>>> Shawn Walker
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>>>>>> mario.pastorelli@teralytics.ch> wrote:
>>>>>>>
>>>>>>>> I'm trying to create a RowCounterIterator that counts all the rows
>>>>>>>> and returns only one key-value with the counter inside. The problem is
>>>>>>>> that I can't get it to work. The Scala code is available in the gist
>>>>>>>> <https://gist.github.com/melrief/5f2ca248f1a980ddead2f2eeb19e6389>
>>>>>>>> together with some pseudo-code of a test. The problem is that if I add
>>>>>>>> an entry to my table, this iterator will return 0 instead of 1 and
>>>>>>>> apparently the reason is that super.hasTop() is always false. I've tried
>>>>>>>> without the iterator and the scanner returns 1 element. Any idea of what
>>>>>>>> I'm doing wrong here? Is WrappingIterator the right class to extend for
>>>>>>>> this kind of behaviour?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>>
>>>>>>>> *software engineer*
>>>>>>>>
>>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>>> phone: +41794381682
>>>>>>>> email: mario.pastorelli@teralytics.ch
>>>>>>>> www.teralytics.net
>>>>>>>>
>>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>>> Canton Zurich
>>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>>> Schmitz, Yann de Vries
>>>>>>>>
>>>>>>>> This e-mail message contains confidential information which is for
>>>>>>>> the sole attention and use of the intended recipient. Please notify us
>>>>>>>> at once if you think that it may not be intended for you and delete it
>>>>>>>> immediately.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
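
The call order Shawn Walker describes (seek, then hasTop, then getTopKey/getTopValue, then next), his suggestion to do the counting during seek, and William Slacum's point about counting rows rather than entries can be sketched together in plain Java. This is only an illustrative model under stated assumptions: `Source`, `ListSource`, `RowCounter`, `Entry` and `scan` are hypothetical names invented for this sketch, the interface is a simplified stand-in and not Accumulo's SortedKeyValueIterator (whose seek takes a range and column families), and it is not the code from the gist.

```java
import java.util.ArrayList;
import java.util.List;

class RowCountSketch {
    /** One cell: a row id plus a value (hypothetical, much simpler than an Accumulo Key/Value). */
    static final class Entry {
        final String row;
        final String value;
        Entry(String row, String value) { this.row = row; this.value = value; }
    }

    /** Simplified, hypothetical model of a sorted source iterator. */
    interface Source {
        void seek();      // position on the first entry
        boolean hasTop(); // is there a current entry?
        Entry top();      // current entry, valid only while hasTop() is true
        void next();      // advance past the current entry
    }

    /** In-memory source over entries already sorted by row. It has no top until seek() is called. */
    static final class ListSource implements Source {
        private final List<Entry> entries;
        private int pos;
        ListSource(List<Entry> sortedEntries) {
            this.entries = sortedEntries;
            this.pos = sortedEntries.size(); // unpositioned before seek()
        }
        public void seek() { pos = 0; }
        public boolean hasTop() { return pos < entries.size(); }
        public Entry top() { return entries.get(pos); }
        public void next() { pos++; }
    }

    /** Drains its source inside seek() and then exposes exactly one entry holding the row count. */
    static final class RowCounter implements Source {
        private final Source source;
        private Entry result; // null means "no top"
        RowCounter(Source source) { this.source = source; }

        public void seek() {
            source.seek();
            long rows = 0;
            String lastRow = null;
            while (source.hasTop()) {
                String row = source.top().row;
                // Input is sorted, so a row change means a new distinct row.
                if (!row.equals(lastRow)) {
                    rows++;
                    lastRow = row;
                }
                source.next();
            }
            result = new Entry("count", Long.toString(rows));
        }
        public boolean hasTop() { return result != null; }
        public Entry top() { return result; }
        public void next() { result = null; } // the single result has been consumed
    }

    /** Drives an iterator in the order described in the thread: seek, then hasTop/top/next. */
    static List<Entry> scan(Source it) {
        List<Entry> out = new ArrayList<>();
        it.seek();
        while (it.hasTop()) {
            out.add(it.top());
            it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> data = new ArrayList<>();
        data.add(new Entry("r1", "a"));
        data.add(new Entry("r1", "b"));
        data.add(new Entry("r2", "a"));
        List<Entry> out = scan(new RowCounter(new ListSource(data)));
        System.out.println(out.get(0).value); // three entries, two distinct rows: prints 2
    }
}
```

Because the count is ready as soon as seek() returns, the first top the caller reads already carries the final number, rather than a value that only becomes correct after next(). Like the WholeRowIterator caveat in the thread, this sketch assumes the source presents entries sorted by row.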
