accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Abnormal behaviour of custom iterator in getting entries
Date Fri, 19 Jun 2015 18:26:09 GMT
Also, apparently I wrote something similar to your problem a long time ago:

https://github.com/joshelser/accumulo-column-summing

The above implementation does assume large contiguous ranges. Thought it 
might be helpful anyways.
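
The point made in the quoted thread below (iterators produce one intermediate sum per tablet server, and the client still has to add those together) can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Accumulo API, and the class name `PartialSumDemo`, the methods `serverSideSum` and `clientSideTotal`, and the sample rows are all hypothetical. Each map stands in for the data hosted by one tablet server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartialSumDemo {

    // Hypothetical stand-in for an iterator running on ONE tablet server:
    // it only sees the rows hosted locally and returns one intermediate sum.
    public static long serverSideSum(Map<String, Long> tabletData, List<String> requestedRows) {
        long sum = 0;
        for (String row : requestedRows) {
            Long value = tabletData.get(row);
            if (value != null) {
                sum += value;
            }
        }
        return sum;
    }

    // Client side: combine the N intermediate sums into the final total.
    public static long clientSideTotal(List<Long> partialSums) {
        long total = 0;
        for (long partial : partialSums) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two pretend tablet servers, each hosting different rows.
        Map<String, Long> tablet1 = new TreeMap<>();
        tablet1.put("row1", 10L);
        tablet1.put("row2", 20L);
        Map<String, Long> tablet2 = new TreeMap<>();
        tablet2.put("row7", 5L);
        tablet2.put("row9", 7L);

        List<String> requestedRows = Arrays.asList("row1", "row2", "row9");

        List<Long> partials = new ArrayList<>();
        partials.add(serverSideSum(tablet1, requestedRows));
        partials.add(serverSideSum(tablet2, requestedRows));

        // prints "[30, 7] -> 37"
        System.out.println(partials + " -> " + clientSideTotal(partials));
    }
}
```

In a real deployment the client cannot know in advance how many intermediate sums it will receive, so the final addition has to tolerate any number of them.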

Josh Elser wrote:
> Good, I'm glad you found it useful.
>
> The important thing to always remember is that your data is split across
> many tablet servers and that Iterators run local to each tablet server.
> As such, you cannot compute a single sum via an iterator. At best, you
> can compute N intermediate sums, one for each tablet server the
> BatchScanner had to talk to.
>
> Also ignore my previous comment about a second iterator. I had assumed
> you were doing something fancier than selecting a single column
> qualifier from a row.
>
> Since you're passing in what are likely multiple, disjoint ranges, I'm
> not sure you're going to get much of a performance optimization out of a
> custom iterator in this case. After each seek, your iterator would need
> to return the entries that it summed in the provided Range. The Iterator
> framework isn't designed to know the overall state of the scan: you
> might have more data to read or you might be done, so you must return
> the data when the data you're reading moves outside of the current range.
>
> The way that you'd see the real optimization an Iterator provides is if
> you are scanning over a large, contiguous set of rows specified by a
> single Range (you can get the reduction of reading many key/values into
> a single pair returned).
>
> If I mis-stated your situation, please do let me know.
>
> madhvi wrote:
>> Hi,
>>
>> Thanks for the blog post you shared. I found it quite useful for my
>> requirement.
>> "How are you passing these IDs to the batch scanner?"
>> I am taking the row IDs returned by a previous query on another table,
>> adding each one as 'new Range(entry.getKey().getRow())' to a list of
>> Ranges, and passing that list to the batch scanner.
>>
>> "Are you trying to sum across all rows that you queried? "
>> Yes, we need to sum a particular column qualifier across the row IDs
>> passed to the batch scanner. How can the summation be done across rows,
>> as you said "you can put a second iterator "above" the first"?
>>
>> Thanks
>> Madhvi
>> On Wednesday 17 June 2015 08:43 PM, Josh Elser wrote:
>>> Madhvi,
>>>
>>> Understood. A few more questions..
>>>
>>> How are you passing these IDs to the batch scanner? Are you providing
>>> individual Ranges for each ID (e.g. `new Range(new Key("row1", "",
>>> "id1"), true, new Key("row1", "", "id1\0"), false)`)? Or are you
>>> providing an entire row (or set of rows) and using the
>>> fetchColumn(Text,Text) method (or similar) on the BatchScanner?
>>>
>>> Are you trying to sum across all rows that you queried? Or is your sum
>>> per-row? If the former, that is going to cause you problems. The quick
>>> explanation is that you can't reliably know the tablet boundaries so
>>> you should try to perform an initial sum, per row. If you want, you
>>> can put a second iterator "above" the first and do a summation across
>>> all rows to reduce the amount of data sent to a client. However, if
>>> you use a BatchScanner, you will still have to perform a final
>>> summation at the client.
>>>
>>> Check out
>>> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
>>>
>>> for more details on that..
>>>
>>> madhvi wrote:
>>>> Hi Josh,
>>>>
>>>> Sorry, my company policy doesn't allow me to share the full source. What
>>>> we are trying to do is sum over a unique field stored in the column
>>>> qualifier for the IDs passed to the batch scanner. Can you suggest how
>>>> it can be done in Accumulo?
>>>>
>>>> Thanks
>>>> Madhvi
>>>> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>>>>> You put random values in the family and qualifier? Do I misunderstand
>>>>> you?
>>>>>
>>>>> Also, if you can put up the full source for the iterator, that will be
>>>>> much easier if you need help debugging it. It's hard for us to guess
>>>>> at why your code might not be working as you expect.
>>>>>
>>>>> madhvi wrote:
>>>>>> Hi Josh,
>>>>>>
>>>>>> I have changed the HashMap to a TreeMap, which sorts lexicographically,
>>>>>> and I have inserted random values in the column family and qualifier.
>>>>>> The value from the TreeMap goes in the Value.
>>>>>> I used both a scanner and a batch scanner, but I am getting results
>>>>>> only with the scanner.
>>>>>>
>>>>>> Thanks
>>>>>> Madhvi
>>>>>>
>>>>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>>>>> problem for the specific data in your table, but it's not going to
>>>>>>> work for any data.
>>>>>>>
>>>>>>> Christopher wrote:
>>>>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Christopher L Tubbs II
>>>>>>>> http://gravatar.com/ctubbsii
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<madhvi.gupta@orkash.com>
>>>>>>>> wrote:
>>>>>>>>> Hi Josh,
>>>>>>>>> Thanks for replying. I will enable the remote debugger on my Accumulo
>>>>>>>>> server.
>>>>>>>>>
>>>>>>>>> However, I am slightly confused by your statement "you are not
>>>>>>>>> returning your data in sorted order". Can you point out the part of
>>>>>>>>> my iterator code which seems inappropriate and any possible solution
>>>>>>>>> for that?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Madhvi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>>>>> //matched the condition and put values to holder map.
>>>>>>>>>
>>>>>>
>>>>
>>
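
The sorted-order point raised in the thread (a HashMap gives no ordering guarantee, while a TreeMap iterates its keys in ascending order) can be illustrated with a small plain-Java sketch. It is not the poster's iterator, whose source was never shared, and it does not use the Accumulo API. The class `SortedHolderDemo` and its methods are hypothetical names, and the keys are ASCII strings so that Java's `String` ordering matches byte-wise lexicographic ordering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedHolderDemo {

    // Copies the buffered entries into a TreeMap and returns the keys in the
    // order the TreeMap iterates them, which is ascending key order.
    public static List<String> emitOrder(Map<String, String> holder) {
        return new ArrayList<>(new TreeMap<>(holder).keySet());
    }

    // Checks that every key is >= the one before it.
    public static boolean isSorted(List<String> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1).compareTo(keys.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A HashMap holder: iteration order is unspecified.
        Map<String, String> holder = new HashMap<>();
        holder.put("row3", "c");
        holder.put("row1", "a");
        holder.put("row2", "b");

        List<String> ordered = emitOrder(holder);
        // prints "[row1, row2, row3] sorted=true"
        System.out.println(ordered + " sorted=" + isSorted(ordered));
    }
}
```

A HashMap may happen to iterate in sorted order for a particular small data set, which is why a bug of this kind can go unnoticed in testing.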
