accumulo-user mailing list archives

From yamini.1...@gmail.com
Subject Re: Net ColumnFamily Count
Date Thu, 20 Oct 2016 19:38:40 GMT
I am wondering what the complexity of this approach would be, and how it compares to creating
a new table with the required reversed data and calculating the sum using an iterator.

Sent from my iPhone

> On Oct 20, 2016, at 2:07 PM, ivan bella <ivan@ivan.bella.name> wrote:
> 
> You could cache results in an internal map. Once the number of entries in your map reaches
> a certain threshold, you could dump them to a separate file in HDFS and then start building
> a new map. Once you have completed the underlying scan, do a merge sort and aggregation of
> the written files to start returning the keys. I did something similar to this and it seems
> to work well. You might want to use RFiles as the underlying format, which would enable reuse
> of some Accumulo code when doing the merge sort. It would also allow more efficient reseeking
> into the RFiles if your iterator gets torn down and reconstructed, provided you detect this
> and at least avoid redoing the entire scan.
> 
>> On October 20, 2016 at 1:22 PM Yamini Joshi <yamini.1691@gmail.com> wrote:
>> 
>> Hello all
>> 
>> I am trying to find the number of times each of a set of column families appears in a set
>> of records (irrespective of the row IDs). Is it possible to do this on the server side? My
>> concern is that if the set of column families is huge, it might run into memory constraints on
>> the server side. Also, we might need to generate new keys with the column family name as the key
>> and the count as the value.
>> 
>> Best regards,
>> Yamini Joshi
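
The spill-and-merge approach described in the quoted reply can be sketched roughly as follows. This is only an illustration in plain Python, not Accumulo iterator code: the function names (spill, read_run, count_families), the max_entries threshold, and the tab-separated text files are all assumptions made for the example. Local temp files stand in for the HDFS files or RFiles mentioned in the reply, and column family names are assumed to contain no tabs or newlines.

```python
import heapq
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def spill(counts, spill_dir):
    """Write one in-memory map to disk as a sorted run of 'family<TAB>count' lines."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as out:
        for family in sorted(counts):
            out.write(f"{family}\t{counts[family]}\n")
    return path


def read_run(path):
    """Stream (family, count) pairs back from a sorted run."""
    with open(path) as run:
        for line in run:
            family, count = line.rstrip("\n").split("\t")
            yield family, int(count)


def count_families(families, max_entries, spill_dir):
    """Yield (family, total) in sorted order, keeping at most max_entries keys in memory."""
    counts, runs = {}, []
    for family in families:
        counts[family] = counts.get(family, 0) + 1
        if len(counts) >= max_entries:
            # The map is full: dump it as a sorted run and start a new map.
            runs.append(spill(counts, spill_dir))
            counts = {}
    if counts:
        runs.append(spill(counts, spill_dir))
    # Merge the sorted runs and sum the partial counts for each family.
    merged = heapq.merge(*(read_run(path) for path in runs))
    for family, group in groupby(merged, key=itemgetter(0)):
        yield family, sum(count for _, count in group)
    # Run files are removed only once the generator has been fully consumed.
    for path in runs:
        os.remove(path)
```

Under these assumptions, each scanned entry is counted once, each spilled map of m keys costs about O(m log m) to sort, and the final merge over k runs costs about O(N log k) for N spilled entries in total. Memory stays bounded by max_entries no matter how many distinct column families exist. The sketch does not cover the iterator being torn down and reconstructed mid-scan, which the reply notes would need separate handling.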
