accumulo-user mailing list archives

From ivan bella <i...@ivan.bella.name>
Subject Re: Net ColumnFamily Count
Date Thu, 20 Oct 2016 19:07:34 GMT
<!DOCTYPE html>
<html><head>
    <meta charset="UTF-8">
</head><body><p>You could cache results in an internal map. Once
the number of entries in the map reaches a chosen threshold, you could dump them to a separate
file in HDFS and then start building a new map. Once you have completed the underlying
scan, merge-sort the written files, aggregating entries with the same key, and start returning the keys.
I did something similar to this and it seems to work well. You might want to use
RFiles as the underlying format, which would enable reuse of some Accumulo code when doing
the merge sort. It would also allow more efficient reseeking into the RFiles if
your iterator gets torn down and reconstructed, provided you detect this and at least avoid
redoing the entire scan.<br></p><blockquote type="cite">On October 20, 2016
at 1:22 PM Yamini Joshi &#60;yamini.1691@gmail.com&#62; wrote:<br><br><div
dir="ltr"><div>Hello all<br><br></div>I am trying to find the number
of times each of a set of column families appears in a set of records (irrespective of the row IDs).
Is it possible to do this on the server side? My concern is that if the set of column families
is huge, the computation might hit memory constraints on the server side. Also, we might need to generate
new keys with the column family name as the key and the count as the value.<br><div><br
clear="all"><div><div><div class="ox-8fc481c671-gmail_signature"><div
dir="ltr"><div>Best regards,<br>Yamini Joshi</div></div></div></div></div></div></div></blockquote></body></html>
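
The spill-and-merge approach suggested in the reply above can be sketched roughly as follows. This is only an illustration in plain Python, not Accumulo iterator code: local temporary files stand in for files in HDFS, a tab-separated text format stands in for RFiles, and the names `count_families`, `spill`, `read_run`, and `max_entries` are hypothetical. It also assumes column family names are strings that contain no tab or newline characters. Each spilled file is written in sorted order so that the final pass can be a streaming merge.

```python
import heapq
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def spill(counts, spill_dir):
    """Write one in-memory map to a sorted run file, one 'family<TAB>count' per line."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for family in sorted(counts):
            out.write(f"{family}\t{counts[family]}\n")
    return path


def read_run(path):
    """Stream (family, count) pairs back from a run file in sorted order."""
    with open(path, encoding="utf-8") as run:
        for line in run:
            family, count = line.rstrip("\n").split("\t")
            yield family, int(count)


def count_families(families, spill_dir, max_entries=1000):
    """Yield (family, total_count) in sorted order with bounded memory use."""
    counts, runs = {}, []
    for family in families:  # stands in for the underlying scan
        counts[family] = counts.get(family, 0) + 1
        if len(counts) >= max_entries:
            # The map is full: dump it to its own file and start a new map.
            runs.append(spill(counts, spill_dir))
            counts = {}
    if counts:
        runs.append(spill(counts, spill_dir))

    # Merge the sorted runs and sum the partial counts of equal families.
    merged = heapq.merge(*(read_run(path) for path in runs), key=itemgetter(0))
    for family, group in groupby(merged, key=itemgetter(0)):
        yield family, sum(count for _, count in group)
```

Memory use is bounded by `max_entries` plus one buffered entry per run file during the merge. The sketch does not cover the case where the iterator is torn down and reconstructed; handling that would require recording which run files are already complete, as the reply notes.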
 
