accumulo-user mailing list archives

From ivan bella <>
Subject Re: Net ColumnFamily Count
Date Thu, 20 Oct 2016 21:18:21 GMT
<!DOCTYPE html>
    <meta charset="UTF-8">
</head><body><p>I do not have any reference code for you.&#160; However,
basically you want to write a program that scans one table and creates new, transformed Keys,
which you write as Mutations to another table.&#160; Each transformed Key&#39;s
row would be the column family of the key you pulled from the scan, and the value would be
a 1 encoded using one of the encoders in the LongCombiner class.&#160; You would create
the new table you are going to write to manually in the Accumulo shell and set a SummingCombiner
on the majc, minc, and scan scopes with the same encoder you used.&#160; Run your program, compact
the new table, and then scan it.<br></p><p><br></p><blockquote
type="cite">On October 20, 2016 at 4:07 PM Yamini Joshi
wrote:<br><br><div dir="ltr">Alright! Do you happen to have some reference
code that I can refer to? I am a newbie, and I am not sure whether by caching, aggregating, and merge
sort you mean using some Accumulo wrapper or writing some simple Java code.<br></div><div
class="ox-3f54c3c603-gmail_extra"><br clear="all"><div><div class="ox-3f54c3c603-gmail_signature"><div
dir="ltr"><div>Best regards,<br>Yamini Joshi</div></div></div></div><br><div
class="ox-3f54c3c603-gmail_quote">On Thu, Oct 20, 2016 at 2:49 PM, ivan bella
wrote:<br><blockquote><u></u><div><p>That is essentially
the same thing, but instead of doing it within an iterator, you are letting Accumulo do the
work!&#160; Perfect.<br></p><div><div class="ox-3f54c3c603-h5"><blockquote
type="cite">On October 20, 2016 at 3:38 PM <a href="" target="_blank"></a>
wrote:<br><br><div>I am wondering what the complexity would be for this
and also how it compares to creating a new table with the required reversed data and calculating
the sum using an iterator.<br><br>Sent from my iPhone</div><div><br>On
Oct 20, 2016, at 2:07 PM, ivan bella
wrote:<br><br></div><blockquote type="cite"><div><p>You
could cache results in an internal map.&#160; Once the number of entries in your map reaches
a set threshold, you could dump them to a separate file in HDFS and then start building
a new map.&#160; Once you have completed the underlying scan, do a merge sort and aggregation
of the written files to start returning the keys.&#160; I did something similar to this,
and it seems to work well.&#160; You might want to use RFiles as the underlying format,
which would enable reuse of some Accumulo code when doing the merge sort.&#160; It
would also allow more efficient reseeking into the RFiles if your iterator gets torn down and reconstructed,
provided you detect this and at least avoid redoing the entire scan.<br></p><blockquote
type="cite">On October 20, 2016 at 1:22 PM Yamini Joshi wrote:<br><br><div
dir="ltr"><div>Hello all<br><br></div>I am trying to find the number
of times each of a set of column families appears in a set of records (irrespective of the row IDs).
Is it possible to do this on the server side? My concern is that if the set of column families
is huge, the server might run into memory constraints. Also, we might need to generate
new keys with the column family name as the key and the count as the value.<br><div><br
clear="all"><div><div><div class="ox-3f54c3c603-m_-6791074042057994589ox-c152bb93ed-ox-8fc481c671-gmail_signature"><div
dir="ltr"><div>Best regards,<br>Yamini Joshi</div></div></div></div></div></div></div></blockquote></div></blockquote></blockquote></div></div></div></blockquote></div><br></div></blockquote></body></html>
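
For the caching, spilling, and merge-sort approach quoted in the thread above, a minimal stand-alone sketch in plain Java might look like the following. It is an illustration under stated assumptions, not code from this thread and not an Accumulo iterator: it uses local temporary text files in place of RFiles in HDFS, treats column families as strings that contain no newline, and the names CfCountSpill, count, spill, Run, and maxEntries are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical sketch: count column families with a bounded in-memory map,
// spilling sorted runs to temp files and merging them at the end.
class CfCountSpill {

    /** One open spill file plus the entry most recently read from it. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String cf;
        long count;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        }

        /** Reads the next "cf<TAB>count" line; returns false (and closes) at end of file. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                reader.close();
                return false;
            }
            int tab = line.lastIndexOf('\t');
            cf = line.substring(0, tab);
            count = Long.parseLong(line.substring(tab + 1));
            return true;
        }

        @Override
        public int compareTo(Run other) {
            return cf.compareTo(other.cf);
        }
    }

    /** Writes the sorted map to a temp file, one "cf<TAB>count" line per entry. */
    private static Path spill(TreeMap<String, Long> counts) throws IOException {
        Path file = Files.createTempFile("cfcount-", ".run");
        file.toFile().deleteOnExit();
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue());
                out.newLine();
            }
        }
        return file;
    }

    /** Counts each column family and emits (cf, total) pairs to sink in sorted order. */
    static void count(Iterable<String> columnFamilies, int maxEntries,
                      BiConsumer<String, Long> sink) throws IOException {
        List<Path> runs = new ArrayList<>();
        TreeMap<String, Long> cache = new TreeMap<>();
        for (String cf : columnFamilies) {
            cache.merge(cf, 1L, Long::sum);
            if (cache.size() >= maxEntries) {   // map is full: spill it and start a new one
                runs.add(spill(cache));
                cache.clear();
            }
        }
        if (!cache.isEmpty()) {
            runs.add(spill(cache));
        }

        // k-way merge: the heap always yields the run whose current cf is smallest.
        PriorityQueue<Run> heap = new PriorityQueue<>();
        for (Path p : runs) {
            Run r = new Run(p);
            if (r.advance()) {
                heap.add(r);
            }
        }
        String current = null;
        long total = 0;
        while (!heap.isEmpty()) {
            Run r = heap.poll();
            if (!r.cf.equals(current)) {        // new cf: emit the finished one
                if (current != null) {
                    sink.accept(current, total);
                }
                current = r.cf;
                total = 0;
            }
            total += r.count;
            if (r.advance()) {
                heap.add(r);
            }
        }
        if (current != null) {
            sink.accept(current, total);
        }
    }
}
```

In a real iterator, the input would come from the underlying scan rather than an Iterable, and the spill files would need to survive the iterator being torn down and reconstructed, which is the case the RFile suggestion in the thread addresses. The simpler alternative from the top reply (write a 1 per column family to a second table and let a SummingCombiner add them up) avoids this bookkeeping entirely.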
