accumulo-user mailing list archives

From Billie J Rinaldi <billie.j.rina...@ugov.gov>
Subject Re: Writing an iterator that calculates on compaction
Date Sat, 03 Mar 2012 01:16:51 GMT
Benson,

To calculate the centroid using an iterator, you would need to store
the data for each item in the cluster table. Iterators only read local
data. For example,

> clusterid  'items'  itemid1 dimension       value

where you pack 'items' and itemid1 into the CF, or itemid1 and
dimension into the CQ.
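
As a rough illustration of the two packing options, here is a sketch in
plain Python rather than Accumulo code. The NUL separator and the helper
names are my own assumptions, and sorting tuples of ASCII strings only
approximates how keys are ordered by row, then CF, then CQ.

```python
SEP = "\x00"  # assumed separator; any byte that cannot occur in the ids works


def key_cf_packed(cluster_id, item_id, dimension):
    """Option 1: pack 'items' and the item id into the column family."""
    return (cluster_id, "items" + SEP + item_id, dimension)


def key_cq_packed(cluster_id, item_id, dimension):
    """Option 2: keep CF='items' and pack item id and dimension into the CQ."""
    return (cluster_id, "items", item_id + SEP + dimension)


# Hypothetical sample data: (key, value) pairs in arbitrary insertion order.
entries = [
    (key_cq_packed("c1", "item2", "dimA"), 1.0),
    (key_cq_packed("c2", "item3", "dimA"), 4.0),
    (key_cq_packed("c1", "item1", "dimB"), 2.0),
    (key_cq_packed("c1", "item1", "dimA"), 3.0),
]

# Sorting mimics the table order: every entry of a cluster ends up in one
# contiguous run, grouped by item, which is what an iterator would see.
for key, value in sorted(entries):
    print(key, value)
```

Either way, all of a cluster's feature data lives in the cluster's own row,
so it is local to whatever reads that row.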

This could allow an iterator to calculate a centroid. If the
calculation is expensive (the clusters are very large or have
extremely high dimensionality), you may just want to run a task
that reads the data and writes out the centroid instead of using
an iterator for this. It might be most useful to have an iterator
that adds new data into an existing centroid. Then you would
need a way of determining what is new.
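
To make the arithmetic concrete, here is a minimal sketch in plain Python
(not an actual iterator). It assumes the centroid is the per-dimension mean
with missing dimensions counted as zero, and that the item count is stored
alongside the centroid so new items can be folded in later. The function
names are hypothetical.

```python
from collections import defaultdict


def centroid(entries):
    """Compute a centroid from (item_id, dimension, value) entries of one cluster.

    Returns (item_count, {dimension: mean}). A dimension an item does not
    have is treated as 0 for that item.
    """
    sums = defaultdict(float)
    items = set()
    for item_id, dimension, value in entries:
        items.add(item_id)
        sums[dimension] += value
    n = len(items)
    return n, {d: s / n for d, s in sums.items()}


def add_items(count, center, new_entries):
    """Fold new items into an existing centroid without rereading old data.

    Assumes new_entries contains only items that were not counted before.
    """
    new_count, new_center = centroid(new_entries)
    total = count + new_count
    dims = set(center) | set(new_center)
    return total, {
        d: (center.get(d, 0.0) * count + new_center.get(d, 0.0) * new_count) / total
        for d in dims
    }
```

The update is only correct if the new entries really are new, which is why
you need some way of telling them apart from data already folded in.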

Once the centroid and data are written in the same table, another
iterator could calculate the distance to centroid for each itemid. The
data will have to be re-sorted to get it arranged by distance, so an
efficient way to do that might be to run a scan-time iterator and
write the distance data back to an Accumulo table (possibly the same
one) using a batch writer or MR output format.
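
A sketch of that last step, again in plain Python and under assumptions of
my own: Euclidean distance (the thread does not name a metric), a
hypothetical upper bound MAX_DIST, and a fixed-width string key so that
plain lexicographic ordering yields descending distance.

```python
import math

MAX_DIST = 1e9  # hypothetical upper bound on any distance in the data


def distance(item, center):
    """Euclidean distance between two sparse vectors ({dimension: value})."""
    dims = set(item) | set(center)
    return math.sqrt(sum((item.get(d, 0.0) - center.get(d, 0.0)) ** 2 for d in dims))


def descending_key(dist):
    """Fixed-width string that sorts lexicographically from largest distance down."""
    return f"{MAX_DIST - dist:020.6f}"


def rows_by_distance(items, center):
    """items: {item_id: sparse vector}.

    Returns (sort_key, item_id, distance) tuples in the order a sorted
    table would hold them if sort_key were part of the key.
    """
    rows = []
    for item_id, vector in items.items():
        d = distance(vector, center)
        rows.append((descending_key(d), item_id, d))
    return sorted(rows)
```

Writing rows keyed this way back to a table means an ordinary scan returns
the items farthest from the centroid first.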

Billie


----- Original Message -----
> From: "Benson Margulies" <bimargulies@gmail.com>
> To: accumulo-user@incubator.apache.org
> Sent: Friday, March 2, 2012 3:59:34 PM
> Subject: Writing an iterator that calculates on compaction
> Folks,
> 
> I am trying to get organized to get my feet wet in using the ability
> of accumulo to compute near the data. I beg your pardon in advance for
> the following exercise in laying out what I have in mind and asking
> for some pointers -- particularly to examples on the 1.4 branch of
> code that I could warp to achieve my nefarious purposes.
> 
> So, start with this data model:
> 
> 
> ROWID CF CQ V
> itemid 'context' dimension value
> itemid something else entirely...
> 
> In short, for an 'item', there's a sparse feature vector associated
> with it (identified by cf='context'), and some other things.
> 
> Meanwhile, in another table we have:
> 
> clusterid 'items' itemid1 -blank-
> clusterid 'items' itemid2 -blank-
> 
> 
> In other words, a cluster is a grouping of the items from the first
> group, identified by their rowids.
> 
> My initial test of my ability to find my way around a brightly lit
> room with a flashlight is to calculate the centroids of these
> clusters, and store them as an additional CF:
> 
> CF='centroid' CQ=dimension V=value
> 
> And then my second test is to calculate the distance from each item to
> the centroid of its cluster, and store that. Finally, I want to
> peruse items in descending order of their distance-from-centroid
> values.
> 
> TIA
