accumulo-user mailing list archives

From Billie J Rinaldi <billie.j.rina...@ugov.gov>
Subject Re: Writing an iterator that calculates on compaction
Date Sat, 03 Mar 2012 17:20:07 GMT
On Friday, March 2, 2012 8:57:43 PM, "Benson Margulies" <bimargulies@gmail.com> wrote:
> On Fri, Mar 2, 2012 at 8:16 PM, Billie J Rinaldi
> <billie.j.rinaldi@ugov.gov> wrote:
> > Benson,
> >
> > To calculate the centroid using an iterator, you would need to store
> > the data for each item in the cluster table. Iterators only read
> > local
> > data. For example,
> >
> >> clusterid 'items' itemid1 dimension value
> >
> > where you pack 'items' and itemid1 into the CF or itemid1 and
> > dimension into the CQ.
> >
> > This could allow an iterator to calculate a centroid. If the
> > calculation is expensive (the clusters are very large or have
> > extremely high dimensionality), you may just want to run a task
> > that reads the data and writes out the centroid instead of using
> > an iterator for this. It might be most useful to have an iterator
> > that adds new data into an existing centroid. Then you would
> > need a way of determining what is new.
> 
> I see, I'm prematurely optimizing here. That's a side-effect of trying
> to simultaneously think about the actual clustering problem and using
> accumulo at all. The number of dimensions per item isn't gigantic;
> it's a feature vector representing the context around a name mention
> in text. The number of clusters will eventually be large, that's the
> reason to think big in the first place. The current implementation
> just stuffs all data into a Lucene index. The clustering uses a
> combination of vector-space distance on these vectors and also a
> separate metric of textual name similarity; the latter starts with a
> Lucene query to find candidates and then scores them.
> 
> You have indeed identified my eventual reason for thinking about an
> iterator-based solution, which is to be able to ingest new items
> incrementally, updating centroids etc. However, I had better
> demonstrate that I can port the whole business into accumulo at all
> first before I try to make it work incrementally and optimize that.
> 
> It occurs to me that another facet of my ignorance is on display here.
> When you wrote 'task' above, were you thinking MR, or a conventional
> program talking to the accumulo API, or something else?

Good catch -- I deliberately wrote 'task' instead of MR because in some cases it will be more
efficient to use a single client scanning over a table, and in other cases it will be more
efficient to use MR.  I would recommend trying both. 
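As a rough sketch of the single-client version of that task (all names here are hypothetical, and the in-memory List stands in for the entries an Accumulo Scanner over the cluster table would return, already unpacked into itemid / dimension / value for one clusterid):

```java
import java.util.*;

// Hypothetical sketch of the client-side "task" approach: one pass over
// the cluster data, accumulating per-dimension sums, then dividing by
// the number of distinct items. Each String[] stands in for one entry a
// Scanner would return: { itemid, dimension, value }.
public class CentroidTask {
    public static Map<String, Double> centroid(List<String[]> entries) {
        Map<String, Double> sums = new HashMap<>();   // dimension -> running sum
        Set<String> items = new HashSet<>();          // distinct itemids seen
        for (String[] e : entries) {
            items.add(e[0]);
            sums.merge(e[1], Double.parseDouble(e[2]), Double::sum);
        }
        // Missing dimensions in the sparse vectors count as 0, so each
        // per-dimension sum is divided by the total number of items.
        Map<String, Double> centroid = new HashMap<>();
        for (Map.Entry<String, Double> s : sums.entrySet()) {
            centroid.put(s.getKey(), s.getValue() / items.size());
        }
        return centroid;
    }
}
```

In the real task, the result would be written back as CF='centroid' entries with a batch writer.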

> Anyway, thanks for continuing to help me think my way around this
> stuff.
> 
> Meanwhile, just for long-term code reading, is one of the iterators in
> the 1.4 tree a good model for compaction? I also still have this idea
> to experiment with making a Petrovic LSH that way that I might
> possibly try out at some point.

For the general idea of combining, there is org.apache.accumulo.core.iterators.Combiner in
accumulo-core.  A more unusual example is org.apache.accumulo.examples.simple.filedata.ChunkCombiner
in the simple examples.  While its main function is on-the-fly adjustment of visibility for
file chunks stored in Accumulo, it uses techniques that you will need when you write your
own iterator, such as employing two readers of the underlying data.
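A bare-bones illustration of the reduce-style contract that Combiner subclasses implement (plain Java strings stand in for Accumulo's Key and Value types here; this is a sketch of the pattern, not the actual class):

```java
import java.util.*;

// Sketch of the Combiner pattern: Accumulo hands reduce() an iterator
// over every Value that shares the same row and column, and the return
// value becomes the single combined Value. Here the values are decimal
// strings and the combination is a sum; a centroid combiner would
// instead merge packed (sum, count) pairs.
public class SummingSketch {
    // Analogue of Combiner.reduce(Key key, Iterator<Value> iter)
    public static String reduce(Iterator<String> values) {
        double sum = 0.0;
        while (values.hasNext()) {
            sum += Double.parseDouble(values.next());
        }
        return Double.toString(sum);
    }
}
```

For example, reduce over the values "1.5" and "2.5" yields "4.0".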

Billie


> >
> > Once the centroid and data are written in the same table, another
> > iterator could calculate the distance to centroid for each itemid.
> > The
> > data will have to be resorted to get it arranged by distance, so an
> > efficient way to do that might be to run a scan-time iterator and
> > write the distance data back to an Accumulo table (possibly the same
> > one) using a batch writer or MR output format.
> >
> > Billie
> >
> >
> > ----- Original Message -----
> >> From: "Benson Margulies" <bimargulies@gmail.com>
> >> To: accumulo-user@incubator.apache.org
> >> Sent: Friday, March 2, 2012 3:59:34 PM
> >> Subject: Writing an iterator that calculates on compaction
> >> Folks,
> >>
> >> I am trying to get organized to get my feet wet in using the
> >> ability
> >> of accumulo to compute near the data. I beg your pardon in advance
> >> for
> >> the following exercise in laying out what I have in mind and asking
> >> for some pointers -- particularly to examples on the 1.4 branch of
> >> code that I could warp to achieve my nefarious purposes.
> >>
> >> So, start with this data model:
> >>
> >>
> >> ROWID CF CQ V
> >> itemid 'context' dimension value
> >> itemid something else entirely...
> >>
> >> In short, for an 'item', there's a sparse feature vector associated
> >> with it (identified by cf='context'), and some other things.
> >>
> >> Meanwhile, in another table we have:
> >>
> >> clusterid 'items' itemid1 -blank-
> >> clusterid 'items' itemid2 -blank-
> >>
> >>
> >> In other words, a cluster is a grouping of the items from the first
> >> group, identified by their rowids.
> >>
> >> My initial test of my ability to find my way around a brightly lit
> >> room with a flashlight is to calculate the centroids of these
> >> clusters, and store them as an additional CF:
> >>
> >> CF='centroid' CQ=dimension V=value
> >>
> >> And my second test is to calculate the distance from each item
> >> to
> >> the centroid of its cluster, and store that. Finally, I want to
> >> peruse items in descending order of their distance-from-centroid
> >> values.
> >>
> >> TIA
