mahout-user mailing list archives

From "Scott C. Cote" <>
Subject Re: streaming kmeans vs incremental canopy/solr/kmeans
Date Fri, 14 Feb 2014 18:50:53 GMT
Right now I'm dealing with only 40,000 documents, but we will eventually
grow by more than 10x (putting on the manager hat, say 1 million docs).
A doc is usually no longer than 20 or 30 words.


On 2/14/14 12:46 PM, "Ted Dunning" <> wrote:

>How much data do you have?
>How much do you plan to have?
>On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <> wrote:
>> Hello All,
>> I have two questions (Q1, Q2).
>> Q1: Am digging in to Text Analysis and am wrestling with competing
>> data maintenance strategies.
>> NOTE: my text comes from a very narrowly focused source.
>> - Am currently crunching the data (batch) using the following scheme:
>> 1. Load source text as rows in a mysql database.
>> 2. Create named TFIDF vectors using a custom analyzer from source text
>> (-stopwords, lowercase, std filter, ...)
>> 3. Perform Canopy Cluster and then Kmeans Cluster using an enhanced
>> metric (derived from a custom metric found in MiA)
>> 4. Load references of Clusters into SOLR (core1) - cluster id, top terms
>> along with full cluster data into Mongo (a cluster is a doc)
>> 5. Then load source text into SOLR (core2) using same custom analyzer with
>> appropriate boost along with the reference cluster id
>> NOTE: in all cases, the id of the source text is preserved throughout the
>> flow in the vector naming process, etc.
>> So now I have a mysql table, two SOLR cores, and a Mongo Document
>> Collection (all tied together with text id as the common name)
>> - Now when a new document enters the system after "batch" has been
>> performed, I use core2 to test the top SOLR matches (custom analyzer
>> normalizes the new doc) to find the best cluster within a tolerance.  If a
>> cluster is found, then I place the text in that cluster - if not, then I
>> start a new group (my word for a cluster not generated via kmeans).  Either
>> way, the doc makes its way into both (core1 and core2). I keep track of the
>> number of group creations/document placements so that if a threshold is
>> crossed, then I can re-batch the data.
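
The incremental placement described above can be sketched roughly as
follows. This is only a minimal illustration, not the poster's code: it
swaps the SOLR top-match lookup for a plain cosine similarity against
in-memory centroids, and the names IncrementalAssigner, tolerance and
rebatch_threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalAssigner:
    """Places new docs in the best existing cluster, or starts a new group."""

    def __init__(self, centroids, tolerance=0.3, rebatch_threshold=100):
        self.centroids = dict(centroids)  # cluster id -> sparse centroid
        self.tolerance = tolerance        # minimum similarity to accept a match
        self.rebatch_threshold = rebatch_threshold
        self.changes = 0                  # placements + group creations
        self.next_group = 0

    def assign(self, vector):
        best_id, best_sim = None, 0.0
        for cluster_id, centroid in self.centroids.items():
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
        if best_id is None or best_sim < self.tolerance:
            # No cluster within tolerance: start a new group seeded by the doc.
            best_id = f"group-{self.next_group}"
            self.next_group += 1
            self.centroids[best_id] = dict(vector)
        self.changes += 1
        return best_id

    def needs_rebatch(self):
        """True once enough changes have piled up to justify a re-batch."""
        return self.changes >= self.rebatch_threshold
```

Existing centroids are deliberately left untouched on placement, so cluster
quality drifts until the next batch run; that drift is what the change
counter is meant to bound.
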
>> MiA (I think ch 11) suggests that a user could run the canopy
>> routine to assign new entries to the clusters (instead of what I am doing).
>> Does he mean to regenerate a new dictionary, frequencies, etc. for the
>> corpus for every inbound document?  My observations have been that this has
>> been a very speedy process, but I'm hoping that I'm just too much of a novice
>> and haven't thought of a way to simply update the dictionary/frequencies
>> (this process also calls for the eventual rebatching of the clusters).
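
One way to update the dictionary/frequencies without regenerating them for
every inbound document is to keep running document-frequency counts. The
sketch below is an assumption-laden illustration, not Mahout's
implementation: the class IncrementalTfidf is hypothetical, and the
smoothed idf formula is a common choice that may differ from the weighting
Mahout applies.

```python
import math
from collections import Counter


class IncrementalTfidf:
    """Keeps running document frequencies so new docs can be vectorized."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()  # term -> number of docs containing it

    def add_document(self, tokens):
        """Fold one analyzed document into the corpus statistics."""
        self.doc_count += 1
        self.doc_freq.update(set(tokens))  # count each term once per doc

    def vectorize(self, tokens):
        """Build a sparse TF-IDF vector from the current statistics."""
        term_freq = Counter(tokens)
        vector = {}
        for term, count in term_freq.items():
            df = self.doc_freq.get(term, 0)
            # Smoothed idf (assumed formula): log((N + 1) / (df + 1)) + 1
            idf = math.log((self.doc_count + 1) / (df + 1)) + 1.0
            vector[term] = count * idf
        return vector
```

Vectors computed earlier are not rewritten when the counts change, so their
weights slowly go stale; an eventual re-batch is still needed with this
approach.
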
>> While I was very early in my "implement what I have read" process,
>> and Ted recommended that I examine the Streaming Kmeans process.  Would
>> that process sidestep much of what I'm doing?
>> Q2: I need to really understand the lexicon of my corpus.  How do I see a
>> list of terms that have been omitted due either to being in too many
>> documents or to not being in enough documents for consideration?
>> Please know that I know that I can look at the dictionary to see what terms
>> are covered.  And since my custom analyzer is using the
>> StandardAnalyzer.stop words, those are obvious also.  If there isn't an
>> option to emit the omitted words, where would be the natural place to
>> capture that data and save it into yet another data store (Sequence
>> file, etc.)?
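
If no option emits the omitted words, one fallback is to recompute document
frequencies over the analyzed documents and split the terms yourself. The
following is a minimal sketch under that assumption; the function name and
the min_df / max_df_percent parameters are illustrative and are not Mahout
options.

```python
from collections import Counter


def split_by_document_frequency(docs, min_df=2, max_df_percent=90):
    """Split terms into kept, too-rare and too-common by document frequency.

    docs is a list of token lists, already run through the analyzer.
    Returns three dicts mapping term -> document frequency.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    kept, too_rare, too_common = {}, {}, {}
    for term, df in doc_freq.items():
        if df < min_df:
            too_rare[term] = df        # in too few documents
        elif 100.0 * df / n_docs > max_df_percent:
            too_common[term] = df      # in too many documents
        else:
            kept[term] = df
    return kept, too_rare, too_common
```

The two "omitted" dicts could then be written to whatever store is handy
(a sequence file, a Mongo collection, etc.) alongside the dictionary.
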
>> Thanks in Advance for the Guidance,
>> Scott
