mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: Updating a classifier model on the fly
Date Wed, 07 Mar 2012 08:39:33 GMT
You can look into ClusterIterator. It requires prior information but is 
able to train on the fly.
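As an illustration of on-the-fly training, here is a minimal, self-contained sketch of the kind of incremental (stochastic-gradient) update performed by Mahout's SGD classifiers such as OnlineLogisticRegression: the model absorbs one labeled example at a time instead of being rebuilt from scratch. The class and method names below are illustrative, not Mahout's actual API.

```java
// Minimal sketch of online (incremental) binary classification, in the
// spirit of Mahout's SGD classifiers. All names here are illustrative.
public class OnlineBinaryClassifier {
    private final double[] weights;
    private final double learningRate;

    public OnlineBinaryClassifier(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Probability that the instance belongs to the positive class (e.g. "spam").
    public double classify(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One stochastic-gradient step on a single labeled example; this is
    // what makes the model trainable "on the fly" with no full rebuild.
    public void train(int label, double[] features) {
        double error = label - classify(features);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
    }

    public static void main(String[] args) {
        OnlineBinaryClassifier model = new OnlineBinaryClassifier(2, 0.5);
        double[] spam = {1.0, 0.0};
        double[] ham  = {0.0, 1.0};
        for (int i = 0; i < 100; i++) {   // feed feedback one example at a time
            model.train(1, spam);
            model.train(0, ham);
        }
        System.out.println(model.classify(spam));  // close to 1.0
        System.out.println(model.classify(ham));   // close to 0.0
    }
}
```

Calling train() each time a user confirms "spam"/"not spam" gives the kind of feedback loop discussed in the thread below, at the cost of a weaker model than a full batch rebuild.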

On 06-03-2012 22:14, Temese Szalai wrote:
> One other thing to consider (and I don't know if Mahout supports this,
> because I am very new to Mahout, although very experienced with text
> classification specifically) is that I have seen unsupervised or
> semi-supervised learning approaches work for an "on the fly"
> re-computation of a model. This can be particularly helpful for data
> bootstrapping, i.e., cases where you have a small initial set of data
> and want to put some kind of filter and feedback loop in place to build
> a curated data set.
>
> This is different from classification, though, where you have a labeled
> data set and train the classifier to identify things that look like that
> data set.
>
> On the one or two occasions I've seen unsupervised or semi-supervised
> learning applied to create a "model", I've seen it work OK when there is
> only one category. So, if you are building a classic binary classifier,
> only care about one category, and your system will work just fine with
> that (i.e., "is this spam? y/n?"), this might be worth looking into if
> your use cases and business needs really demand something on the fly and
> can tolerate lower precision and recall while the system learns.
>
> I don't know if this is useful to you at all.
>
> Temese
>
> On Tue, Mar 6, 2012 at 8:32 AM, Boris Fersing <boris@fersing.eu> wrote:
>
>> Thanks Charles, I'll have a look at it.
>>
>> cheers,
>> Boris
>>
>> On Tue, Mar 6, 2012 at 11:25, Charles Earl <charlescearl@me.com> wrote:
>>> Boris,
>>> Have you looked at online decision trees and the like?
>>> http://www.cs.washington.edu/homes/pedrod/papers/kdd01b.pdf
>>> I think ultimately the concept boils down to Temese's observation of
>>> there being some measure (in the paper's case, concept drift)
>>> that triggers re-training of the entire set.
>>> C
>>> On Mar 6, 2012, at 11:17 AM, Boris Fersing wrote:
>>>
>>>> Hi Temese,
>>>>
>>>> thank you very much for this information.
>>>>
>>>> Boris
>>>>
>>>> On Tue, Mar 6, 2012 at 11:14, Temese Szalai <temeseszalai@gmail.com> wrote:
>>>>> Hi Boris -
>>>>>
>>>>> Unless Mahout has super-powers that I am not aware of, years of
>>>>> experience in text classification tell me that - yes, you will have
>>>>> to rebuild the classifier model regularly as new labeled data
>>>>> becomes available.
>>>>>
>>>>> If you are building a system that incorporates a user feedback loop,
>>>>> as it sounds like you are (i.e., "yes, this message is spam"), one
>>>>> thing that might reduce the amount of classifier re-training would
>>>>> be to verify that the new incoming labeled document is not already
>>>>> in your data set, i.e., not a dupe. Additionally, you probably want
>>>>> to wait to retrain until you have some critical mass of newly
>>>>> labeled documents, or until you have a critical data point to
>>>>> include.
>>>>>
>>>>> If someone has the ability to say "no, this is not spam", keeping
>>>>> that as labeled data to add to your anti-content/negative content
>>>>> set would be valuable.
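[The dupe check and critical-mass policy described above could be sketched as follows. This is a hypothetical helper, not part of Mahout; all names are made up for illustration, and the actual retraining call depends on your stack.]

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical gate for a feedback loop: ignore duplicate labeled
// documents, and only signal a full retrain once a critical mass of
// genuinely new examples has accumulated.
public class RetrainGate {
    private final Set<String> seenDocHashes = new HashSet<>();
    private final int criticalMass;
    private int pendingExamples = 0;

    public RetrainGate(int criticalMass) {
        this.criticalMass = criticalMass;
    }

    // Returns true when enough new labeled docs have arrived that a
    // full classifier rebuild is worth the cost.
    public boolean offer(String docHash) {
        if (!seenDocHashes.add(docHash)) {
            return false;              // duplicate: ignore it entirely
        }
        pendingExamples++;
        if (pendingExamples >= criticalMass) {
            pendingExamples = 0;       // caller retrains now
            return true;
        }
        return false;
    }
}
```

A "critical data point" (e.g., a confirmed false positive) could simply bypass the gate and trigger a retrain directly.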
>>>>> Best,
>>>>> Temese
>>>>>
>>>>> On Tue, Mar 6, 2012 at 7:48 AM, Boris Fersing <boris@fersing.eu> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> is there a way to update a classifier model on the fly? Or do I need
>>>>>> to recompute everything each time I add a document to a category in
>>>>>> the training set?
>>>>>>
>>>>>> I would like to build something similar to some spam filters, where
>>>>>> you can confirm that a message is spam or not, and thus train the
>>>>>> classifier.
>>>>>>
>>>>>> regards,
>>>>>> Boris
>>>>>> --
>>>>>> 42
>>>>>>
>>>>
>>>>
>>>> --
>>>> 42
>>
>>
>> --
>> 42
>>

