mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Classification beginner questions
Date Thu, 16 Jun 2011 00:41:41 GMT
Use a crypto-hash on the base data as the sorting key. The base data
is the value (payload). That should randomly permute things.
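A minimal sketch of that idea in plain Java (not Mahout-specific; SHA-256 and the record format here are assumptions, any well-mixing hash of the payload would do):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HashShuffle {

    // Hex SHA-256 digest of a record, used as a deterministic
    // pseudo-random sort key. (SHA-256 is an arbitrary choice here.)
    static String sortKey(String record) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(record.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Sorting by the digest permutes the records into an order that is
    // unrelated to the original (e.g. sorted-by-category) order; in a
    // Map/Reduce job the shuffle phase would do this sort for you once
    // you emit the digest as the key and the record as the value.
    static List<String> hashShuffle(List<String> records) {
        List<String> out = new ArrayList<>(records);
        out.sort(Comparator.comparing(HashShuffle::sortKey));
        return out;
    }

    public static void main(String[] args) {
        List<String> byCategory = List.of("cat/1", "cat/2", "cat/3",
                                          "dog/1", "dog/2", "dog/3");
        System.out.println(hashShuffle(byCategory));
    }
}
```

The permutation is stable across runs (same input, same order), which is handy when you need reproducible training runs.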

On Wed, Jun 15, 2011 at 2:50 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> It is already in Mahout, I think.
>
> On Tue, Jun 14, 2011 at 5:48 AM, Lance Norskog <goksron@gmail.com> wrote:
>
>> Coding a permutation like this in Map/Reduce is a good beginner exercise.
>>
>> On Sun, Jun 12, 2011 at 11:34 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > But the key is that you have to have both kinds of samples.  Moreover,
>> > for all of the stochastic gradient descent work, you need to have them
>> > in a random-ish order.  You can't show all of one category and then
>> > all of another.  It is even worse if you sort your data.
>> >
>> > On Mon, Jun 13, 2011 at 5:35 AM, Hector Yee <hector.yee@gmail.com>
>> wrote:
>> >> If you have a much larger background set you can try online passive
>> >> aggressive in Mahout 0.6, as it uses hinge loss and does not update
>> >> the model if it gets things correct. Log loss, in contrast, always
>> >> has a gradient.
>> >> On Jun 12, 2011 7:54 AM, "Joscha Feth" <joscha@feth.com> wrote:
>> >>> Hi Ted,
>> >>>
>> >>> I see. Only for the OLR or also for any other algorithm? What if my
>> >>> other category theoretically contains an infinite number of samples?
>> >>>
>> >>> Cheers,
>> >>> Joscha
>> >>>
>> >>> On 12.06.2011 at 15:08, Ted Dunning <ted.dunning@gmail.com> wrote:
>> >>>
>> >>>> Joscha,
>> >>>>
>> >>>> There is no implicit training. You need to give negative examples
>> >>>> as well as positive ones.
>> >>>>
>> >>>>
>> >>>> On Sat, Jun 11, 2011 at 9:08 AM, Joscha Feth <joscha@feth.com> wrote:
>> >>>>> Hello Ted,
>> >>>>>
>> >>>>> thanks for your response!
>> >>>>> What I wanted to accomplish is actually quite simple in theory: I
>> >>>>> have some sentences which have things in common (like some similar
>> >>>>> words, for example). I want to train my model with these example
>> >>>>> sentences. Once it is trained, I want to give an unknown sentence
>> >>>>> to my classifier and would like to get back a percentage to which
>> >>>>> the unknown sentence is similar to the sentences I trained my
>> >>>>> model with. So basically I have two categories (sentence is
>> >>>>> similar and sentence is not similar). To my understanding it only
>> >>>>> makes sense to train my model with the positives (e.g. the sample
>> >>>>> sentences) and put them all into the same category (I chose
>> >>>>> category 0, because the .classifyScalar() method seems to return
>> >>>>> the probability for the first category, e.g. category 0). All
>> >>>>> other sentences are implicitly (but not trained) in the second
>> >>>>> category (category 1).
>> >>>>>
>> >>>>> Does that make sense or am I completely off here?
>> >>>>>
>> >>>>> Kind regards,
>> >>>>> Joscha Feth
>> >>>>>
>> >>>>> On Sat, Jun 11, 2011 at 03:46, Ted Dunning <ted.dunning@gmail.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> The target variable here is always zero.
>> >>>>>>
>> >>>>>> Shouldn't it vary?
>> >>>>>>
>> >>>>>> On Fri, Jun 10, 2011 at 9:54 AM, Joscha Feth <joscha@feth.com>
>> >>>>>> wrote:
>> >>>>>>> algorithm.train(0, generateVector(animal));
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
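Ted's point about needing both kinds of samples is easy to see with a toy stand-in for an online logistic regression. The sketch below is plain Java, not Mahout's OnlineLogisticRegression; it only mimics the train(target, features)/classifyScalar(features) shape from the snippet above, and the feature vectors are made-up. If every train() call passes target 0, the learned probability collapses toward 0 for everything, including the positives; interleaving targets 0 and 1 is what actually separates the two categories.

```java
public class TinyOlr {
    final double[] w;   // feature weights
    double bias;
    final double rate;  // SGD learning rate

    TinyOlr(int numFeatures, double rate) {
        this.w = new double[numFeatures];
        this.rate = rate;
    }

    // Estimated probability that the instance belongs to category 1.
    double classifyScalar(double[] x) {
        double s = bias;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-s));
    }

    // One stochastic gradient step on log loss; target is 0 or 1.
    void train(int target, double[] x) {
        double err = target - classifyScalar(x);
        bias += rate * err;
        for (int i = 0; i < w.length; i++) w[i] += rate * err * x[i];
    }

    public static void main(String[] args) {
        double[] positive = {1, 0};  // toy "similar sentence" features
        double[] negative = {0, 1};  // toy "other sentence" features

        // Target is always 0: the model just learns "everything is category 0".
        TinyOlr zerosOnly = new TinyOlr(2, 0.5);
        for (int i = 0; i < 500; i++) zerosOnly.train(0, positive);

        // Interleaved targets: the model separates the two categories.
        TinyOlr both = new TinyOlr(2, 0.5);
        for (int i = 0; i < 500; i++) {
            both.train(1, positive);
            both.train(0, negative);
        }

        System.out.printf("zeros-only p(positive) = %.3f%n",
                          zerosOnly.classifyScalar(positive));
        System.out.printf("both       p(positive) = %.3f%n",
                          both.classifyScalar(positive));
        System.out.printf("both       p(negative) = %.3f%n",
                          both.classifyScalar(negative));
    }
}
```

The strict alternation in the second loop also illustrates the ordering point: showing all of one category and then all of the other would let each phase drag the weights in one direction before the other category gets a say.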



-- 
Lance Norskog
goksron@gmail.com
