mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: How to SSVD output to generate Clusters
Date Thu, 08 Aug 2013 05:44:47 GMT
@Stuti:

Ok, so I added additional tests (there was a small bug in the test), and
in local tests the keys and the named vector names are both
propagating on all execution paths. So I cannot confirm the problem of
non-propagation (this is on 0.9 trunk).

I did find a problem DB Tsai mentioned to me earlier, which affects the
correctness of the PCA computation if overwrite=true is given (basically, job
cleanup accidentally wipes some vector information passed from the frontend
in this case, and the job does not assert its existence). Again, this is
specific to PCA and only occurs if -ow (cleanup of the target directory) is
requested. I also extended the tests and cleaned up some architectural issues.
I will commit a small fix, but I am still not sure what causes the problems in
Stuti's case.

-d
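
A minimal sketch of the kind of ordering problem described above, not Mahout's actual code. A hypothetical driver stages a side file in the output directory, and an overwrite cleanup that runs afterwards deletes it. The names run_job and side-info.txt, and the file contents, are assumptions made for illustration; Python's standard library stands in for the Hadoop file system.

```python
import os
import shutil
import tempfile


def run_job(out_dir, overwrite, cleanup_first):
    """Hypothetical driver: stage a side file and optionally clean out_dir."""
    side_file = os.path.join(out_dir, "side-info.txt")  # hypothetical name

    def cleanup():
        # Overwrite semantics: remove the whole target directory, then recreate it.
        if overwrite and os.path.isdir(out_dir):
            shutil.rmtree(out_dir)
        os.makedirs(out_dir, exist_ok=True)

    def stage():
        # The "frontend" writes vector information that the job needs later.
        os.makedirs(out_dir, exist_ok=True)
        with open(side_file, "w") as f:
            f.write("0.5 0.25 0.25\n")

    if cleanup_first:
        cleanup()
        stage()
    else:
        stage()
        cleanup()  # with overwrite=True this wipes the staged file

    # Assert that the input exists instead of silently computing without it.
    if not os.path.exists(side_file):
        raise FileNotFoundError(side_file)
    return side_file


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "out")
    print("cleanup first:", run_job(out, overwrite=True, cleanup_first=True))
    try:
        run_job(out, overwrite=True, cleanup_first=False)
    except FileNotFoundError as e:
        print("staged file was wiped:", e)
```

In this sketch the failure only appears when overwrite is requested and cleanup runs after staging. The explicit existence check turns a silently wrong result into a visible error.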




On Wed, Aug 7, 2013 at 2:51 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> yes. I think it got broken after replacing the side loaders
> with the SequenceFileDirValueIterator stuff, which accepts a glob, but for
> some reason a direct file name does not find the file...
>
> (I would assume that if the glob is something like /path/file* then a direct
> name of the form /path/file should also be a valid glob, but it seems not to
> pick up the file).
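
The expectation stated above, that a literal path is just a glob without wildcards, can be illustrated with a small sketch. It uses Python's standard glob module as a stand-in for Hadoop's path matching, so it shows the expected behaviour rather than what the Mahout code does; the helper name resolve and the file names are assumptions.

```python
import glob
import os
import tempfile


def resolve(pattern):
    """Return the sorted paths matching pattern (wildcards are optional)."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    for name in ("file", "file-1"):
        open(os.path.join(d, name), "w").close()
    # A wildcard pattern picks up both files.
    print(resolve(os.path.join(d, "file*")))
    # A direct name is a valid pattern too and matches exactly that file.
    print(resolve(os.path.join(d, "file")))
```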
>
>
>
> On Wed, Aug 7, 2013 at 1:09 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>> also i think I am seeing the issues DB mentioned to me yesterday at
>> sparkML meetup (or something similar).
>>
>>
>> On Wed, Aug 7, 2013 at 1:01 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>>> Thanks, Stuti.
>>>
>>> ok, I think there is indeed something going on with the PCA stuff. It may
>>> require a patch.
>>>
>>> -d
>>>
>>>
>>> On Wed, Aug 7, 2013 at 2:12 AM, Stuti Awasthi <stutiawasthi@hcl.com> wrote:
>>>
>>>> I have not used -q option while running ssvd. Here are the commands
>>>> which I run :
>>>>
>>>> Seq2sparse :
>>>> mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq
>>>> -o /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70
>>>>
>>>> SSVD
>>>> hadoop jar mahout-distribution-0.7/mahout-core-0.8-job.jar
>>>> org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli -i
>>>> /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 90 -U true -V
>>>> true --reduceTasks 1
>>>>
>>>> Thanks
>>>> Stuti Awasthi
>>>>
>>>> -----Original Message-----
>>>> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>> Sent: Wednesday, August 07, 2013 2:14 PM
>>>> To: user@mahout.apache.org
>>>> Subject: RE: How to SSVD output to generate Clusters
>>>>
>>>> Thanks Stuti. Yes, it looks like it is not there. Let me run a test. One
>>>> question: did you use -q 0 or 1 or something else?
>>>> On Aug 7, 2013 12:18 AM, "Stuti Awasthi" <stutiawasthi@hcl.com> wrote:
>>>>
>>>> > Hey Dmitriy,
>>>> >
>>>> > Sorry for replying late.
>>>> > I have re-run the steps to generate U. Here are the output:
>>>> >
>>>> > 1. "seq2sparse" Output: Contains Named Vector
>>>> > Key: /Description_10: Value:
>>>> > /Description_10:{554:1.0,54:1.0,514:1.0,322:1.0,247:1.0,91:1.0,127:1.0,
>>>> > 456:1.0,480:1.0,713:1.0,674:1.0,117:1.0,461:1.0,595:2.0,446:1.0,296:2.0}
>>>> >
>>>> > 2. "ssvd" Output of U: Not Contained NamedVector
>>>> > Key: /Description_10: Value:
>>>> > {0:0.010564205396743468,1:-0.01989403962804719,2:0.015640314765729225,
>>>> > 3:0.04031717183780774,4:-0.03325995251075869,5:0.0294201152018514,
>>>> > 6:0.03834130611889856,7:-0.008686421005328312,8:-0.06164515883823538,
>>>> > 9:0.03752875772953153,10:0.04739786931946798,11:-0.07912744917669134,
>>>> > 12:0.020078421275704143,13:-0.04017504785907734,14:0.012539132502559502,
>>>> > 15:0.0733073647645918,16:-0.02111033727307056,17:0.0799478317610193,
>>>> > 18:-0.08481960414593219,19:-0.06987848875856222,20:0.036932920091059446,
>>>> > 21:-0.06949180571421532,22:-0.03447267994522256,23:-0.07104196347181493,
>>>> > 24:0.022262180555421562,25:-0.0485632586340187,26:-0.05380823388650383,
>>>> > 27:0.09299533887785207,28:0.0019344239524856396,29:0.002936116541403362,
>>>> > 30:-0.07249587007236825,31:0.0016026176038041033,32:-0.0711115256224166,
>>>> > 33:0.06603931206284432,34:0.019226806201249697,35:0.13972781245330326,
>>>> > 36:0.0787696939450401,37:0.07065356340476747,38:0.08437107545490818,
>>>> > 39:0.06381670380272558,40:0.046405964753673735,41:0.0601332388594578,
>>>> > 42:-0.12996454299711707,43:0.10779361589915878,44:-0.06524702754474347,
>>>> > 45:0.014785171162887613,46:-0.036630574690084586,47:-0.15066656149902793,
>>>> > 48:0.16190482591405958,49:-0.00869851116149916}
>>>> >
>>>> > As you said, keys should propagate to keys and/or names to names.
>>>> > Any idea what's going wrong, or whether there is a mistake on my side?
>>>> >
>>>> > Thanks
>>>> > Stuti Awasthi
>>>> >
>>>> > -----Original Message-----
>>>> > From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
>>>> > Sent: Friday, August 02, 2013 11:39 PM
>>>> > To: user@mahout.apache.org
>>>> > Subject: Re: How to SSVD output to generate Clusters
>>>> >
>>>> > By eyeballing the code, I don't think I see a problem. If rows of A
>>>> > are named values, then rows of U (or U*Sigma or U*Sigma^1/2) would also
>>>> > retain the names from the values of the rows of A. The output would not
>>>> > contain NamedVector values for rows that were not NamedVector values
>>>> > themselves in the input. Does seq2sparse output create NamedVectors
>>>> > as values?
>>>> >
>>>> > Note that if what you want is to have *keys* from seq2sparse
>>>> > (such as 'file_1') propagated to the name of the named vector value in U,
>>>> > that much is not happening. The algorithm propagates keys to keys and/or
>>>> > names to names (but not any other combinations of those).
>>>> >
>>>> > -d
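
The propagation rule described in this message (keys to keys, names to names, and no crossing over) can be sketched as follows. This is an illustrative model only, not the SSVD implementation: the Row tuple, transform_rows, and the halving function are assumptions standing in for keyed rows of A, the job, and the projection onto U.

```python
from typing import Callable, List, Optional, Tuple

# (key, optional vector name, values): a simplified stand-in for a keyed row.
Row = Tuple[str, Optional[str], List[float]]


def transform_rows(rows: List[Row],
                   f: Callable[[List[float]], List[float]]) -> List[Row]:
    """Apply f to each row's values; keys map to keys and names map to names."""
    return [(key, name, f(values)) for key, name, values in rows]


if __name__ == "__main__":
    rows = [
        ("/File_1", "/File_1", [1.0, 2.0]),  # value carries a name
        ("/File_2", None, [3.0, 4.0]),       # plain, unnamed value
    ]
    for row in transform_rows(rows, lambda v: [x * 0.5 for x in v]):
        print(row)
```

Under this rule an unnamed input row stays unnamed in the output even though it has a key, which is the case the message says is not handled automatically.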
>>>> >
>>>> >
>>>> > On Fri, Aug 2, 2013 at 10:42 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > it should. i worked on the issue and last time it was checked it was
>>>> > > still working with name propagation. if not, then it is a bug
>>>> > >
>>>> > >
>>>> > > On Fri, Aug 2, 2013 at 3:33 AM, Stuti Awasthi <stutiawasthi@hcl.com> wrote:
>>>> > >
>>>> > >> Hey Ted,
>>>> > >>
>>>> > >> As suggested, I tried SSVD with Mahout 0.8. I think the issue of
>>>> > >> NamedVector not propagating to the output U still persists.
>>>> > >> Here is what I have done:
>>>> > >>
>>>> > >> 1. Created featureVector using "seq2sparse" with the -nv option.
>>>> > >> Checked the output; named vectors were created.
>>>> > >> 2. Provided this featureVector to "ssvd" with params "-k 100 -U
>>>> > >> true -V true". After execution, 3 outputs were generated, namely
>>>> > >> sigma, U, V.
>>>> > >> 3. I dumped "U" to check whether the output contained NamedVectors:
>>>> > >>
>>>> > >> mahout-distribution-0.7/bin/mahout seqdumper -i /stuti/SSVD/Output/U | more
>>>> > >> Output:
>>>> > >> Key: /File_1: Value:
>>>> > >> {0:0.027019746696983288,1:0.006124424321845726,2:0.0334311500858222,.....}
>>>> > >>
>>>> > >> I did not see the NamedVector getting created in the output of ssvd.
>>>> > >> Please point out if I have missed any step in between.
>>>> > >>
>>>> > >> As I wanted to perform the clustering, I took the output of "U" and
>>>> > >> generated the NamedVector with custom code. The output looks like this:
>>>> > >> Key: /File_1: Value:
>>>> > >> /File_1:{0:0.027019746696983288,1:0.006124424321845726,2:0.0334311500858222,.....}
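
The custom code mentioned here is not shown in the thread. As a rough text-level illustration of the effect it has on the dumped output, the sketch below copies each row's key into the value's name. It works on seqdumper-style lines joined onto one line, not on SequenceFiles, and the function name name_by_key and the sample values are assumptions.

```python
import re

# Matches an unnamed dump line such as: Key: /File_1: Value: {0:0.5,1:0.25}
LINE = re.compile(r"^Key: (?P<key>\S+): Value: (?P<value>\{.*\})$")


def name_by_key(line):
    """Prefix the vector with its key so that it reads like a named vector."""
    m = LINE.match(line.strip())
    if m is None:
        return line  # already named, or not a recognised dump line
    return "Key: {0}: Value: {0}:{1}".format(m.group("key"), m.group("value"))


if __name__ == "__main__":
    print(name_by_key("Key: /File_1: Value: {0:0.5,1:0.25}"))
```

Lines whose value already starts with a name are returned unchanged, so applying the function twice has no further effect.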
>>>> > >>
>>>> > >> Then I fed this namedvector file to KMeans to generate 10 Clusters.
>>>> > >> In this I have used Random Centroid selection with KMeans.
>>>> > >> Finally I dumped the ClusterOutput as :
>>>> > >> <ClusterId>,<DocumentID1>,<DocumentId2>.....
>>>> > >>
>>>> > >> Please let me know if I have made any mistake in the end-to-end
>>>> > >> execution. Also, I am not sure why the SSVD output is not generating
>>>> > >> the named vectors, as the issue is fixed.
>>>> > >>
>>>> > >> Please suggest
>>>> > >>
>>>> > >> Regards
>>>> > >> Stuti Awasthi
>>>> > >>
>>>> > >>
>>>> > >>
>>>> > >>
>>>> > >>
>>>> > >> -----Original Message-----
>>>> > >> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>>> > >> Sent: Thursday, August 01, 2013 8:37 PM
>>>> > >> To: user@mahout.apache.org
>>>> > >> Subject: Re: How to SSVD output to generate Clusters
>>>> > >>
>>>> > >> On Thu, Aug 1, 2013 at 5:49 AM, Stuti Awasthi
>>>> > >> <stutiawasthi@hcl.com>
>>>> > >> wrote:
>>>> > >>
>>>> > >> > I think there is a problem because of NamedVector, as after some
>>>> > >> > searching I found this Jira.
>>>> > >> > https://issues.apache.org/jira/browse/MAHOUT-1067
>>>> > >> >
>>>> > >>
>>>> > >> Note also that this bug is fixed in 0.8
>>>> > >>
>>>> > >>
>>>> > >>
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>
