mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Latent Semantic Analysis
Date Mon, 04 Jun 2012 17:46:58 GMT
Sorry. The following must read

> the topic. There's an eigenspokes *_paper_* which pretty much is devoted


On Mon, Jun 4, 2012 at 10:44 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> RE: #2: I'd suggest reading the LSA papers (Deerwester's, Dumais's; they
> wrote more than one) to see how they address efficacy analysis of LSA
> there.
> SSVD is nothing but an SVD method; Mahout's SVD accuracy analysis is
> part of Nathan Halko's dissertation (linked to under "Papers" here:
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition).
>
> RE: #1: I am not sure I've read any work actually trying to find
> clusters in LSA outputs, which may just mean I didn't read enough on
> the topic. There's an EigenSpokes paper which is pretty much devoted
> to sphere-projected clusters produced by SVD on social data, but I
> don't think they included LSA output in any of their claims. However,
> you may want to check that paper out. LSA is more about
> recall/precision/semantic distance hints (such as context-based
> polysemy) than about topic clustering. However, *I think,* if
> there are any eigenspoke "clusters" in the LSA output, they had better
> be projected onto the sphere first in order to detect them more clearly
> (see hyperspherical coordinates). I never did the latter, so that's
> just my guess; check out the papers for more info.
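The sphere projection suggested above can be sketched in a few lines. This is a plain-Python illustration of the idea only (normalize each document's row of the SVD output to unit length before running k-means), not Mahout or lsa4solr code; `project_to_sphere` is a hypothetical helper name.

```python
import math

def project_to_sphere(rows):
    # Scale each document vector to unit Euclidean length, i.e. project it
    # onto the unit hypersphere. Ordinary Euclidean k-means on the projected
    # vectors then groups documents by direction (cosine similarity), which
    # is what makes eigenspoke-style "spokes" easier to detect.
    projected = []
    for r in rows:
        norm = math.sqrt(sum(x * x for x in r)) or 1.0
        projected.append([x / norm for x in r])
    return projected

# Two documents pointing the same way but with very different magnitudes
# land on (nearly) the same point of the sphere:
a, b = [1.0, 2.0], [10.0, 20.0]
pa, pb = project_to_sphere([a, b])
```

After projection, only the direction of a document vector matters, which is usually what you want when comparing LSA embeddings.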
>
> -d
>
>
>
> On Mon, Jun 4, 2012 at 12:11 AM, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>> So now LSA works, but the clustering of the two newsgroups is not accurate,
>> based on my subjective observation. I have two questions:
>> 1) Does it make sense to use Canopy before the k-means step to get a better
>> idea of the number of clusters, or can the output from SSVD help in that
>> regard? Currently I pass the number of clusters as an input parameter.
>> 2) What is a good way to assess the accuracy of the result? Is there some
>> data set that is already clustered with certain tuning parameters that I
>> could use to gain some confidence? Using newsgroups of different topics may
>> not be the best input, since we aren't doing regular clustering based on
>> word counts.
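For question 2, one common sanity check against an already-labeled corpus such as 20 Newsgroups is cluster purity. A minimal sketch follows; the labels and cluster assignments below are hypothetical, purely for illustration, and `cluster_purity` is not a Mahout API.

```python
from collections import Counter

def cluster_purity(assignments, true_labels):
    # Fraction of documents sitting in a cluster whose majority label
    # matches their own true label; 1.0 means perfect agreement.
    clusters = {}
    for c, t in zip(assignments, true_labels):
        clusters.setdefault(c, []).append(t)
    majority_hits = sum(Counter(labels).most_common(1)[0][1]
                        for labels in clusters.values())
    return majority_hits / len(assignments)

# Six documents from two newsgroups, clustered into two groups:
purity = cluster_purity([0, 0, 0, 1, 1, 1],
                        ["graphics", "graphics", "atheism",
                         "atheism", "atheism", "atheism"])
```

Purity is a rough measure (it rewards many tiny clusters), so it is best used alongside the subjective inspection already being done.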
>>
>> Thanks
>> Peyman
>>
>> On Fri, Apr 6, 2012 at 1:05 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>>> Ok, cool.
>>>
>>> I think writing MR output into your input folder is not good practice
>>> in the Hadoop world in general, regardless of the job. Glad you got it
>>> resolved.
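The nesting pitfall above is easy to guard against before launching a job. A small sketch in plain Python path logic (not a Hadoop API; the paths are made up to mirror the thread's layout):

```python
from pathlib import PurePosixPath

def nested(a: str, b: str) -> bool:
    # True if one path equals or contains the other.
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents

# Writing SSVD output under its own input directory is the bad case:
bad = nested("/lsa4solr/matrix/123/transpose-1",
             "/lsa4solr/matrix/123/transpose-1/SSVD-out")

# A sibling output directory is safe:
ok = not nested("/lsa4solr/matrix/123/transpose-1",
                "/lsa4solr/ssvd-out")
```

A check like this at job-setup time fails fast instead of letting the job corrupt or re-read its own output.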
>>>
>>> On Fri, Apr 6, 2012 at 9:55 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>> wrote:
>>> > Dmitriy,
>>> >
>>> > I did downgrade my Hadoop and got the same error; however, your last
>>> > suggestion worked: I moved the output path to a completely different
>>> > directory, and this particular problem went away.
>>> >
>>> > Thanks Much,
>>> > Peyman
>>> >
>>> > On Thu, Apr 5, 2012 at 12:38 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> wrote:
>>> >
>>> >> Also, I notice that you are using the output as a subfolder of your
>>> >> input? If so, it is probably going to create some mess; please don't
>>> >> use input and output folders that are nested w.r.t. each other. This
>>> >> is not expected.
>>> >>
>>> >> -d
>>> >>
>>> >> On Thu, Apr 5, 2012 at 12:00 PM, Peyman Mohajerian <mohajeri@gmail.com>
>>> >> wrote:
>>> >> > Ok, great, I'll give these ideas a try later today. The input is the
>>> >> > following line(s), which in my code sample were commented out using
>>> >> > ';' in Clojure.
>>> >> > The first stage, the Q-job, finishes fine; it is the second job that
>>> >> > gets messed up. The output of the Q-job is at
>>> >> > /lsa4solr/matrix/14099700861483/transpose-213/SSVD-out/Q-job, but
>>> >> > BtJob is looking for the input in the wrong place; it must be the
>>> >> > Hadoop version, as you said.
>>> >> >
>>> >> > input path  #<Path
>>> >> > hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120>
>>> >> > dd  #<Path[] [Lorg.apache.hadoop.fs.Path;@5563d208>
>>> >> > numCol  1000
>>> >> > numrow  15982
>>> >> >
>>> >> >
>>> >> > On Thu, Apr 5, 2012 at 11:54 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> wrote:
>>> >> >
>>> >> >> Another idea I have is to try to run it from just the Mahout command
>>> >> >> line and see if it works with .205. If it does, it is definitely
>>> >> >> something about passing parameters in / the client Hadoop classpath /
>>> >> >> etc.
>>> >> >>
>>> >> >> On Thu, Apr 5, 2012 at 11:51 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> wrote:
>>> >> >> > Also, you are printing your input path -- how does it actually look
>>> >> >> > in reality? Because this path that it complains about,
>>> >> >> > SSVDOutput/data, should in fact be the input path. That's what's
>>> >> >> > perplexing.
>>> >> >> >
>>> >> >> > We are talking about the Hadoop job setup process here, nothing
>>> >> >> > specific to the solution itself. And the job setup/directory
>>> >> >> > management fails for some reason.
>>> >> >> >
>>> >> >> > On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> wrote:
>>> >> >> >> Any chance you could test it with its current dependency,
>>> >> >> >> 0.20.204? Or would that be hard to stage?
>>> >> >> >>
>>> >> >> >> A newer Hadoop version is frankly all I can think of here as the
>>> >> >> >> reason for this.
>>> >> >> >>
>>> >> >> >> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>> >> >> wrote:
>>> >> >> >>> Hi Dmitriy,
>>> >> >> >>>
>>> >> >> >>> It is Clojure code from: https://github.com/algoriffic/lsa4solr
>>> >> >> >>> Of course I modified it to use the Mahout 0.6 distribution, also
>>> >> >> >>> running on hadoop-0.20.205.0. Here is the Clojure code that I
>>> >> >> >>> changed; the lines after 'decomposer (doto (.run ssvdSolver))'
>>> >> >> >>> still need modification b/c I'm not reading the eigenvalues and
>>> >> >> >>> eigenvectors from the solver correctly. Originally this code was
>>> >> >> >>> based on Mahout 0.4. I'm creating the Matrix from Solr 3.1.0,
>>> >> >> >>> very similar to what was done on:
>>> >> >> >>> 'https://github.com/algoriffic/lsa4solr'
>>> >> >> >>>
>>> >> >> >>> Thanks,
>>> >> >> >>>
>>> >> >> >>> (defn decompose-svd
>>> >> >> >>>   [mat k]
>>> >> >> >>>   ;(println "input path " (.getRowPath mat))
>>> >> >> >>>   ;(println "dd " (into-array [(.getRowPath mat)]))
>>> >> >> >>>   ;(println "numCol " (.numCols mat))
>>> >> >> >>>   ;(println "numrow " (.numRows mat))
>>> >> >> >>>   (let [eigenvalues  (new java.util.ArrayList)
>>> >> >> >>>         eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>>> >> >> >>>         numCol       (.numCols mat)
>>> >> >> >>>         config       (.getConf mat)
>>> >> >> >>>         rawPath      (.getRowPath mat)
>>> >> >> >>>         outputPath   (Path. (str (.toString rawPath) "/SSVD-out"))
>>> >> >> >>>         inputPath    (into-array [rawPath])
>>> >> >> >>>         ssvdSolver   (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>>> >> >> >>>         decomposer   (doto (.run ssvdSolver))
>>> >> >> >>>         V (normalize-matrix-columns
>>> >> >> >>>             (.viewPart (.transpose eigenvectors)
>>> >> >> >>>                        (int-array [0 0])
>>> >> >> >>>                        (int-array [(.numCols mat) k])))
>>> >> >> >>>         U (mmult mat V)
>>> >> >> >>>         S (diag (take k (reverse eigenvalues)))]
>>> >> >> >>>     {:U U
>>> >> >> >>>      :S S
>>> >> >> >>>      :V V}))
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> wrote:
>>> >> >> >>>
>>> >> >> >>>> Yeah, I don't see how it may have arrived at that error.
>>> >> >> >>>>
>>> >> >> >>>>
>>> >> >> >>>> Peyman,
>>> >> >> >>>>
>>> >> >> >>>> I need to know more -- it looks like you are using the embedded
>>> >> >> >>>> API, not the command line, so I need to see how you initialize
>>> >> >> >>>> the solver and also which version of the Mahout libraries you
>>> >> >> >>>> are using (your stack trace line numbers do not correspond to
>>> >> >> >>>> anything reasonable on the current trunk).
>>> >> >> >>>>
>>> >> >> >>>> thanks.
>>> >> >> >>>>
>>> >> >> >>>> -d
>>> >> >> >>>>
>>> >> >> >>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> >>>> wrote:
>>> >> >> >>>> > Hm, I never saw that and am not sure where this folder comes
>>> >> >> >>>> > from. Which Hadoop version are you using? This may be a result
>>> >> >> >>>> > of incompatible support for multiple outputs in the newer
>>> >> >> >>>> > Hadoop versions. I tested it with CDH3u0/u3 and it was fine.
>>> >> >> >>>> > This folder should normally appear in the conversation; I
>>> >> >> >>>> > suspect it is an internal Hadoop thing.
>>> >> >> >>>> >
>>> >> >> >>>> > This is just from the stack trace, without me actually looking
>>> >> >> >>>> > at the code.
>>> >> >> >>>> >
>>> >> >> >>>> >
>>> >> >> >>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>> >> >> >>>> wrote:
>>> >> >> >>>> >> Hi Guys,
>>> >> >> >>>> >> I'm now using SSVD for my LSA code and get the following
>>> >> >> >>>> >> error. At the time of the error, all I have under the
>>> >> >> >>>> >> 'SSVD-out' folder is:
>>> >> >> >>>> >> Q-job/QHat-m-00000
>>> >> >> >>>> >> Q-job/R-m-00000
>>> >> >> >>>> >> Q-job/_SUCCESS
>>> >> >> >>>> >> Q-job/part-m-00000.deflate
>>> >> >> >>>> >>
>>> >> >> >>>> >> I'm not clear on where the '/data' folder is supposed to be
>>> >> >> >>>> >> set. Is it part of the output of the QJob? I don't see any
>>> >> >> >>>> >> error in the QJob.
>>> >> >> >>>> >>
>>> >> >> >>>> >> Thanks,
>>> >> >> >>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>>> >> >> >>>> >> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>> >> >> >>>> >>    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>> >> >> >>>> >>    at java.security.AccessController.doPrivileged(Native Method)
>>> >> >> >>>> >>    at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >> >> >>>> >>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>> >> >> >>>> >>    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>> >> >> >>>> >>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>> >> >> >>>> >>    at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>> >> >> >>>> >>    at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>> >> >> >>>> >>    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>> >> >> >>>> >>    at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>> >> >> >>>> >>    at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>> >> >> >>>> >>    at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>> >> >> >>>> >>    at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>> >> >> >>>> >>    at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>> >> >> >>>> >>    at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>> >> >> >>>> >>    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>> >> >> >>>> >>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>> >> >> >>>> >>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>> >> >> >>>> >>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>> >> >> >>>> >>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>> >> >> >>>> >>    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>> >> >> >>>> >>    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>> >> >> >>>> >>
>>> >> >> >>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> >>>> wrote:
>>> >> >> >>>> >>
>>> >> >> >>>> >>> For the third time: in the context of LSA, a faster and
>>> >> >> >>>> >>> hence perhaps better alternative to Lanczos is SSVD. Is
>>> >> >> >>>> >>> there any specific reason you want to use the Lanczos solver
>>> >> >> >>>> >>> in the context of LSA?
>>> >> >> >>>> >>>
>>> >> >> >>>> >>> -d
>>> >> >> >>>> >>>
>>> >> >> >>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> > Hi Guys,
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > Per your advice I did upgrade to Mahout 0.6 and made a
>>> >> >> >>>> >>> > bunch of API changes, and in the meantime realized I had a
>>> >> >> >>>> >>> > bug with my input matrix: zero rows were read from Solr
>>> >> >> >>>> >>> > b/c multiple fields in Solr were indexed, not just the one
>>> >> >> >>>> >>> > I was interested in. That issue is fixed, and I have a
>>> >> >> >>>> >>> > matrix with these dimensions: (.numCols mat) 1000,
>>> >> >> >>>> >>> > (.numRows mat) 15932 (or the transpose).
>>> >> >> >>>> >>> > Unfortunately I'm getting the below error now. In the
>>> >> >> >>>> >>> > context of some other Mahout algorithm there was a mention
>>> >> >> >>>> >>> > of '/tmp' vs '/_tmp' causing this issue, but in this
>>> >> >> >>>> >>> > particular case the matrix is in memory!! I'm using this
>>> >> >> >>>> >>> > Google package: guava-r09.jar
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > SEVERE: java.util.NoSuchElementException
>>> >> >> >>>> >>> >        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>> >> >> >>>> >>> >        at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>> >> >> >>>> >>> >        at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>> >> >> >>>> >>> >        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>> >> >> >>>> >>> >        at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > Any suggestion?
>>> >> >> >>>> >>> > Thanks,
>>> >> >> >>>> >>> > Peyman
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> >
>>> >> >> >>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >> Peyman,
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >> Yes, what Ted said. Please take the 0.6 release. Also try
>>> >> >> >>>> >>> >> SSVD; it may benefit you in some regards compared to
>>> >> >> >>>> >>> >> Lanczos.
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >> -d
>>> >> >> >>>> >>> >>
>>> >> >> >>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mohajeri@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >>> Hi Dmitriy & Others,
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> Dmitriy, thanks for your previous response.
>>> >> >> >>>> >>> >>> I have a follow-up question on my LSA project. I have
>>> >> >> >>>> >>> >>> managed to upload 1,500 documents from two different
>>> >> >> >>>> >>> >>> newsgroups (one about graphics and one about Atheism,
>>> >> >> >>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to
>>> >> >> >>>> >>> >>> Solr. However, my LanczosSolver in Mahout 0.4 does not
>>> >> >> >>>> >>> >>> find any eigenvalues (there are eigenvectors, as you can
>>> >> >> >>>> >>> >>> see in the logs below).
>>> >> >> >>>> >>> >>> The only thing I'm doing differently from
>>> >> >> >>>> >>> >>> https://github.com/algoriffic/lsa4solr is that I'm not
>>> >> >> >>>> >>> >>> using the 'Summary' field but rather the actual 'text'
>>> >> >> >>>> >>> >>> field in Solr. I'm assuming the issue is that the
>>> >> >> >>>> >>> >>> Summary field already removes the noise and makes the
>>> >> >> >>>> >>> >>> clustering work, and the raw index data does not do
>>> >> >> >>>> >>> >>> that. Am I correct, or are there other potential
>>> >> >> >>>> >>> >>> explanations? For the desired rank I'm using values
>>> >> >> >>>> >>> >>> between 10 and 100 and looking for between 2 and 10
>>> >> >> >>>> >>> >>> clusters (different values for different trials), but
>>> >> >> >>>> >>> >>> the same result always comes out: no clusters found.
>>> >> >> >>>> >>> >>> If my issue is related to not having summarization done,
>>> >> >> >>>> >>> >>> how can that be done in Solr? I wasn't able to find a
>>> >> >> >>>> >>> >>> Summary field in Solr.
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> Thanks
>>> >> >> >>>> >>> >>> Peyman
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> >> >> >>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >> >> >>>> >>> >>> INFO: LanczosSolver finished.
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>>
>>> >> >> >>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >>>> In Mahout, an LSA pipeline is possible with the
>>> >> >> >>>> >>> >>>> seqdirectory, seq2sparse and ssvd commands. The nuances
>>> >> >> >>>> >>> >>>> are understanding the dictionary format and the LLR
>>> >> >> >>>> >>> >>>> analysis of n-grams, and perhaps using a slightly
>>> >> >> >>>> >>> >>>> better lemmatizer than the default one.
>>> >> >> >>>> >>> >>>>
>>> >> >> >>>> >>> >>>> With the indexing part you are on your own at this
>>> >> >> >>>> >>> >>>> point.
>>> >> >> >>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mohajeri@gmail.com>
>>> >> >> >>>> >>> wrote:
>>> >> >> >>>> >>> >>>>
>>> >> >> >>>> >>> >>>>> Hi Guys,
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> I'm interested in this work:
>>> >> >> >>>> >>> >>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> I looked at some of the comments and noticed that
>>> >> >> >>>> >>> >>>>> there was interest in incorporating it into Mahout,
>>> >> >> >>>> >>> >>>>> back in 2010. I'm also having issues running this code
>>> >> >> >>>> >>> >>>>> due to dependencies on an older version of Mahout.
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> I was wondering if LSA is now directly available in
>>> >> >> >>>> >>> >>>>> Mahout? Also, if I upgrade to the latest Mahout, would
>>> >> >> >>>> >>> >>>>> this Clojure code work?
>>> >> >> >>>> >>> >>>>>
>>> >> >> >>>> >>> >>>>> Thanks
>>> >> >> >>>> >>> >>>>> Peyman
>>> >> >> >>>> >>> >>>>>
