mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: RowSimilarity
Date Sat, 14 Jul 2012 18:16:19 GMT
Solr would do this well.  The upcoming knn package would do it differently
and for different purposes, but also would do it well.

On Sat, Jul 14, 2012 at 8:17 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Interesting.
>
> I have another requirement, which is to do something like real-time
> vector-based queries. Imagine taking a doc vector, reweighting some terms,
> then doing a query with it, perhaps in a truncated form. There are several
> ways to do this but only Solr would offer real-time results afaik. It
> looks like I could use your approach below to do this. A quick look at
> eDisMax however suggests some problems. The use of pf2 and pf3 would jam
> the query vector into synthesized bigrams and trigrams, for instance.
>
> I'd be interested in hearing more about how you use it. Is there a better
> venue than the mahout list?
>
> On 7/13/12 9:41 PM, Ken Krugler wrote:
>
>> Hi Pat,
>>
>> On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:
>>
>>> I also do clustering, so that's an obvious optimization I just haven't
>>> gotten to yet (doing similarity only on docs clustered together). I'm also
>>> trying to decide how to downsample. However, the results from similarity are
>>> quite good, so understanding how to scale is #1.
>>>
>>> Clustering gives docs closest to a centroid. RowSimilarity finds docs
>>> similar to each doc.
>>>
>>> What I really need is to calculate the k most similar docs to a short
>>> list, known ahead of time. I don't know of an algorithm to do this (other
>>> than brute force). It would take a relatively small set of docs and find
>>> similar docs in a much, much larger set. RowSimilarity finds all pair-wise
>>> similarities. Strictly speaking, I need only a tiny number of those.
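>>>
>>> The brute-force version over a short query list might actually be fine.
>>> A rough, untested sketch of what I mean (hypothetical helper, assuming
>>> Mahout's Vector for the doc vectors and plain cosine similarity):
>>>
>>> import java.util.*;
>>> import org.apache.mahout.math.Vector;
>>>
>>> /** Hypothetical sketch: rank a large candidate set against one query doc
>>>  *  by cosine similarity and keep only the top k. */
>>> public class TopKSimilarDocs {
>>>
>>>   private static final Comparator<Map.Entry<Integer, Double>> BY_SIM =
>>>       new Comparator<Map.Entry<Integer, Double>>() {
>>>         public int compare(Map.Entry<Integer, Double> a, Map.Entry<Integer, Double> b) {
>>>           return Double.compare(a.getValue(), b.getValue());
>>>         }
>>>       };
>>>
>>>   public static List<Map.Entry<Integer, Double>> topK(
>>>       Vector query, Map<Integer, Vector> candidates, int k) {
>>>     // min-heap on similarity, so the weakest of the current top k is evicted first
>>>     PriorityQueue<Map.Entry<Integer, Double>> heap =
>>>         new PriorityQueue<Map.Entry<Integer, Double>>(k, BY_SIM);
>>>     double queryNorm = query.norm(2);
>>>     for (Map.Entry<Integer, Vector> c : candidates.entrySet()) {
>>>       double sim = query.dot(c.getValue()) / (queryNorm * c.getValue().norm(2));
>>>       heap.offer(new AbstractMap.SimpleEntry<Integer, Double>(c.getKey(), sim));
>>>       if (heap.size() > k) {
>>>         heap.poll(); // drop the weakest candidate seen so far
>>>       }
>>>     }
>>>     List<Map.Entry<Integer, Double>> result =
>>>         new ArrayList<Map.Entry<Integer, Double>>(heap);
>>>     Collections.sort(result, Collections.reverseOrder(BY_SIM)); // best first
>>>     return result;
>>>   }
>>> }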
>>>
>>> I think Lucene has a weighted vector-based search that I need to
>>> investigate further.
>>>
>> As one point of reference, I've used Solr (Lucene) to do this, by taking
>> the set of features (small, heavily reduced) from the target doc, using
>> them (with weights) via edismax to find some top N candidate documents in
>> the Lucene index which I'd built using the same approach (small set of
>> features), and then calculating pair-wise similarity to rank the results.
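>>
>> Roughly, the query side looks something like this untested SolrJ sketch
>> (the "text" field, the boosts, and the Solr URL are placeholders, not the
>> exact setup I run):
>>
>> import java.util.Map;
>> import org.apache.solr.client.solrj.SolrQuery;
>> import org.apache.solr.client.solrj.SolrServer;
>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>> import org.apache.solr.common.SolrDocumentList;
>>
>> public class EdismaxCandidateQuery {
>>   /** Turn a small, weighted feature set into a boosted edismax query and
>>    *  fetch the top N candidates for later pair-wise re-ranking. */
>>   public static SolrDocumentList findCandidates(
>>       Map<String, Double> weightedFeatures, int topN) throws Exception {
>>     // build "feature1^w1 feature2^w2 ..." from the reduced feature set
>>     StringBuilder q = new StringBuilder();
>>     for (Map.Entry<String, Double> f : weightedFeatures.entrySet()) {
>>       q.append(f.getKey()).append('^').append(f.getValue()).append(' ');
>>     }
>>     SolrServer server = new HttpSolrServer("http://localhost:8983/solr"); // placeholder
>>     SolrQuery query = new SolrQuery();
>>     query.set("defType", "edismax");
>>     query.set("qf", "text");     // the field the reduced features were indexed into
>>     query.setQuery(q.toString().trim());
>>     query.setRows(topN);         // only need the top N candidates
>>     return server.query(query).getResults();
>>   }
>> }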
>>
>> -- Ken
>>
>>> On 7/13/12 9:32 AM, Sebastian Schelter wrote:
>>>
>>>> Pat,
>>>>
>>>> RowSimilarityJob compares all pairs of rows, which is by definition a
>>>> quadratic and therefore non-scalable problem. The comparison is however
>>>> done in a way that only rows that have at least one non-zero value in a
>>>> common dimension are compared.
>>>>
>>>> Therefore, if you have a sufficiently sparse type of input, such as
>>>> ratings, you only have to look at a relatively small number of pairs and
>>>> the comparison scales.
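>>>>
>>>> To make the cost concrete: conceptually, cooccurrences are generated per
>>>> dimension, so a single dimension (e.g. a very frequent term) occurring in
>>>> n rows contributes on the order of n^2 pairs. A toy illustration of that
>>>> counting (not the actual Mahout code):
>>>>
>>>> import java.util.List;
>>>> import java.util.Map;
>>>>
>>>> /** Toy illustration only: pairs are formed between rows that share a
>>>>  *  non-zero entry in the same column, so a column occurring in n rows
>>>>  *  costs n*(n-1)/2 cooccurrences. */
>>>> public class CooccurrenceCost {
>>>>   public static long countCooccurrences(Map<Integer, List<Integer>> rowsPerColumn) {
>>>>     long pairs = 0;
>>>>     for (List<Integer> rows : rowsPerColumn.values()) {
>>>>       long n = rows.size();
>>>>       pairs += n * (n - 1) / 2; // every pair of rows sharing this column
>>>>     }
>>>>     return pairs; // dominated by the most frequent columns
>>>>   }
>>>> }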
>>>>
>>>> RowSimilarityJob is mainly used for the collaborative filtering stuff in
>>>> Mahout. We have a special job to prepare the data
>>>> (PreparePreferenceMatrixJob) that takes care of sampling down entries in
>>>> the rating matrix that would cause too many cooccurrences.
>>>>
>>>> If you use RowSimilarityJob directly, you have to ensure that your input
>>>> data is of a shape suitable for the job. It seems to me that this is not
>>>> the case: you created 76GB of intermediate output (cooccurring terms)
>>>> from 35k documents, so it's clear that it takes Hadoop a long time to
>>>> sort that in the shuffle phase.
>>>>
>>>> My advice would be that you either take a deeper look at your data and
>>>> try to downsample highly frequent terms more, or that you take a look at
>>>> other techniques such as clustering or locality sensitive hashing to
>>>> find similar documents.
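>>>>
>>>> For the LSH route, a minimal random-hyperplane signature sketch (again
>>>> just an illustration, assuming Mahout's Vector; documents with the same
>>>> signature land in the same bucket, and only those pairs get compared
>>>> exactly):
>>>>
>>>> import java.util.Random;
>>>> import org.apache.mahout.math.DenseVector;
>>>> import org.apache.mahout.math.Vector;
>>>>
>>>> /** Illustration only: random-hyperplane LSH for cosine similarity. */
>>>> public class RandomHyperplaneLsh {
>>>>   private final Vector[] hyperplanes;
>>>>
>>>>   public RandomHyperplaneLsh(int numBits, int numTerms, long seed) {
>>>>     Random random = new Random(seed);
>>>>     hyperplanes = new Vector[numBits];
>>>>     for (int i = 0; i < numBits; i++) {
>>>>       Vector h = new DenseVector(numTerms);
>>>>       for (int j = 0; j < numTerms; j++) {
>>>>         h.setQuick(j, random.nextGaussian()); // random direction
>>>>       }
>>>>       hyperplanes[i] = h;
>>>>     }
>>>>   }
>>>>
>>>>   /** One bit per hyperplane: which side of the plane the doc falls on. */
>>>>   public int signature(Vector doc) {
>>>>     int sig = 0;
>>>>     for (int i = 0; i < hyperplanes.length; i++) {
>>>>       if (doc.dot(hyperplanes[i]) >= 0) {
>>>>         sig |= (1 << i);
>>>>       }
>>>>     }
>>>>     return sig;
>>>>   }
>>>> }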
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>>
>>>> On 13.07.2012 18:03, Pat Ferrel wrote:
>>>>
>>>>> I increased the timeout to 100 minutes and added another machine (does
>>>>> the new machine matter in this case?). The job completed successfully.
>>>>>
>>>>> You say the algorithm is non-scalable--did you mean it's not
>>>>> parallelizable? I assume I'll need to keep increasing this limit?
>>>>>
>>>>> I'm sure you know better than I do that increasing the timeout so far
>>>>> is not really good for cluster efficiency, since it means jobs can take
>>>>> much longer in the case of transient task failures.
>>>>>
>>>>> On 7/12/12 8:26 AM, Pat Ferrel wrote:
>>>>>
>>>>>> OK, thanks. I haven't checked for sparsity. However, I have many
>>>>>> successful runs of rowsimilarity with up to 150,000 docs and 250,000
>>>>>> terms, as I said below. This run has a much smaller matrix. I
>>>>>> understand that sparsity is a different question, but anyway, since the
>>>>>> data in all cases is a crawl of the same sites, I'd expect the same
>>>>>> sparsity in all the data sets whether they succeeded or timed out.
>>>>>>
>>>>>> My issue has nothing to do with the elapsed time, although I'll have to
>>>>>> consider it in larger data sets (thanks for the heads up). Is it
>>>>>> impossible to check in with the task tracker, avoiding a timeout? Or
>>>>>> is there some other issue?
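>>>>>>
>>>>>> (From what I've read, a long-running mapper can keep the task tracker
>>>>>> happy by reporting progress from inside its loop; a generic sketch of
>>>>>> the idea, not Mahout's actual CooccurrencesMapper:)
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> import org.apache.hadoop.io.IntWritable;
>>>>>> import org.apache.hadoop.io.Text;
>>>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>>>>
>>>>>> /** Generic illustration: a slow map() keeps the task alive by calling
>>>>>>  *  progress(), so the task timeout never fires. */
>>>>>> public class LongRunningMapper extends Mapper<IntWritable, Text, IntWritable, Text> {
>>>>>>   @Override
>>>>>>   protected void map(IntWritable key, Text value, Context context)
>>>>>>       throws IOException, InterruptedException {
>>>>>>     for (int i = 0; i < 1000000; i++) {
>>>>>>       // ... expensive per-record work would go here ...
>>>>>>       if (i % 10000 == 0) {
>>>>>>         context.progress(); // tell the task tracker we're still alive
>>>>>>       }
>>>>>>     }
>>>>>>     context.write(key, value);
>>>>>>   }
>>>>>> }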
>>>>>>
>>>>>> On 7/12/12 8:06 AM, Sebastian Schelter wrote:
>>>>>>
>>>>>>> It's important to note that the performance of RowSimilarityJob
>>>>>>> heavily depends on the sparsity of the input data, because in general
>>>>>>> comparing all pairs of things is a quadratic (non-scalable) problem.
>>>>>>>
>>>>>>> 2012/7/12 Sebastian Schelter <ssc@apache.org>:
>>>>>>>
>>>>>>>> Sorry, I missed that it's more than one machine. Could you provide
>>>>>>>> the values for the counters from RowSimilarityJob (ROWS,
>>>>>>>> COOCCURRENCES, PRUNED_COOCCURRENCES)?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> 2012/7/12 Pat Ferrel <pat@occamsmachete.com>:
>>>>>>>>
>>>>>>>>> Thanks, actually there are two machines. I am testing before
>>>>>>>>> spending on AWS. It's failing the test in this case.
>>>>>>>>>
>>>>>>>>> BTW I ran the same setup with 150,000 docs and 250,000 terms with a
>>>>>>>>> much lower timeout (30000000) and all worked fine. I was using 0.6
>>>>>>>>> at the time and am not sure if 0.8 has ever completed a
>>>>>>>>> rowsimilarity of any size. Small runs work fine on my laptop.
>>>>>>>>>
>>>>>>>>> I smell some kind of problem other than simple performance. In any
>>>>>>>>> case, in a perfect world isn't the code supposed to check in often
>>>>>>>>> enough that the cluster config doesn't need to be tweaked for a
>>>>>>>>> specific job?
>>>>>>>>>
>>>>>>>>> It may be some problem of mine, of course. I see no obvious Hadoop
>>>>>>>>> or Mahout errors, but there are many places to look.
>>>>>>>>>
>>>>>>>>> With a 100 minute timeout I am currently at the pause between map
>>>>>>>>> and reduce. If it fails, would you like any specific logs?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/11/12 4:00 PM, Sebastian Schelter wrote:
>>>>>>>>>
>>>>>>>>>> To be honest, I don't think it makes a lot of sense to test a
>>>>>>>>>> Hadoop job on a single machine. It's pretty obvious that you will
>>>>>>>>>> get terrible performance.
>>>>>>>>>>
>>>>>>>>>> 2012/7/12 Pat Ferrel <pat@occamsmachete.com>:
>>>>>>>>>>
>>>>>>>>>>> BTW the timeout is 1800 but the task in total runs over 9 hours
>>>>>>>>>>> before each failure. This causes the job to take (after three
>>>>>>>>>>> tries) 27 hrs to completely fail. Oh, bother...
>>>>>>>>>>>
>>>>>>>>>>> The timeout seems to be during the last map, when the mappers
>>>>>>>>>>> reach 100% but are still running. Maybe some kind of cleanup is
>>>>>>>>>>> happening? The first reducer is still "pending"; the reducer never
>>>>>>>>>>> gets a chance to start.
>>>>>>>>>>>
>>>>>>>>>>> 12/07/11 11:09:45 INFO mapred.JobClient:  map 92% reduce 0%
>>>>>>>>>>> 12/07/11 11:11:06 INFO mapred.JobClient:  map 93% reduce 0%
>>>>>>>>>>> 12/07/11 11:12:51 INFO mapred.JobClient:  map 94% reduce 0%
>>>>>>>>>>> 12/07/11 11:15:22 INFO mapred.JobClient:  map 95% reduce 0%
>>>>>>>>>>> 12/07/11 11:18:43 INFO mapred.JobClient:  map 96% reduce 0%
>>>>>>>>>>> 12/07/11 11:24:32 INFO mapred.JobClient:  map 97% reduce 0%
>>>>>>>>>>> 12/07/11 11:27:40 INFO mapred.JobClient:  map 98% reduce 0%
>>>>>>>>>>> 12/07/11 11:30:53 INFO mapred.JobClient:  map 99% reduce 0%
>>>>>>>>>>> 12/07/11 11:36:35 INFO mapred.JobClient:  map 100% reduce 0%
>>>>>>>>>>> ---after a very long wait (9hrs or so) insert fail here--->
>>>>>>>>>>>
>>>>>>>>>>> 8 core, 2 machine cluster with 8G RAM per machine
>>>>>>>>>>> 32,000 docs
>>>>>>>>>>> 76,000 terms
>>>>>>>>>>>
>>>>>>>>>>> Any other info you need, please ask.
>>>>>>>>>>>
>>>>>>>>>>> I'm about to try cranking the timeout up to a couple of hours, but
>>>>>>>>>>> I suspect there is something else going on here.
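>>>>>>>>>>>
>>>>>>>>>>> For reference, the knob I'm changing is mapred.task.timeout, in
>>>>>>>>>>> milliseconds, so a couple of hours is 7200000. Assuming the mahout
>>>>>>>>>>> script passes generic -D options through to Hadoop as usual, it can
>>>>>>>>>>> also be set per job (paths are placeholders):
>>>>>>>>>>>
>>>>>>>>>>> mahout rowsimilarity -Dmapred.task.timeout=7200000 \
>>>>>>>>>>>     -i <matrix dir> -o <output dir> ...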
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 7/11/12 10:35 AM, Pat Ferrel wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have a custom Lucene stemming analyzer that filters out stop
>>>>>>>>>>>> words, and I use the following seq2sparse. The -x 40 is the only
>>>>>>>>>>>> other thing that affects tossing frequent terms and, as I
>>>>>>>>>>>> understand things, tosses any term that appears in over 40% of
>>>>>>>>>>>> the docs.
>>>>>>>>>>>>
>>>>>>>>>>>> mahout seq2sparse \
>>>>>>>>>>>>        -i b2/seqfiles/ \
>>>>>>>>>>>>        -o b2/vectors/ \
>>>>>>>>>>>>        -ow \
>>>>>>>>>>>>        -chunk 2000 \
>>>>>>>>>>>>        -x 40 \
>>>>>>>>>>>>        -seq \
>>>>>>>>>>>>        -n 2 \
>>>>>>>>>>>>        -nv \
>>>>>>>>>>>>        -a com.finderbots.analyzers.LuceneStemmingAnalyzer
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/11/12 9:18 AM, Sebastian Schelter wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pat,
>>>>>>>>>>>>>
>>>>>>>>>>>>> have you removed highly frequent terms before launching the
>>>>>>>>>>>>> rowsimilarity job?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11.07.2012 18:14, Pat Ferrel wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've been trying to get a rowsimilarity job to complete. It
>>>>>>>>>>>>>> continues to time out on a
>>>>>>>>>>>>>> RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've
>>>>>>>>>>>>>> upped the timeout to 30 minutes now. There are no errors in
>>>>>>>>>>>>>> the logs that I can see, and no other task I've tried is
>>>>>>>>>>>>>> acting like this. Is this expected? Shouldn't the task check
>>>>>>>>>>>>>> in more often?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's doing 34,000 docs with 40 sim docs each on 8 cores, so it
>>>>>>>>>>>>>> is a bit slow anyway; still, I shouldn't have to turn up the
>>>>>>>>>>>>>> timeout so high, should I?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>
>>>>
>>>
>> --------------------------
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Mahout & Solr
>>
>>
>>
>>
>>
>>
>
>
