mahout-user mailing list archives

From Edith Au <edith...@gmail.com>
Subject Re: RowSimilarityJob with sparse matrix skips rows
Date Tue, 22 Jul 2014 18:39:06 GMT
BTW, what is the min sparsity for a DRM?


On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <edith.au@gmail.com> wrote:

> You mentioned a matrix decomposition technique.  Should I run the SVD job
> instead of RowSimilarityJob?  I found this page, which describes the SVD
> job, and it seems like that's what I should try.  However, I notice the SVD
> job does not take a similarity class as input.  Would the SVD job return a
> DRM with similarity vectors?  Also, I am not sure how to determine the
> decomposition rank.  In the book example above, would the rank be 600?
>
> https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
>
>
> I see your point on using other information (i.e., browsing history) to
> "boost" correlation.  This is something I will try after my demo deadline
> (or if I cannot find a way to solve the DRM sparsity problem).  BTW, I
> took the Solr/Mahout combo approach you described in your book.  It works
> very well for the cases where a Mahout similarity vector is present.
>
> Thanks for your help.  Much appreciated.
> Edith
>
>
> On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
>> Having such sparse data is going to make it very difficult to do anything
>> at all.  For instance, if you have only one non-zero in a row, there is no
>> cooccurrence to analyze and that row should be deleted.  With only two
>> non-zeros, you have to be very careful about drawing any inferences.
>>
>> The other aspect of sparsity is that you only have 600 books.  That may
>> mean that you would be better served by using a matrix decomposition
>> technique.
>>
>> One question I have is whether you have other actions besides purchase
>> that indicate engagement with the books.  Can you record which users
>> browse a certain book?  How about whether they have read the reviews?
>>
>>
>>
>> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <edith.au@gmail.com> wrote:
>>
>> > Hi
>> >
>> > My RowSimilarityJob returns a DRM with some rows missing.  The input
>> > file is very sparse: there are about 600 columns, but only 1 to 6 have a
>> > value in each row.  The missing rows are among those with only 1 or 2
>> > values filled.  Not all rows with 1 or 2 values are missing, just some of
>> > them, and the missing rows are not always the same for each
>> > RowSimilarityJob execution.
>> >
>> > What I would like to achieve is to find the relative strength between
>> > rows.  For example, if there are 600 books, and user1 and user2 each like
>> > only one book (the same book), then there should be a correlation between
>> > these 2 users.
>> >
>> > But my RowSimilarityJob output file seems to skip some of the users with
>> > sparse preferences.  I am running the job locally with 4 options: input,
>> > output, SIMILARITY_LOGLIKELIHOOD, and temp dir.  What would be the right
>> > approach to pick up similarity between users with sparse preferences?
>> >
>> > Thanks!
>> >
>> > Edith
>> >
>>
>
>
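
To make the 600-book example in the thread concrete, here is a minimal Python
sketch of a log-likelihood ratio (G^2) score between two rows, computed from a
2x2 table of column counts.  It is an illustration only and not Mahout's code:
the function names (x_log_x, entropy, log_likelihood_ratio, row_llr), the
set-of-column-ids row layout, and the sample users are all assumptions made
for this example.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts: N*log(N) - sum(k*log(k))."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def row_llr(cols_a, cols_b, num_columns):
    """Score two rows, each given as the set of columns with a non-zero."""
    both = len(cols_a & cols_b)
    only_a = len(cols_a - cols_b)
    only_b = len(cols_b - cols_a)
    neither = num_columns - both - only_a - only_b
    return log_likelihood_ratio(both, only_a, only_b, neither)


# Hypothetical users out of a catalogue of 600 books.
user1 = {"book42"}
user2 = {"book42"}  # same single book as user1
user3 = {"book7"}   # a different single book

print(row_llr(user1, user2, 600))  # shared book: clearly positive score
print(row_llr(user1, user3, 600))  # no shared book: score close to zero
```

Under these assumptions, two users who share their only book do get a
non-zero score, so the statistic itself does not vanish for such pairs.  The
sketch says nothing about why the job drops rows; it only shows what the
score looks like for very sparse rows.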
