mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: MinHash/ItemBased
Date Tue, 25 Oct 2011 14:43:50 GMT
Why recommend for all users -- why not just new ones, or ones that have
been updated? Yes, you're not meant to load all users into memory when
using "-u".

A very crude rule of thumb is that you can compute about 100 recs per
second per normal machine on normal-sized data (no Hadoop). Eight machines
would crank through 8.3M recs in about 3 hours at best: 8,335,013 recs /
(8 machines x 100 recs/sec) is roughly 10,400 seconds. Hadoop is going to
be 3-4x slower than that due to its overheads.

This pipeline probably takes 10 minutes or so to finish even with 0
input; that's the Hadoop overhead. If you're trying to finish
computations in minutes, Hadoop probably isn't suitable.

But I think this all works much, much better if you can recompute only
the users that have changed their prefs.
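
A minimal sketch of that idea, under assumptions not stated in the thread:
prefs arrive as one "userID,itemID" line per opt-in, and the file names are
hypothetical. The output is the kind of list you could feed back to
RecommenderJob via -u so that only changed users get recomputed:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ChangedUsers {

  // Load userID -> set of itemIDs from a "userID,itemID" file.
  static Map<String, Set<String>> load(String path) throws IOException {
    Map<String, Set<String>> prefs = new HashMap<String, Set<String>>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split(",");
      Set<String> items = prefs.get(fields[0]);
      if (items == null) {
        items = new HashSet<String>();
        prefs.put(fields[0], items);
      }
      items.add(fields[1]);
    }
    in.close();
    return prefs;
  }

  public static void main(String[] args) throws IOException {
    Map<String, Set<String>> yesterday = load("prefs-yesterday.csv"); // hypothetical
    Map<String, Set<String>> today = load("prefs-today.csv");         // hypothetical
    PrintWriter out = new PrintWriter("changed-users.txt");
    for (Map.Entry<String, Set<String>> e : today.entrySet()) {
      // New users, and users whose opt-in set changed, need fresh recs;
      // equals() on a null (never-seen) yesterday entry is false, so new
      // users are included too.
      if (!e.getValue().equals(yesterday.get(e.getKey()))) {
        out.println(e.getKey());
      }
    }
    out.close();
  }
}

Recomputing only that subset should shrink the expensive recommendation
stage roughly in proportion to the fraction of users that actually changed.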


On Tue, Oct 25, 2011 at 3:27 PM, Vishal Santoshi
<vishal.santoshi@gmail.com> wrote:
> The data is big. For a single day (and I picked an arbitrary day):
>
> 8,335,013 users.
> 256,010 distinct items.
>
> I am using the item-based recommender (the RecommenderJob), with no
> preference values (an opt-in is the signal of preference; multiple opt-ins
> are counted as 1):
>
>            <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
>            <arg>recommender</arg>
>            <arg>--input</arg>
>            <arg>${out}/items/bag</arg>
>            <arg>--output</arg>
>            <arg>${out}/items_similarity</arg>
>            <arg>-u</arg>
>            <arg>${out}/items/users/part-r-00000</arg>
>            <arg>-b</arg>
>            <arg>-n</arg>
>            <arg>2</arg>
>            <arg>--similarityClassname</arg>
>            <arg>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity</arg>
>            <arg>--tempDir</arg>
>            <arg>${out}/temp</arg>
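
(Note for readers of the archive: in RecommenderJob, -b/--booleanData treats
the input as boolean preferences, -n/--numRecommendations sets how many recs
per user, and -u/--usersFile restricts recommendation to the users listed in
the given file. With boolean data, TanimotoCoefficientSimilarity scores two
items by the overlap of their user sets: |A intersect B| / |A union B|.)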
>
> Of course the recommendations are for every user, and thus the
> RecommenderJob's PartialMultiplyMapper/AggregateAndRecommendReducer stage
> is the most expensive of all.
> Further, I am not sure why the user file is taken in as a Distributed File,
> especially when it may actually be bigger than a typical TaskTracker JVM
> memory limit.
>
>
>
> In the case of MinHash, the MinHashDriver:
>
>       <java>
>            <job-tracker>${jobTracker}</job-tracker>
>            <name-node>${nameNode}</name-node>
>             <prepare>
>                <delete path="${out}/minhash"/>
>            </prepare>
>            <configuration>
>                <property>
>                    <name>mapred.job.queue.name</name>
>                    <value>${queueName}</value>
>                </property>
>            </configuration>
>            <main-class>com.nytimes.computing.mahout.JobDriver</main-class>
>            <arg>minhash_local</arg>
>            <arg>--input</arg>
>            <arg>${out}/bag</arg>
>            <arg>--output</arg>
>            <arg>${out}/minhash</arg>
>            <arg>--keyGroups</arg>       <!-- key groups -->
>            <arg>2</arg>
>            <arg>-r</arg>                 <!-- number of reducers -->
>            <arg>40</arg>
>            <arg>--minClusterSize</arg>   <!-- a legitimate cluster must have this many members -->
>            <arg>5</arg>
>            <arg>--hashType</arg>         <!-- murmur and linear are the other 2 options -->
>            <arg>polynomial</arg>
>        </java>
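
(Background note, from standard MinHash theory rather than this thread: the
probability that two users receive the same min-hash value equals the
Jaccard/Tanimoto similarity of their item sets. If --keyGroups behaves like
the usual LSH banding -- concatenating that many hash values into one cluster
key -- then the probability two users share a cluster key is that similarity
raised to the power of the group size, which is why larger key groups yield
tighter, smaller clusters.)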
>
> This of course scales. I still have to work with the clusters created, and
> a fair amount of work has to be done to figure out which cluster is
> relevant.
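
One hypothetical way to fill that gap, sketched here for the archive: score
each cluster's aggregate item set against a user's own opt-ins with the same
Tanimoto measure, and treat the best-scoring cluster as the relevant one.
Nothing below is Mahout API; the data structures are assumptions:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ClusterRelevance {

  // Tanimoto / Jaccard coefficient between two item-ID sets.
  static double tanimoto(Set<Long> a, Set<Long> b) {
    Set<Long> intersection = new HashSet<Long>(a);
    intersection.retainAll(b);
    int union = a.size() + b.size() - intersection.size();
    return union == 0 ? 0.0 : (double) intersection.size() / union;
  }

  // Pick the cluster whose combined item set best matches the user's items.
  static String mostRelevantCluster(Set<Long> userItems,
                                    Map<String, Set<Long>> clusterItems) {
    String best = null;
    double bestScore = -1.0;
    for (Map.Entry<String, Set<Long>> e : clusterItems.entrySet()) {
      double score = tanimoto(userItems, e.getValue());
      if (score > bestScore) {
        bestScore = score;
        best = e.getKey();
      }
    }
    return best;
  }
}

The same score could also serve as the "degree of association" with each of
the multiple clusters a user belongs to, rather than picking just one.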
>
>
> A week of data in this case produced the MinHash clustering on our cluster
> in about 20 minutes.
>
>
> Regards.
>
>
> On Tue, Oct 25, 2011 at 10:07 AM, Sean Owen <srowen@gmail.com> wrote:
>
>> Can you put any more numbers around this? How slow is slow, how big is big?
>> What part of Mahout are you using -- or are you using Mahout?
>>
>> Item-based recommendation sounds fine. Anonymous users aren't a
>> problem as long as you can distinguish them reasonably.
>> I think your challenge is to have a data model that quickly drops out
>> data from old items and can bring new items in.
>>
>> Is this small enough to do in memory? That's the simple, easy place to
>> start.
>>
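
A minimal sketch of that in-memory suggestion, using Mahout's non-Hadoop
Taste API; the input path and user ID are hypothetical. With pure opt-in
data, FileDataModel accepts plain "userID,itemID" lines with no preference
value and builds a boolean-preference model:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class InMemoryRecs {
  public static void main(String[] args) throws Exception {
    // One "userID,itemID" line per opt-in; no preference values (boolean data).
    DataModel model = new FileDataModel(new File("opt-ins.csv")); // hypothetical path
    Recommender recommender = new GenericBooleanPrefItemBasedRecommender(
        model, new TanimotoCoefficientSimilarity(model));
    // Top 2 recs for one (hypothetical) user, mirroring -n 2 in the job above.
    List<RecommendedItem> recs = recommender.recommend(12345L, 2);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}

Whether 8.3M users and 256K distinct items fit in one JVM heap is the open
question; opt-in-only data is at least compact, since no preference values
need to be stored.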
>> On Tue, Oct 25, 2011 at 2:59 PM, Vishal Santoshi
>> <vishal.santoshi@gmail.com> wrote:
>> > Hello Folks,
>> > The item-based recommendations for my dataset are excruciatingly slow on
>> > an 8-node cluster. Yes, the number of items is big, and the dataset churn
>> > does not allow for a long asynchronous process. Recommendations cannot be
>> > stale (a 30-minute delay is stale). I have tried out MinHash clustering,
>> > and that is scalable, but without a "degree of association" with the
>> > multiple clusters a user may belong to, it seems less tight than a pure
>> > item-based (and thus similarity-probability) algorithm.
>> >
>> > Any ideas how we pull this off, where:
>> >
>> > * The item churn is frequent. New items enter the dataset all the time.
>> > * There is no "preference" apart from opt in.
>> > * Anonymous users enter the system almost all the time.
>> >
>> >
>> > Scale is very important.
>> >
>> > I am tending towards MinHash, plus additional algorithms that are
>> > executed offline, and co-occurrence.
>> >
>>
>
