mahout-user mailing list archives

From Matt Mitchell <>
Subject Re: Generating similarity file(s) for item recommender?
Date Wed, 04 Jul 2012 13:50:30 GMT
Hi Sean,

Myrrix does look interesting! I'll keep an eye on it.

What I'd like to do is recommend items to users, yes. I looked at the
IDRescorer and it did the job perfectly (pre-filtering).
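In case it helps anyone else reading the archive: the pre-filtering boils down to a candidate-set check. Below is a standalone sketch that mirrors the shape of Mahout's IDRescorer interface (isFiltered/rescore); the "gaming system" item IDs are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the filtering logic behind an IDRescorer-style filter:
// isFiltered(id) drops items outside a candidate set, and
// rescore(id, score) leaves surviving scores untouched.
public class CandidateSetFilter {

  private final Set<Long> allowedItemIDs;

  public CandidateSetFilter(Set<Long> allowedItemIDs) {
    this.allowedItemIDs = allowedItemIDs;
  }

  // Mirrors IDRescorer.isFiltered(long): true means "exclude this item".
  public boolean isFiltered(long itemID) {
    return !allowedItemIDs.contains(itemID);
  }

  // Mirrors IDRescorer.rescore(long, double): pass scores through unchanged.
  public double rescore(long itemID, double originalScore) {
    return originalScore;
  }

  public static void main(String[] args) {
    Set<Long> gamingSystems = new HashSet<>();
    gamingSystems.add(101L);  // hypothetical "gaming system" item IDs
    gamingSystems.add(102L);
    CandidateSetFilter filter = new CandidateSetFilter(gamingSystems);
    System.out.println(filter.isFiltered(101L)); // false: in candidate set
    System.out.println(filter.isFiltered(999L)); // true: excluded
  }
}
```

With the real thing you implement org.apache.mahout.cf.taste.recommender.IDRescorer and pass it to Recommender.recommend(userID, howMany, rescorer), so the filter runs at query time.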

I was a little misleading about the size of the data. The raw data
files are around 1 GB, but after the interesting data is extracted
-- session-id, item-id and type-of-event (product image clicked,
product description viewed, etc.) -- the data file comes out to about
10 MB. Not so bad.
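For the curious, the extraction step looks roughly like this. It's a standalone sketch: the event weights and the hash used to turn cookie session-id strings into the numeric user IDs a FileDataModel needs are my own choices, not anything Mahout prescribes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Boil "sessionId,itemId,eventType" log lines down to the
// "userID,itemID,value" CSV rows that a FileDataModel expects.
// The session id stands in for the user id, hashed to a
// non-negative long because Taste IDs are numeric.
public class PreferenceExtractor {

  // Made-up weights for illustration: viewing a product description
  // counts as stronger interest than clicking an image.
  private static final Map<String, Double> EVENT_WEIGHTS = new HashMap<>();
  static {
    EVENT_WEIGHTS.put("image_click", 1.0);
    EVENT_WEIGHTS.put("description_view", 2.0);
  }

  static long sessionToUserId(String sessionId) {
    return sessionId.hashCode() & 0xFFFFFFFFL;
  }

  public static List<String> extract(List<String> rawLines) {
    List<String> prefs = new ArrayList<>();
    for (String line : rawLines) {
      String[] f = line.split(",");
      if (f.length != 3) {
        continue; // skip malformed rows
      }
      Double weight = EVENT_WEIGHTS.get(f[2].trim());
      if (weight != null) {
        prefs.add(sessionToUserId(f[0].trim()) + "," + f[1].trim() + "," + weight);
      }
    }
    return prefs;
  }
}
```

One wrinkle with hashing: it's one-way, so if you need to map recommended IDs back to original keys, keep a side table (Mahout has ID-migration helpers for this).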

Btw, just bought the Mahout in Action book!
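Also, for anyone wondering what the LogLikelihoodSimilarity that Sean mentions below actually scores: as I understand it, it's the G-squared log-likelihood ratio over a 2x2 co-occurrence table. A standalone sketch of that statistic (my reading of it, not Mahout's code verbatim):

```java
// Log-likelihood ratio (G^2) for a 2x2 co-occurrence table, the
// statistic behind log-likelihood item similarity. For two items:
// k11 = sessions with both, k12/k21 = one but not the other,
// k22 = sessions with neither.
public class Llr {

  // xLogX(0) is defined as 0 so empty cells contribute nothing.
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy of a list of counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double result = 0.0;
    for (long c : counts) {
      result += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - result;
  }

  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Items always seen together score high; an even table
    // (no association) scores ~0.
    System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    System.out.println(logLikelihoodRatio(5, 5, 5, 5));
  }
}
```

The nice property for implicit click data like mine is that it only needs counts, no rating values.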

- Matt

On Tue, Jul 3, 2012 at 10:40 AM, Sean Owen <> wrote:
> I'm not sure if Mridul's suggestion does what you want. Do you want to
> recommend items to users? Then no: you do not start with item IDs and
> recommend to them.
> It sounds like your question is how to compute similarity data. The
> first answer is that you do not use Hadoop unless you must use Hadoop.
> You don't compute it yourself; you let the framework do it with
> LogLikelihoodSimilarity. It just happens automatically. You can use
> caching, you can use precomputation, but that comes after you decide
> that you have too much data to do it all in real-time.
> 1GB of input data suggests you have a lot of data. Is that tens of
> millions of user-item associations? Then yes, you are not in simple
> non-Hadoop land anymore and you need to look at RecommenderJob /
> Hadoop. This doesn't have anything to do with FileDataModel or the
> non-distributed bits.
> To your second point -- this is really what Rescorer does for you: it
> lets you filter or boost certain results at query time. But this is
> part of the non-distributed code. You could try stitching together
> some offline similarities from the Hadoop job and loading them
> selectively in memory as part of the real-time Recommender, but it's
> going to be a bit dicey to get it to work fast.
> I don't mind mentioning that this is exactly the kind of problem I'm
> working on in Myrrix. It does the offline model building on Hadoop
> and still lets you do real-time recommendations, with Rescorer
> objects if you want. The whole point is to fix up the "dicey" hard
> part mentioned above. Might be worth a look.
> On Tue, Jul 3, 2012 at 3:15 PM, Matt Mitchell <> wrote:
>> Thanks Mridul, I'll try this out. Does getItemIDs return every item id
>> from the file in your example?
>> This kind of leads me to another, related question... I want to have
>> my recommender engine recommend items to a user, but the items should
>> be from a known set of item ids. For example, if a user is doing a
>> search for "gaming system", I only want recommendations for "gaming
>> system" items. I was thinking I could feed the recommendation engine a
>> set of item IDs that are known to be "gaming systems" as a candidate
>> set *when executing that actual recommendation*. Does this make sense?
>> If so, do you know how I can do this? I basically want to constrain
>> the recommendations to a set of known item IDs at recommendation time.
>> Thanks again!
>> - Matt
>> On Tue, Jul 3, 2012 at 8:01 AM, Mridul Kapoor <> wrote:
>>>> I'm thinking the session ID (in the cookie) would be used as the user ID.
>>>> The events
>>>> are tied to product IDs, so these would be used in generating the
>>>> preferences.
>>> I guess you could consider product preference on a per-session basis
>>> (i.e. only items for which a user expresses a preference in a single
>>> session are similar to each other, in some way or another). That way
>>> you would be treating the session-ids as dummy user-ids, which I
>>> think should work well.
>>>> I'd like to eventually run this on Hadoop, but it'd also be nice to
>>>> know if there is a way to do this locally, while developing the app,
>>>> maybe using a smaller dataset?
>>> Yes, just writing a small offline recommender (made to run on a local
>>> machine) should do: take a subset of the data, use a FileDataModel,
>>> then do something like
>>> DataModel dataModel = new FileDataModel(new File("subset.csv"));
>>> LongPrimitiveIterator itemIDs = dataModel.getItemIDs();
>>> and iterate over these, getting _n_ recommended items for each and
>>> storing them somewhere (and maybe using them to evaluate the
>>> recommender somehow).
>>> Best,
>>> Mridul
