mahout-user mailing list archives

From Andrew Musselman <>
Subject Re: solr-recommender, recent changes to ToItemVectorsMapper
Date Mon, 05 Aug 2013 00:49:57 GMT
+1 on vector properties

On Aug 4, 2013, at 5:34 PM, Pat Ferrel <> wrote:

> It does bring up a nice way to order the items in the A and B docs, by timestamp if
> available. That way when you get an h_b doc from B for the query:
> recommend based on behavior with regard to B items and actions h_b
>      query is [b-b-links: h_b]
> the h_b items are ordered by recency. You can truncate based on the number of actions
> you want to consider. This should be very easy to implement if only we could attach data
> to the items in the DRMs.
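
A rough sketch of the recency ordering and truncation being described, assuming a hypothetical AssociatedAction record and query-building helper; the field name b-b-links comes from the query above, but maxActions and everything else here is made up for illustration and is not solr-recommender code:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical: one user action against a B item, carrying its timestamp.
    class AssociatedAction {
      final String itemId;
      final long timestamp;
      AssociatedAction(String itemId, long timestamp) {
        this.itemId = itemId;
        this.timestamp = timestamp;
      }
    }

    class RecencyQuery {
      // Order h_b newest-first and keep only the most recent maxActions items.
      static String build(List<AssociatedAction> hB, int maxActions) {
        String items = hB.stream()
            .sorted(Comparator.comparingLong((AssociatedAction a) -> a.timestamp).reversed())
            .limit(maxActions)
            .map(a -> a.itemId)
            .collect(Collectors.joining(" "));
        return "b-b-links:(" + items + ")";
      }
    }
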
> Actually this brings up another point that I've harped on before. It sure would be nice
> to have a vector representation where you could attach arbitrary data to items or vectors.
> Not so memory efficient, but it makes things like ID translation and timestamping actions
> trivial. If these could be attached and survive all the Mahout jobs there would be no need
> for the in-memory hashmap I'm using to translate IDs, and the actions could be timestamped
> or other metadata could be attached. At present I guess everyone knows that only weights
> are attached to actions/matrix values, and in some cases names to rows/vectors in DRMs.
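
For what it's worth, a bare-bones sketch of the "vector with properties" idea might look like the following. PropertyVector and its fields are invented for illustration; only Vector and RandomAccessSparseVector are real Mahout classes, and nothing like this currently survives serialization through the Mahout jobs:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Hypothetical wrapper: a Mahout Vector plus arbitrary per-element metadata,
    // e.g. the external item ID and the action timestamp.
    class PropertyVector {
      final Vector vector;
      final Map<Integer, Map<String, Object>> elementProps =
          new HashMap<Integer, Map<String, Object>>();

      PropertyVector(int cardinality) {
        this.vector = new RandomAccessSparseVector(cardinality);
      }

      void set(int index, double weight, String externalId, long timestamp) {
        vector.set(index, weight);
        Map<String, Object> props = new HashMap<String, Object>();
        props.put("externalId", externalId); // would replace the in-memory ID-translation map
        props.put("timestamp", timestamp);   // would make recency ordering/truncation trivial
        elementProps.put(index, props);
      }
    }
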
> On Aug 4, 2013, at 12:59 PM, Ted Dunning <> wrote:
> On Sun, Aug 4, 2013 at 9:35 AM, Pat Ferrel <> wrote:
>> 2) This is not the ideal way to downsample if I understand the code. It keeps
>> the first items ingested, which has nothing to do with their timestamp.
>> You'd ideally want to truncate based on the order the actions were taken by
>> the user, keeping the newest.
> There are at least three options for down-sampling.  All have arguments in
> their favor and probably have good applications.  I don't think it actually
> matters, however, since down-sampling should mostly be applied to
> pathological cases like bots or QA teams.
> The options that I know of include:
> 1) take the first events you see.  This is easy.  For content, it may be
> best to do this because this gives you information about the context of the
> content when it first appears.  For users, this may be worst as a
> characterization of the user now, but it may be near best for the off-line
> item-item analysis because it preserves a densely sampled view of some past
> moment in time.
> 2) take the last events you see.  This is also easy, but not quite as easy
> as (1) since you can't stop early if you see the data in chronological
> order.  For content, this gives you the latest view of the content and
> pushes all data for all items into the same time frame, which might increase
> overlap in the offline analysis.  For users at recommendation time, it is
> probably exactly what you want.
> 3) take some time-weighted sampling that is in between these two options.
> You can do reservoir sampling to get a fair sample, or you can do random
> replacement, which weights the recent past more heavily than the far past.
> Both of these are attractive for various reasons.  The strongest argument
> for recency-weighted sampling is probably that it is hard to decide between
> (1) and (2).
> As stated above, however, this probably doesn't much matter since the
> sampling being done in the off-line analysis is mostly only applied to
> crazy users or stuff so popular that any sample will do.
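
A minimal sketch of the two schemes mentioned in (3), assuming a per-user event stream; the EventSampler class is hypothetical, not anything in Mahout. With fair reservoir sampling every event is equally likely to survive; with "random replacement" every new event is always kept and overwrites a random slot, which biases the sample toward the recent past:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical: keep at most `limit` events per user while streaming.
    class EventSampler {
      private final List<String> reservoir;
      private final int limit;
      private final boolean recencyBiased;
      private final Random rand = new Random();
      private int seen = 0;

      EventSampler(int limit, boolean recencyBiased) {
        this.limit = limit;
        this.recencyBiased = recencyBiased;
        this.reservoir = new ArrayList<String>(limit);
      }

      // Call once per event, in the order the events are read.
      void offer(String event) {
        seen++;
        if (reservoir.size() < limit) {
          reservoir.add(event);                       // fill the reservoir first
        } else if (recencyBiased) {
          reservoir.set(rand.nextInt(limit), event);  // always keep the newest event
        } else if (rand.nextInt(seen) < limit) {
          reservoir.set(rand.nextInt(limit), event);  // fair: keep with probability limit/seen
        }
      }

      List<String> sample() {
        return reservoir;
      }
    }
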
