mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Which database should I use with Mahout
Date Wed, 22 May 2013 00:42:49 GMT

On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel <> wrote:

> In the interest of getting some empirical data out about various
> architectures:
> On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel <> wrote:
> >> ...
> >> You use the user history vector as a query?
> >
> > The most recent suffix of the history vector.  How much is used varies by
> > the purpose.
> We did some experiments with this using a year+ of e-com data. We measured
> the precision using different amounts of the history vector in 3 month
> increments. The precision increased throughout the year. At about 9 months
> the effects of what appears to be item/product/catalog/new model churn
> became significant, so precision started to level off. We did *not*
> filter the recs, so items no longer in the current catalog were still
> counted when precision was measured. We'd expect such a filter to improve
> results using older data.

This is a time filter.  How many transactions did this turn out to be?  I
typically recommend truncating based on transactions rather than time.
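
To make the distinction concrete, here is a minimal sketch of the two
truncation styles. The helper names, the event layout (timestamp, item ID)
and the sample user are all hypothetical; only the idea of cutting by count
rather than by age comes from the discussion above.

```python
from datetime import datetime, timedelta

def truncate_by_count(history, max_items):
    """Keep the most recent max_items events (history is sorted oldest first)."""
    return history[-max_items:] if max_items > 0 else []

def truncate_by_time(history, now, window):
    """Keep only events that are no older than `window` relative to `now`."""
    return [(ts, item) for ts, item in history if now - ts <= window]

now = datetime(2013, 5, 22)
# Hypothetical user: one event every 30 days over the past year, oldest first.
history = [(now - timedelta(days=30 * k), f"item{k}") for k in range(11, -1, -1)]

recent_by_count = truncate_by_count(history, 5)
recent_by_time = truncate_by_time(history, now, timedelta(days=90))

print(len(recent_by_count), "events kept by count")
print(len(recent_by_time), "events kept by a 90-day window")
```

With a count-based cut every user contributes the same number of query
terms, while a time window gives busy users long vectors and quiet users
very short ones.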

My own experience was music and video recommendations.  Long history
definitely did not help much there.

> >
> > 20 recs is not sufficient.  Typically you need 300 for any given context
> > and you need to recompute those very frequently.  If you use geo-specific
> > recommendations, you may need thousands of recommendations to have enough
> > geo-dispersion.  The search engine approach can handle all of that on the
> > fly.
> >
> > Also, the cached recs are user x (20-300) non-zeros.  The sparsified
> > item-item cooccurrence matrix is item x 50.  Moreover, search engines are
> > very good at compression.  If users >> items, then item x 50 is much
> > smaller, especially after high quality compression (6:1 is a common
> > compression ratio).
> >
> The end application designed by the ecom customer required less than 10
> recs for any given context so 20 gave us room for runtime context type
> boosting.

And how do you generate the next page of results?

> Given that precision increased for a year of user history and that we
> needed to return 20 recs per user and per item, the history matrix was
> nearly 2 orders of magnitude larger than the recs matrix. This was with
> about 5M users and 500K items over a year.

The history matrix should be at most 2.5 T bits = 300GB.  Remember, this is
a binary matrix that is relatively sparse so I would expect that a typical
size would be more like a gigabyte.
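
The arithmetic behind those two numbers can be checked with a short sketch.
The user and item counts are the ones quoted above; the average of 100
history items per user is an assumption picked only to illustrate the sparse
case, not a figure from the data set.

```python
import math

USERS = 5_000_000
ITEMS = 500_000

# Upper bound: one bit for every (user, item) cell of the binary matrix.
dense_bits = USERS * ITEMS
dense_gb = dense_bits / 8 / 1e9

# Sparse estimate: store only the IDs of the items each user has touched.
bits_per_id = math.ceil(math.log2(ITEMS))  # smallest ID width that covers all items
AVG_ITEMS_PER_USER = 100                   # assumption, not a measured value
sparse_gb = USERS * AVG_ITEMS_PER_USER * bits_per_id / 8 / 1e9

print(f"dense bound: {dense_bits / 1e12} Tbit = {dense_gb} GB")
print(f"sparse estimate: {sparse_gb} GB at {bits_per_id} bits per item ID")
```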

> The issue I was asking about was how to store and retrieve history vectors
> for queries. In our case it looks like some kind of scalable persistence
> store would be required and since pre-calculated recs are indeed much
> smaller...

I am still confused about this assertion.  I think that you need <500
history items per person which is about 500 * 19bits < 1.3KB / user.  I
also think that you need 100 or more recs if you prestore them which is
also in the kilobyte range.  This doesn't sound all that different.

And then the search index needs to store 500K x 50 nonzeros = 100 MB.  This
is much smaller than either the history or the prestored recommendations
even before any compression.
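
As a rough check of the sizes in the last two paragraphs, the sketch below
redoes the arithmetic. The 4 bytes per nonzero is an assumption chosen
because it reproduces the 100 MB figure, and the rec size counts packed item
IDs only; stored scores would add to it.

```python
USERS = 5_000_000
ITEMS = 500_000
BITS_PER_ID = 19  # 2**19 = 524,288, enough to address 500K items

def packed_bytes(n_ids, bits_per_id=BITS_PER_ID):
    """Bytes needed to store n_ids item IDs packed at bits_per_id each."""
    return n_ids * bits_per_id / 8

history_per_user = packed_bytes(500)  # 1187.5 bytes, under 1.3 KB
recs_per_user = packed_bytes(100)     # 237.5 bytes, IDs only (no scores)

BYTES_PER_NONZERO = 4                 # assumption: one 32-bit value per nonzero
index_mb = ITEMS * 50 * BYTES_PER_NONZERO / 1e6

history_total_gb = USERS * history_per_user / 1e9
recs_total_gb = USERS * recs_per_user / 1e9

print(f"history per user: {history_per_user} bytes")
print(f"recs per user:    {recs_per_user} bytes")
print(f"all histories:    {history_total_gb} GB")
print(f"all stored recs:  {recs_total_gb} GB")
print(f"search index:     {index_mb} MB")
```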

> Yes using a search engine the index is very small but the history vectors
> are not. Actually I wonder how well Solr would handle a large query? Is the
> truncation of the history vector required perhaps?

The history vector is rarely more than a hundred terms which is not that
large a query.
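
For illustration only, a history vector of that size could be turned into a
search request along these lines. The field name `indicators`, the helper
name and the item IDs are hypothetical; `q` and `rows` are standard Solr
request parameters, and the grouped-terms syntax is ordinary Lucene query
syntax.

```python
from urllib.parse import urlencode

def history_query(item_ids, max_terms=100, field="indicators", rows=300):
    """Build request parameters from the most recent max_terms history items.

    item_ids is ordered oldest first. The IDs are assumed to be plain tokens;
    IDs containing query-syntax characters would need escaping first.
    """
    recent = item_ids[-max_terms:]
    q = f"{field}:({' '.join(recent)})"
    return {"q": q, "rows": rows}

# Hypothetical user with 250 history items; only the last 100 become terms.
params = history_query([f"item{k}" for k in range(250)])
encoded = urlencode(params)

print(len(params["q"].split()), "query terms")
print(len(encoded), "characters in the encoded query string")
```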

> > Actually, it is.  Round trip of less than 10ms is common.  Precalculation
> > goes away.  Export of recs nearly goes away.  Currency of recommendations
> > is much higher.
> This is certainly great performance, no doubt. Using a 12 node Cassandra
> ring (each machine had 16G of memory) spread across two geolocations we got
> 24,000 tps to a worst case of 5000 tps. The average response for the entire
> system (which included two internal service layers and one query to
> cassandra) was 5-10ms per response.

Uh... my numbers are for a single node.  Query rates are typically about
1-2K queries/second so the speed is comparable.
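
Put on a per-node basis, the comparison looks like this. The sketch assumes
the quoted Cassandra throughput is spread evenly over the 12 nodes, which is
a simplification.

```python
NODES = 12
BEST_TPS, WORST_TPS = 24_000, 5_000   # cluster-wide figures quoted above
SINGLE_NODE_QPS = (1_000, 2_000)      # single-node search engine range

per_node_best = BEST_TPS / NODES      # 2000.0
per_node_worst = WORST_TPS / NODES    # roughly 417

print(f"per node: {per_node_worst:.0f} to {per_node_best:.0f} tps")
print(f"single search node: {SINGLE_NODE_QPS[0]} to {SINGLE_NODE_QPS[1]} queries/second")
```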
