mahout-user mailing list archives

From: Pat Ferrel <>
Subject: Re: mapreduce ItemSimilarity input optimization
Date: Sat, 16 Aug 2014 16:16:34 GMT
The Spark version, “spark-itemsimilarity”, uses _your_ IDs. It is ready to try and I’d
love it if you could. The IDs are kept in a HashBiMap in memory on each cluster machine,
so memory use is limited by the size of the dictionary, but in practice that should work
for many (most) applications. The conversion of your IDs into Mahout IDs is done inside the
job and in parallel, so it is about as fast as it can be, though we may be able to optimize
the memory footprint over time.
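
To give a feel for the dictionary idea, here is a minimal sketch in Java using Guava’s
HashBiMap. It is not Mahout’s actual implementation; the class and method names (other than
HashBiMap itself) are only illustrative:

    import com.google.common.collect.BiMap;
    import com.google.common.collect.HashBiMap;

    // Sketch of a bidirectional ID dictionary: external string IDs get sequential
    // Mahout ints, and the inverse view translates results back to your IDs.
    public class IdDictionarySketch {
      private final BiMap<String, Integer> dictionary = HashBiMap.create();

      // Return the Mahout int ID for an external ID, assigning a new one if unseen.
      public int toMahoutId(String externalId) {
        Integer id = dictionary.get(externalId);
        if (id == null) {
          id = dictionary.size();
          dictionary.put(externalId, id);
        }
        return id;
      }

      // Translate a Mahout int ID back to the original external ID.
      public String toExternalId(int mahoutId) {
        return dictionary.inverse().get(mahoutId);
      }
    }

Since each cluster machine holds the whole map, memory grows with the number of distinct
user and item IDs, which is the limitation mentioned above.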

Run “mahout spark-itemsimilarity” to get a full list of options. You can specify some
form of text-delimited format for input; the default uses [\t, ] as the delimiter and expects
(userID, itemID, ignored-text), but you can specify which column of the text-delimited file
contains which ID, and even use filters to capture only the lines with data if you are reading
log files.
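
For example, input in the default format could look like the following (the values are made
up, and the third column is ignored):

    u123,ipad,view
    u123,iphone,purchase
    u456,nexus,view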

I’ll see if I can get a doc up on the mahout site to explain it a bit better.

As for providing input to Mahout in binary form, the Hadoop version of “rowsimilarity”
takes a DRM sequence file. This would be one row per user, containing a Mahout userID and a
Mahout SparseVector of that user's item interactions. You will still have to convert the IDs
yourself, though.
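
For reference, here is a minimal sketch of writing such a sequence file, assuming the usual
Hadoop SequenceFile API with IntWritable keys (Mahout user IDs) and Mahout VectorWritable
values; the path, cardinality, and IDs below are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteInteractionsDrm {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("interactions.drm");   // placeholder output path
        int numItems = 100000;                     // number of distinct Mahout item IDs

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, IntWritable.class, VectorWritable.class);
        try {
          // One row per user: key = Mahout user ID, value = sparse vector of item interactions.
          Vector row = new RandomAccessSparseVector(numItems);
          row.set(7, 1.0);   // this user interacted with Mahout item ID 7
          row.set(42, 1.0);  // ... and with item ID 42
          writer.append(new IntWritable(0), new VectorWritable(row));
        } finally {
          writer.close();
        }
      }
    }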

On Aug 16, 2014, at 5:10 AM, Serega Sheypak <> wrote:

Hi, we are trying to calculate ItemSimilarity.
Right now we have 2*10^7 input lines. I provide the input data as raw text
each day to recalculate item similarities. We get an additional 100..1000 new items
each day.
1. It takes too much time to prepare the input data.
2. It takes too much time to convert user_id, item_id to Mahout IDs.

Is there any possibility to provide data to the Mahout mapreduce
ItemSimilarity using some binary format with compression?
