mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Performance Issue using item-based approach!
Date Thu, 17 Apr 2014 11:00:26 GMT
On 04/17/2014 12:57 PM, Najum Ali wrote:
> @Sebastian
>
> Wow, you are right. The original CSV file is about 21 MB and the corresponding precomputed item-item similarity file is about 260 MB!
> And yes, there are far more than 50 "most similar items" per item.
>
> Restricting this to the 50 (or so) most similar items per item could do the trick, as you said.
> OK, I will give it a try and reply later.
>
> By the way, what about SamplingCandidateItemsStrategy or something similar, using this constructor:
> GenericItemBasedRecommender(DataModel dataModel, ItemSimilarity similarity, CandidateItemsStrategy candidateItemsStrategy, MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy)

That could also help a little, but a better way is to restrict the 
number of similar items to look at.
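
A minimal sketch of such a restriction, applied as a post-processing step to the precomputed file. It is not part of Mahout: the class name TopKSimilarities and the method retainTopK are made up for illustration. It assumes every line has the form itemID,similarItemID,similarity (the layout the parsing code quoted below expects) and that the whole file fits in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Mahout.
final class TopKSimilarities {

    // One parsed row of the similarity file: itemID,similarItemID,similarity
    static final class Row {
        final long item;
        final long other;
        final double similarity;

        Row(long item, long other, double similarity) {
            this.item = item;
            this.other = other;
            this.similarity = similarity;
        }
    }

    // Orders rows from least to most similar, so a heap evicts the weakest first.
    private static final Comparator<Row> BY_SIMILARITY =
        Comparator.comparingDouble((Row r) -> r.similarity);

    // Keeps, for every item in the first column, only its k highest-scoring rows.
    static List<String> retainTopK(List<String> lines, int k) {
        Map<Long, PriorityQueue<Row>> perItem = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split(",");
            Row row = new Row(Long.parseLong(cols[0].trim()),
                              Long.parseLong(cols[1].trim()),
                              Double.parseDouble(cols[2].trim()));
            PriorityQueue<Row> heap =
                perItem.computeIfAbsent(row.item, id -> new PriorityQueue<>(BY_SIMILARITY));
            heap.add(row);
            if (heap.size() > k) {
                heap.poll(); // drop the currently least similar row
            }
        }

        List<String> result = new ArrayList<>();
        for (PriorityQueue<Row> heap : perItem.values()) {
            List<Row> kept = new ArrayList<>(heap);
            kept.sort(BY_SIMILARITY.reversed()); // most similar first
            for (Row r : kept) {
                result.add(r.item + "," + r.other + "," + r.similarity);
            }
        }
        return result;
    }

    // Usage: java TopKSimilarities <input.csv> <output.csv> <k>
    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: TopKSimilarities <input.csv> <output.csv> <k>");
            return;
        }
        List<String> in = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> out = retainTopK(in, Integer.parseInt(args[2]));
        Files.write(Paths.get(args[1]), out, StandardCharsets.UTF_8);
    }
}
```

Run with k = 50, this keeps at most 50 rows per item in the first column. The filtered file can then be loaded into a GenericItemSimilarity in the same way as the unfiltered one.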

--sebastian


>
>
> On 17.04.2014 at 12:41, Sebastian Schelter <ssc@apache.org> wrote:
>
>> Hi Najum,
>>
>> I think I found the problem. Remember: two items are similar whenever at least one user interacted with both of them ("the items co-occur").
>>
>> In the MovieLens dataset this is unfortunately true for almost all pairs of items. From 3706 items, more than 11 million similarities are created. A common remedy (which is unfortunately not yet implemented in our precomputation) is to retain only the top-k similar items per item.
>>
>> A solution would be to take the CSV file that is created by MultithreadedBatchItemSimilarities and postprocess it so that only the 50 most similar items per item are retained. That should help with your problem.
>>
>> Unfortunately, we don't have code for that yet. Maybe you want to try writing it yourself?
>>
>> Best,
>> Sebastian
>>
>> PS: The user-based recommender restricts the number of similar users; I guess that's why it is so fast here.
>>
>>
>> On 04/17/2014 12:18 PM, Najum Ali wrote:
>>> Ok, here you go:
>>>
>>> I have created a simple class with main-method (no server and other stuff):
>>>
>>> public class RecommenderTest {
>>>
>>>     public static void main(String[] args) throws IOException, TasteException {
>>>         DataModel dataModel = new FileDataModel(
>>>             new File("/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
>>>         ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
>>>         ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
>>>
>>>         String pathToPreComputedFile = preComputeSimilarities(recommender, dataModel.getNumItems());
>>>
>>>         InputStream inputStream = new FileInputStream(new File(pathToPreComputedFile));
>>>         BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
>>>         Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
>>>             bufferedReader.lines().map(mapToItemItemSimilarity).collect(Collectors.toList());
>>>         ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
>>>         ItemBasedRecommender recommenderWithPrecomputation =
>>>             new GenericItemBasedRecommender(dataModel, precomputedSimilarity);
>>>
>>>         recommend(recommender);
>>>         recommend(recommenderWithPrecomputation);
>>>     }
>>>
>>>     private static String preComputeSimilarities(ItemBasedRecommender recommender,
>>>             int simItemsPerItem) throws TasteException {
>>>         String pathToAbsolutePath = "";
>>>         try {
>>>             File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
>>>             if (resultFile.exists()) {
>>>                 resultFile.delete();
>>>             }
>>>             BatchItemSimilarities batchJob =
>>>                 new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
>>>             int numSimilarities = batchJob.computeItemSimilarities(
>>>                 Runtime.getRuntime().availableProcessors(), 1, new FileSimilarItemsWriter(resultFile));
>>>             pathToAbsolutePath = resultFile.getAbsolutePath();
>>>             System.out.println("Computed " + numSimilarities
>>>                 + " similarities and saved them to " + pathToAbsolutePath);
>>>         } catch (IOException e) {
>>>             System.out.println("Error while writing pre computed similarities to file");
>>>         }
>>>         return pathToAbsolutePath;
>>>     }
>>>
>>>     private static void recommend(ItemBasedRecommender recommender) throws TasteException {
>>>         long start = System.nanoTime();
>>>         List<RecommendedItem> recommendations = recommender.recommend(1, 10);
>>>         long end = System.nanoTime();
>>>         System.out.println("Created recommendations in "
>>>             + getCalculationTimeInMilliseconds(start, end) + " ms. Recommendations:"
>>>             + recommendations);
>>>     }
>>>
>>>     private static double getCalculationTimeInMilliseconds(long start, long end) {
>>>         double calculationTime = (end - start);
>>>         return (calculationTime / 1_000_000);
>>>     }
>>>
>>>     private static Function<String, GenericItemSimilarity.ItemItemSimilarity>
>>>             mapToItemItemSimilarity = (line) -> {
>>>         String[] row = line.split(",");
>>>         return new GenericItemSimilarity.ItemItemSimilarity(
>>>             Long.parseLong(row[0]), Long.parseLong(row[1]), Double.parseDouble(row[2]));
>>>     };
>>> }
>>>
>>> And that's the output log:
>>>
>>> 3 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel -
>>> Creating FileDataModel for file
>>> /Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv
>>> 63 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel -
>>> Reading file info...
>>> 1207 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel -
>>> Processed 1000000 lines
>>> 1208 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read
>>> lines: 1000209
>>> 1475 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel -
>>> Processed 6040 users
>>> 1599 [main] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - Queued 3706 items in 38 batches
>>> 10928 [pool-1-thread-8] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 7 processed 5 batches
>>> 10928 [pool-1-thread-8] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 7 processed 5 batches. done.
>>> 10978 [pool-1-thread-5] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 4 processed 4 batches. done.
>>> 11589 [pool-1-thread-4] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 3 processed 5 batches
>>> 11589 [pool-1-thread-4] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 3 processed 5 batches. done.
>>> 11592 [pool-1-thread-6] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 5 processed 5 batches
>>> 11592 [pool-1-thread-6] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 5 processed 5 batches. done.
>>> 11707 [pool-1-thread-7] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 6 processed 5 batches
>>> 11707 [pool-1-thread-7] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 6 processed 5 batches. done.
>>> 11730 [pool-1-thread-3] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 2 processed 4 batches. done.
>>> 11849 [pool-1-thread-1] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 0 processed 5 batches
>>> 11849 [pool-1-thread-1] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 0 processed 5 batches. done.
>>> 11854 [pool-1-thread-2] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 1 processed 5 batches
>>> 11854 [pool-1-thread-2] INFO
>>> org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities
>>> - worker 1 processed 5 batches. done.
>>> Computed 9174333 similarities and saved them to
>>> /var/folders/9g/4h38v1tj3ps9j21skc72b56r0000gn/T/similarities.csv
>>> Created recommendations in *1683.613
>>> ms*. Recommendations:[RecommendedItem[item:3890, value:4.6771617],
>>> RecommendedItem[item:3530, value:4.662509], RecommendedItem[item:127,
>>> value:4.660716], RecommendedItem[item:3323, value:4.660716],
>>> RecommendedItem[item:3382, value:4.660716], RecommendedItem[item:3123,
>>> value:4.603366], RecommendedItem[item:3233, value:4.5707765],
>>> RecommendedItem[item:1434, value:4.553473], RecommendedItem[item:989,
>>> value:4.5263577], RecommendedItem[item:2343, value:4.524066]]
>>> Created recommendations in* 985.679
>>> ms.* Recommendations:[RecommendedItem[item:3530, value:5.0],
>>> RecommendedItem[item:3382, value:5.0], RecommendedItem[item:3890,
>>> value:4.6771617], RecommendedItem[item:127, value:4.660716],
>>> RecommendedItem[item:3323, value:4.660716], RecommendedItem[item:3123,
>>> value:4.603366], RecommendedItem[item:3233, value:4.5707765],
>>> RecommendedItem[item:1434, value:4.553473], RecommendedItem[item:989,
>>> value:4.5263577], RecommendedItem[item:2343, value:4.524066]]
>>>
>>> Again, almost the same result. What I also don't understand is why I am
>>> getting different recommended items.
>>> That really frustrates me…
>>>
>>> You can find the Java file in the attachment.
>>>
>>>
>>>
>>> Greetings from Germany,
>>> Najum
>>>
>>> On 17.04.2014 at 11:44, Sebastian Schelter <ssc@apache.org> wrote:
>>>
>>>> Yes, just to make sure the problem is in the mahout code and not in the
>>>> surrounding environment.
>>>>
>>>> On 04/17/2014 11:43 AM, Najum Ali wrote:
>>>>> @Sebastian
>>>>> What do you mean by a standalone recommender? A simple offline Java main program?
>>>>>
>>>>> On 17.04.2014 at 11:41, Sebastian Schelter <ssc@apache.org> wrote:
>>>>>
>>>>>> Could you take the output of the precomputation, feed it into a standalone
>>>>>> recommender and test it there?
>>>>>>
>>>>>>
>>>>>> On 04/17/2014 11:37 AM, Najum Ali wrote:
>>>>>>> @sebastian
>>>>>>>
>>>>>>>> Are you sure that the precomputation is done only once and not in every
>>>>>>>> request?
>>>>>>> Yes, a @Bean-annotated object is a singleton instance by default in Spring.
>>>>>>> I also just tested it using a System.out.println().
>>>>>>> Here is my log:
>>>>>>>
>>>>>>> System.out.println("----> precomputation done!") is called before returning the
>>>>>>> GenericItemSimilarity.
>>>>>>>
>>>>>>> The first two recommendations are item-based with Pearson similarity.
>>>>>>> The third and fourth log entries are also item-based, using the precomputed similarity.
>>>>>>> The last log entry is the user-based recommender using Pearson.
>>>>>>>
>>>>>>> Look at the huge time difference!
>>>>>>>
>>>>>>> On 17.04.2014 at 11:23, Sebastian Schelter <ssc@apache.org> wrote:
>>>>>>>
>>>>>>>> Najum,
>>>>>>>>
>>>>>>>> this is really strange; feeding an item-based recommender with precomputed
>>>>>>>> similarities should give you superfast recommendations.
>>>>>>>>
>>>>>>>> Are you sure that the precomputation is done only once and not in every
>>>>>>>> request?
>>>>>>>>
>>>>>>>> --sebastian
>>>>>>>>
>>>>>>>> On 04/17/2014 11:17 AM, Najum Ali wrote:
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> I have created a precomputed item-item similarity collection for a
>>>>>>>>> GenericItemBasedRecommender.
>>>>>>>>> Using the 1M MovieLens data, my item-based recommender is only 40-50% faster
>>>>>>>>> than without precomputation (about 589.5 ms instead of 1222.9 ms).
>>>>>>>>> The user-based recommender, however, is really fast: about 24.2 ms. How can
>>>>>>>>> this happen?
>>>>>>>>>
>>>>>>>>> Here are more details about my implementation:
>>>>>>>>>
>>>>>>>>> CSV file: 1M preferences, 6040 users, 3706 items
>>>>>>>>>
>>>>>>>>> I am showing my implementation as screenshots because they keep the syntax
>>>>>>>>> highlighting.
>>>>>>>>> My recommender runs inside a web server (Jetty) using Spring 4 and Java 8. I
>>>>>>>>> receive recommendations via a web service (JSON).
>>>>>>>>>
>>>>>>>>> For the DataModel, I'm using FileDataModel.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The code below creates a precomputed ItemSimilarity when I start the
>>>>>>>>> web server and the property isItemPreComputationEnabled is set to true:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For time measurement I'm using AOP. I measure the whole time from entering my
>>>>>>>>> controller to sending the response, based on the difference between two
>>>>>>>>> System.nanoTime() readings. The user-based recommender is measured the same way.
>>>>>>>>>
>>>>>>>>> I have tried caching the recommender and the similarity, with no big
>>>>>>>>> difference. I also tried using CandidateItemsStrategy and
>>>>>>>>> MostSimilarItemsCandidateItemsStrategy, but that gave no performance boost either.
>>>>>>>>>
>>>>>>>>> public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity)
>>>>>>>>>         throws TasteException {
>>>>>>>>>     final int numberOfUsers = dataModel.getNumUsers();
>>>>>>>>>     final int numberOfItems = dataModel.getNumItems();
>>>>>>>>>     CandidateItemsStrategy candidateItemsStrategy =
>>>>>>>>>         new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
>>>>>>>>>     MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy =
>>>>>>>>>         new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
>>>>>>>>>     return model -> new GenericItemBasedRecommender(
>>>>>>>>>         model, similarity, candidateItemsStrategy, mostSimilarStrategy);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I don't know why item-based takes so much longer than user-based. User-based
>>>>>>>>> is extremely fast. I even tried datasets with 100k and 10 million
>>>>>>>>> preferences (MovieLens). Every time, user-based is much faster, for any similarity.
>>>>>>>>>
>>>>>>>>> I hope someone can help me understand this. Maybe I'm doing something
>>>>>>>>> wrong.
>>>>>>>>>
>>>>>>>>> Thanks!! :))
>>>>>
>>>>>
>>>>
>>>
>>
>
>

