mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Scaling up spark item similarity on big data sets
Date Thu, 23 Jun 2016 18:20:12 GMT
This actually sounds like a very small problem.

My guess is that there are bad settings for the interaction and frequency
cuts.
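
Concretely, those cuts are the downsampling arguments on the cooccurrence
call. A minimal sketch of where they go, assuming the cooccurrencesIDSs
overload that takes maxInterestingItemsPerThing and maxNumInteractions
(parameter names and defaults may differ across Mahout versions, so check
the signature you have):

  import org.apache.mahout.math.cf.SimilarityAnalysis
  import org.apache.mahout.math.indexeddataset.IndexedDataset

  // viewsIDS is the IndexedDataset built from the (user, item) view events.
  // Lowering maxNumInteractions increases per-row downsampling;
  // maxInterestingItemsPerThing caps how many similar items are kept per item.
  def similarItems(viewsIDS: IndexedDataset) =
    SimilarityAnalysis.cooccurrencesIDSs(
      Array(viewsIDS),
      maxInterestingItemsPerThing = 50,   // the "frequency" cut
      maxNumInteractions = 100)           // the "interaction" cut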



On Thu, Jun 23, 2016 at 11:07 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

> In addition to increasing downsampling there are some other things to
> note. The original OOM was caused by the use of BiMaps to store your row
> and column ids. The total storage they need grows with the number of ids,
> at 2 hashmaps per id type. With only 16g you may have very little else
> left for the running algo. These data structures are needed on each
> executor and the driver, so they won’t benefit from more executors. The only
> way to decrease this is to decrease the size of your id strings somehow. I
> would be surprised if you can run the algo with that much data in 16g.
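>
> A very rough back-of-envelope for why 16g gets tight; the per-entry cost
> here is a guess, and real JVM overheads will differ:
>
>   // counts from the thread below: ~1.4M users and ~5.1M items
>   val users = 1384107L
>   val items = 5135845L
>   // assumed cost per id and per map: the id string itself plus hashmap
>   // entry, boxing and reference overhead -- call it ~200 bytes
>   val bytesPerIdPerMap = 200L
>   val mapsPerIdType    = 2L   // forward and reverse map in a BiMap
>   val approxBytes = (users + items) * bytesPerIdPerMap * mapsPerIdType
>   println(f"~${approxBytes / 1e9}%.1f GB just for the id dictionaries")
>
> And that is before Spark's own overhead and the matrix blocks themselves.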
>
> I’m working on an implementation that uses RDDs and joins to be more
> memory efficient when translating ids, but it will run a fair bit slower. In
> that case increasing the partition number and the number of executors would
> help speed it up.
>
> On Jun 23, 2016, at 7:01 AM, Sebastian <ssc@apache.org> wrote:
>
> Hi,
>
> Pairwise similarity is a quadratic problem, and it's very easy to reach a
> problem size that does not scale anymore, especially with so many items. Our
> code downsamples the input data to help with this.
>
> One thing you can do is decrease the maxNumInteractions argument to
> increase the amount of downsampling. Another thing you can do is remove the
> items with the highest number of interactions from the dataset, as they are
> usually not very interesting (everybody knows the topsellers already) and
> they heavily impact the computation.
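>
> For the second suggestion, a small pre-filter on the raw (user, item)
> pairs is usually enough. A sketch in plain Spark, with a hypothetical
> cutoff of the 100 most viewed items:
>
>   import org.apache.spark.rdd.RDD
>
>   // views: RDD of (userId, itemId) pairs parsed from the CSV
>   def dropTopSellers(views: RDD[(String, String)], topN: Int = 100): RDD[(String, String)] = {
>     val topItems = views
>       .map { case (_, item) => (item, 1L) }
>       .reduceByKey(_ + _)
>       .top(topN)(Ordering.by(_._2))   // the topN most interacted-with items
>       .map(_._1)
>       .toSet
>     val hot = views.sparkContext.broadcast(topItems)
>     views.filter { case (_, item) => !hot.value.contains(item) }
>   }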
>
> Best,
> Sebastian
>
>
> On 23.06.2016 15:47, jelmer wrote:
> > Hi,
> >
> > I am trying to build a simple recommendation engine using spark item
> > similarity (e.g. with
> > org.apache.mahout.math.cf.SimilarityAnalysis.cooccurrencesIDSs)
> >
> > Things work fine on a comparatively small dataset but I am having
> > difficulty scaling it up.
> >
> > The input I am using is CSV data containing 19.988.422 view-item events
> > produced by 1.384.107 users, covering 5.135.845 distinct products.
> >
> > The csv data is stored on hdfs and is split up over 15 files;
> > consequently the resultant RDD will have 15 partitions.
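> >
> > For reference, the loading side looks roughly like this (the exact path
> > and the "userId,itemId" column order are elided/placeholders):
> >
> >   import org.apache.spark.SparkContext
> >
> >   // sc is the existing SparkContext
> >   def loadViews(sc: SparkContext) =
> >     sc.textFile("hdfs:///path/to/view-events/*.csv")  // 15 files -> 15 input partitions
> >       .map(_.split(","))
> >       .collect { case Array(user, item) => (user, item) }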
> >
> > After tweaking some parameters I did manage to get the job to run without
> > going out of memory, but the job takes a very, very long time to run.
> >
> > After running for 15 hours it is still stuck on:
> >
> > org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
> > org.apache.mahout.sparkbindings.blas.AtA$.at_a_nongraph_mmul(AtA.scala:254)
> > org.apache.mahout.sparkbindings.blas.AtA$.at_a(AtA.scala:61)
> > org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:325)
> > org.apache.mahout.sparkbindings.SparkEngine$.tr2phys(SparkEngine.scala:339)
> > org.apache.mahout.sparkbindings.SparkEngine$.toPhysical(SparkEngine.scala:123)
> > org.apache.mahout.math.drm.logical.CheckpointAction.checkpoint(CheckpointAction.scala:41)
> > org.apache.mahout.math.drm.package$.drm2Checkpointed(package.scala:95)
> > org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:145)
> > org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:143)
> > scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> > scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> > scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
> > scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> > scala.collection.AbstractIterator.to(Iterator.scala:1157)
> > scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
> > scala.collection.AbstractIterator.toList(Iterator.scala:1157)
> >
> >
> > I am using spark on yarn and containers cannot use more than 16gb
> >
> > I figured I would be able to speed things up by throwing a larger number
> > of executors at the problem, but so far that is not working out very well.
> >
> > I tried assigning 500 executors and repartitioning the input data to 500
> > partitions, and even changing spark.yarn.driver.memoryOverhead to crazy
> > values (half of the heap) did not resolve this.
> >
> > Could someone offer any guidance on how to best speed up item similarity
> > jobs?
> >
>
>
