Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2102\))
Subject: Re: RowSimilarity API -- illegal argument exception from
 org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()
From: Pat Ferrel <pat@occamsmachete.com>
In-Reply-To: 
 <f162681730a2422c940b92515ac9cb58@TSS-EX2013-1.ad.trilliumstaffing.com>
Date: Thu, 16 Jul 2015 13:35:09 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <CEFAE604-0ABE-44CC-AF65-F97A4254445E@occamsmachete.com>
References: 
 <b97d5cad1c234a969332ad026100032d@TSS-EX2013-1.ad.trilliumstaffing.com>
 <50f7ded4ff6146dcb22524c136aca23d@TSS-EX2013-1.ad.trilliumstaffing.com>
 <08953C8B-89BE-4141-828B-6DF883F89601@occamsmachete.com>
 <62b5885c26a04b6591f613f44e347d93@TSS-EX2013-1.ad.trilliumstaffing.com>
 <1158DF55-0E78-4B60-9C9E-B54B70256DCD@occamsmachete.com>
 <157fd19a32534f8eb11a46174eb8c167@TSS-EX2013-1.ad.trilliumstaffing.com>
 <0CCAAF7C-E3E6-446D-9EA1-7FDA32482E64@occamsmachete.com>
 <E3D570B6-D9E0-4627-BF78-F911FC37489A@occamsmachete.com>
 <38bdc430923f448a9e80f53bbac8d206@TSS-EX2013-1.ad.trilliumstaffing.com>
 <770200EC-4A8F-4378-AEC0-FFA6D5A85F56@occamsmachete.com>
 <f162681730a2422c940b92515ac9cb58@TSS-EX2013-1.ad.trilliumstaffing.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>

I can’t exactly reproduce your environment, and I’m also unable to
reproduce your error using the CLI. If you take that snippet of data
from your program and read it with the CLI, does the error still occur?
When I try it, everything is fine.

But like I said, I can’t test against a clustered YARN setup.

Here’s a snippet you can try: a trait that I attach to my Scala App to
get some Mahout/Spark setup. It creates a SparkContext inside
mahoutSparkContext, so do all your job config first.

/** Put things here that setup the context for typical execution with Spark. This
  * should be mixed in to the object executing rdd operations to provide implicit
  * config and context values.
  */
abstract trait SparkJobContext {
  implicit protected var sparkConf = new SparkConf()
  implicit protected var sc: SparkContext = _
  implicit var mc: SparkDistributedContext = _

  def setupSparkContext(master: String = "local", sem: String = "3G",
    appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
    val closeables = new java.util.ArrayDeque[Closeable]()

    sparkConf.set("spark.kryo.referenceTracking", "false")
      .set("spark.kryoserializer.buffer.mb", "200")
    val jars = customJars.split(",").toTraversable
    mc = mahoutSparkContext(masterUrl = master, appName = appName,
      customJars = jars, sparkConf = sparkConf)
    sc = mc.sc
  }
}

On Jul 13, 2015, at 10:37 AM, Hegner, Travis <THegner@trilliumit.com> wrote:

I've added the Kryo registrator configs to my SparkConf and get the same
results. I am very new to Mahout so I was not aware of the requirement
for those.

I did have some unused imports in there if that is what you meant by the
"distributed context" comment. I've pushed a couple of updates to add
the configs you mentioned, and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of
reproducing the exception I'm experiencing. I am not using Int.MaxValue
in my actual driver program; I am using just the defaults of 50 and 500.
I only put those into this program to make it easier to reproduce the
exception. The reproduction program is as stripped down and simple as
possible while still producing the exception, to aid in troubleshooting
this thing. Is there some documentation somewhere on the recommended
minimum sizes for those parameters given the size of a dataset? I'm sure
that question is dataset specific, but some general guidelines could be
helpful if they exist somewhere so that I'm not burning CPU for no
reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception.
Everything for the spark instance on my cloudera cluster is the default.
Would it still be helpful to see a dump of information from somewhere?
The 'Environment' tab from the job's web interface? I typically try to
let everything run with defaults, until I need to make/test something
more specific. I guess it's how I learn to use the software. I am
running this command to submit the job:

spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is
a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m
surprised it runs at all. Make sure the Spark settings use these values:

"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer.mb": "300",
"spark.executor.memory": "4g" // or something larger than default
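
Since the job in this thread is launched with spark-submit, one way to pass the settings listed above is as "--conf key=value" pairs on the command line. The following is only a minimal sketch in plain Scala: the object and method names (SparkSubmitConf, toSubmitArgs) are hypothetical helpers, not part of Mahout or Spark, and the executor memory value is just the example figure from the message.

```scala
object SparkSubmitConf {
  // Settings copied from the message above. The executor memory is only an
  // example value ("or something larger than default").
  val mahoutSettings: Seq[(String, String)] = Seq(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" -> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking" -> "false",
    "spark.kryoserializer.buffer.mb" -> "300",
    "spark.executor.memory" -> "4g"
  )

  // Hypothetical helper: render each setting as a "--conf key=value" pair
  // that can be placed on the spark-submit command line.
  def toSubmitArgs(settings: Seq[(String, String)]): Seq[String] =
    settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

Joining SparkSubmitConf.toSubmitArgs(SparkSubmitConf.mahoutSettings) with spaces gives the extra arguments to place before the jar name.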

Not sure if the distributed context is needed too, maybe Dmitriy knows
more.

BTW I wouldn’t use Int.MaxValue. The calculation will approach O(n^2)
with virtually no effective gain in accuracy and may even cause problems.

If none of this helps I can set up yarn on my dev machine. Can you give
me the spark-submit CLI and all Spark settings?


On Jul 13, 2015, at 7:51 AM, Hegner, Travis <THegner@trilliumit.com> wrote:

So, I've yet to be able to reproduce this with "--master local" or
"--master local[8]"; it has only occurred on my cloudera/spark/yarn
cluster with both "--master yarn-client" and "--master yarn-cluster". I
don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard
coded test data: https://github.com/travishegner/RowSimTest. My pom.xml
is including the mahout libraries into my final jar via shade in order
to test against my own version of mahout (actually yours right now,
Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default
params for "maxInterestingSimilaritiesPerRow" and
"maxObservationsPerRow", but when I pass Int.MaxValue for each of those
it seems to occur more regularly, though it still succeeds at times.
Sometimes my driver program will throw the exception, but retry the
failed task and continue on to complete the program successfully; other
times it will completely fail after too many retries. I can literally
run the same jar back-to-back without recompiling and get different
results. I also ruled out a hardware issue by decommissioning the Yarn
NodeManager service on all but one of my nodes to isolate it to a single
node. I did that again on a separate node with similar results. The
frequency of the exception is directly related to the size of the
dataset. The smaller I make the dataset, the more often it succeeds, and
I have yet to get a successful execution with a large enough subset of
my full dataset.

Interestingly enough, if I map the IDS into flipped values (<tag>,
<doc_id>) and run it through the cooccurrencesIDSs() method, it never
fails (see the commented code block). If I run the reverse mapping
through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any
more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to
reproduce. Could you share some data or the code that creates the
IndexedDataset?

I wonder whether the IndexedDataset is created correctly. It will
construct from an rdd and two BiDictionaries, but that doesn’t mean
they have correctly formatted values. It needs a Mahout DRM in the rdd,
which means int keys and vector values, with two BiDictionaries for key
<-> string mappings for column and row. Also the int keys need to be
contiguous ints 0..n
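
The contiguous-key requirement described above can be illustrated without any Mahout classes. This is only a sketch in plain Scala: ContiguousIds, buildDictionary, and isContiguous are hypothetical names, and a plain Map stands in for one direction of a BiDictionary. It shows one way to assign contiguous Int keys to string IDs and to check a set of keys for gaps before building a dataset from them.

```scala
object ContiguousIds {
  // Assign each distinct string ID an Int key, numbered from 0 in
  // first-seen order, so the keys have no gaps.
  def buildDictionary(ids: Seq[String]): Map[String, Int] =
    ids.distinct.zipWithIndex.toMap

  // True when the keys are exactly 0, 1, 2, ... with no gaps or repeats.
  def isContiguous(keys: Iterable[Int]): Boolean =
    keys.toSeq.sorted.zipWithIndex.forall { case (key, index) => key == index }
}
```

Running isContiguous over the row keys and the column keys separately is a cheap sanity check to add before handing the data to the similarity code.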


On Jul 10, 2015, at 11:40 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

The IndexedDataset creates two BiDictionaries (bi-directional
dictionaries) of Int <-> String, so if it can be a String the element
ids have no other restrictions.

May indeed be a bug; I’ll look at it asap. Since it passes the scala
tests, any data you can spare might help, but if you are doing a lot of
prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis <THegner@trilliumit.com> wrote:

I am actually not using the CLI, I am using the API directly. Also, I am
transforming the data into an RDD of (BigDecimal, String), mapping that
to (String,String) and creating an IndexedDatasetSpark which I feed into
rowSimilarityIDS(). This same process works flawlessly when calling
cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD
of (<tag>, <doc_id>).

My string tags do have some special characters, so I have been simply
hashing them into an md5 string as a precaution since it shouldn't
change the final result. I will try and scan the data for any nulls or
other oddities. If I can't find anything obvious, then I'll try to pare
it down to a small enough sample that is still affected in order to
share.
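
For reference, hashing a tag into an md5 hex string as described above can be done with the JDK alone. This is a minimal sketch, not the code from the actual driver program: TagHash and md5Hex are hypothetical names, and the UTF-8 encoding and the rejection of null or empty tags are assumptions.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object TagHash {
  // Hash a tag into a 32-character lowercase hex MD5 string.
  // Null or empty tags are rejected up front, since invalid values are
  // the kind of oddity being scanned for.
  def md5Hex(tag: String): String = {
    require(tag != null && tag.nonEmpty, "tag must be non-null and non-empty")
    MessageDigest.getInstance("MD5")
      .digest(tag.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString
  }
}
```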

Are there any normalizing rules that I should be aware of? For example,
all the doc_id's must be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some
odd errors can creep in if there is invalid data, like a null doc id or
tag. Very little data validation is done, which is something I need to
address. I’ll try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags
where by default a tab separates the doc-id from the list and a space
separates items in the list. Separators can be changed in the code but
not the CLI.
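
To make the default text layout described above concrete, here is a small parser sketch in plain Scala. It is not the Mahout reader: RowSimilarityInput, parseLine, and toPairs are hypothetical names. It only mirrors the stated layout (doc-id, a tab, then space-separated tags) and drops lines with a missing doc-id or no tab.

```scala
import java.util.regex.Pattern

object RowSimilarityInput {
  // Parse "doc-id<TAB>tag1 tag2 ..." into (docId, tags).
  // Returns None when the line has no tab or an empty doc-id.
  def parseLine(line: String, idDelim: String = "\t",
                tagDelim: String = " "): Option[(String, List[String])] =
    line.split(Pattern.quote(idDelim), 2) match {
      case Array(docId, tags) if docId.nonEmpty =>
        Some(docId -> tags.split(Pattern.quote(tagDelim)).filter(_.nonEmpty).toList)
      case _ => None
    }

  // Flatten parsed lines into (docId, tag) pairs, the shape used with the API.
  def toPairs(lines: Seq[String]): Seq[(String, String)] =
    for {
      line <- lines
      (docId, tags) <- parseLine(line).toSeq
      tag <- tags
    } yield (docId, tag)
}
```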


On Jul 10, 2015, at 9:54 AM, Hegner, Travis <THegner@trilliumit.com> wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the
error. You can find the stack trace at the end of the message. As I
mentioned in my original message, I've narrowed it down to (k21 < 0).
However, I'm not entirely certain it's based on the data condition I
described, as I set up a test case with a small amount of data
exhibiting the same condition, and it works OK.

How is it possible that "numInteractionsWithB=0" while
"numInteractionsWithAandB=1"? I would think that the latter would
always have to be less than or equal to the former.
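
The arithmetic behind the failing check can be reproduced in isolation. This is a standalone sketch, not the Mahout source: LlrSketch, contingency, and llr are hypothetical names. The mapping from the four counts to the cells k11..k22 is an assumption, chosen to be consistent with the k21 < 0 observation above, and the entropy-based score is the usual log-likelihood ratio form. With numInteractionsWithB = 0 and numInteractionsWithAandB = 1, the k21 cell comes out as -1 and the non-negativity check rejects it.

```scala
object LlrSketch {
  final case class Contingency(k11: Long, k12: Long, k21: Long, k22: Long)

  // Assumed mapping from interaction counts to the 2x2 contingency table.
  def contingency(numInteractionsWithA: Long, numInteractionsWithB: Long,
                  numInteractionsWithAandB: Long, numInteractions: Long): Contingency =
    Contingency(
      k11 = numInteractionsWithAandB,
      k12 = numInteractionsWithA - numInteractionsWithAandB,
      k21 = numInteractionsWithB - numInteractionsWithAandB,
      k22 = numInteractions - numInteractionsWithA - numInteractionsWithB +
        numInteractionsWithAandB)

  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Unnormalized Shannon entropy of a list of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // Log-likelihood ratio; throws IllegalArgumentException on any negative cell.
  def llr(c: Contingency): Double = {
    require(c.k11 >= 0 && c.k12 >= 0 && c.k21 >= 0 && c.k22 >= 0,
      s"negative cell in $c")
    val rowEntropy = entropy(c.k11 + c.k12, c.k21 + c.k22)
    val colEntropy = entropy(c.k11 + c.k21, c.k12 + c.k22)
    val matEntropy = entropy(c.k11, c.k12, c.k21, c.k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }
}
```

The sketch only shows why the precondition fires once the per-row count is smaller than the co-occurrence count. It does not explain how the counts got out of step in the first place.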

Thanks!

Travis

java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1.

Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one
I’m using. Let me know if you still have the problem and include the
stack trace. I’ve been using cooccurrence, which is closely related to
rowSimilarity.

> Third, what would be the mathematical implications if I run
> SimilarityAnalysis.cooccurrencesIDSs() with a list of
> (<tag>,<document_id>) pairs. Would the results be sound, or does that
> make absolutely no sense? Would it be beneficial even as only a
> troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’).
The input you are talking about is A’, so you would be doing
llr((A’)’(A’)), which should produce the same results, but let’s get it
working. I’ll look at it either tomorrow or this weekend. If you have
any stack trace using the above branch, let me know.
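
The identity used above, (A’)’(A’) = AA’, can be checked on a tiny matrix. The following is a plain Scala sketch with hypothetical names (TransposeIdentity, ata, aat). It compares only the raw count products, not the LLR scores or the downsampling that the real jobs apply on top of them.

```scala
object TransposeIdentity {
  type Matrix = Vector[Vector[Int]]

  // Plain dense matrix product, adequate for a small illustration.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val bColumns = b.transpose
    a.map(row => bColumns.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
  }

  // Cooccurrence-style product A'A (column-by-column counts).
  def ata(a: Matrix): Matrix = multiply(a.transpose, a)

  // rowSimilarity-style product AA' (row-by-row counts).
  def aat(a: Matrix): Matrix = multiply(a, a.transpose)
}
```

For a 2x3 matrix A with rows (1, 0, 1) and (0, 1, 1), TransposeIdentity.ata(A.transpose) and TransposeIdentity.aat(A) give the same 2x2 result.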

BTW what Dmitriy said is correct, IntelliJ is often not able to
determine every decoration function available.


On Jul 9, 2015, at 12:02 PM, Hegner, Travis <THegner@trilliumit.com> wrote:

FYI, I just tested against the latest spark-1.3 version I found at:
https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS()
job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn
setup. It's pretty much an OOTB setup, but it has been upgraded many
times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps
with some 1.3.1 commits merged in, from what I've read about cloudera's
versioning). I have my own fork of mahout which is currently just a
mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making
changes, compiling, and using my version of the library should your
suggestions lead me in that direction. I am still pretty new to scala,
so I have a hard time wrapping my head around what some of the syntactic
sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially
looks like (<document_id>, <tag>), creating an IndexedDataset with that,
and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been
able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other
documents. The rest of my dataset has many other doc/tag combinations,
but I've narrowed down the issue to seemingly only occur in this case.
I've been able to trace down that the java.lang.IllegalArgumentException
is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and
"numInteractionsWithAandB = 1") when calling
LogLikelihood.logLikelihoodRatio() from
SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity() on
the line (163 in my branch):

val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve
"drmA.numNonZeroElementsPerRow", however the library compiles
successfully. Tracing the codepath shows that if that value is not being
correctly populated, it would have a direct impact on the values used in
logLikelihoodRatio(). That said, it seems to only fail in this very
particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs()
successfully with a single list of (<user_id>, <item_id>) pairs of my
own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a
spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and
this was the only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA
not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run
SimilarityAnalysis.cooccurrencesIDSs() with a list of
(<tag>,<document_id>) pairs. Would the results be sound, or does that
make absolutely no sense? Would it be beneficial even as only a
troubleshooting step?

Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is
intended only for the use of the named recipient. Unauthorized use,
disclosure, or copying is strictly prohibited and may be unlawful. If
you have received this communication in error, you should know that you
are bound to confidentiality, and should please immediately notify the
sender.
