Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()
From: Pat Ferrel
Date: Thu, 16 Jul 2015 13:35:09 -0700
To: "user@mahout.apache.org"

I can’t exactly reproduce your environment and I’m also
unable to reproduce your error using the CLI. If you take that snippet of data from your program and use the CLI to read it, does the error still occur? When I try it, everything is fine. But like I said, I can’t use a clustered yarn.

Here’s a snippet you can try, a trait that I attach to my Scala App to get some Mahout/Spark setup. It will create a SparkContext inside of mahoutSparkContext so do all your job config first.

    // Imports the snippet relies on (assumed; they were not part of the original message)
    import java.io.Closeable
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.mahout.sparkbindings._

    /** Put things here that setup the context for typical execution with Spark. This
      * should be mixed in to the object executing rdd operations to provide implicit
      * config and context values.
      */
    abstract trait SparkJobContext {
      implicit protected var sparkConf = new SparkConf()
      implicit protected var sc: SparkContext = _
      implicit var mc: SparkDistributedContext = _

      def setupSparkContext(master: String = "local", sem: String = "3G",
          appName: String = this.getClass.getSimpleName, customJars: String = ""): Unit = {
        val closeables = new java.util.ArrayDeque[Closeable]()
        sparkConf.set("spark.kryo.referenceTracking", "false")
          .set("spark.kryoserializer.buffer.mb", "200")
        val jars = customJars.split(",").toTraversable
        // mahoutSparkContext creates the SparkContext, so sparkConf must be complete by now
        mc = mahoutSparkContext(masterUrl = master, appName = appName,
          customJars = jars, sparkConf = sparkConf)
        sc = mc.sc
      }
    }

On Jul 13, 2015, at 10:37 AM, Hegner, Travis wrote:

I've added the Kryo registrator configs to my SparkConf and get the same results. I am very new to Mahout so I was not aware of the requirement for those.

I did have some unused imports in there if that is what you meant by the "distributed context" comment. I've pushed a couple of updates to add the configs you mentioned, and remove the unused imports.

Keep in mind that the program I linked to has the sole purpose of reproducing the exception I'm experiencing. I am not using Int.MaxValue in my actual driver program, I am using just the defaults of 50 and 500.
I only put those into this program to make it easier to reproduce the exception. The reproduction program is as stripped down and simple as possible while still producing the exception, to aid in troubleshooting this thing. Is there some documentation somewhere on the recommended minimum sizes for those parameters given the size of a dataset? I'm sure that question is dataset specific, but some general guidelines could be helpful if they exist somewhere so that I'm not burning CPU for no reasonable difference in accuracy.

Given your suggestions, I am still getting the same exception. Everything for the spark instance on my cloudera cluster is the default. Would it still be helpful to see a dump of information from somewhere? The 'Environment' tab from the job's web interface? I typically try to let everything run with defaults, until I need to make/test something more specific. I guess it's how I learn to use the software. I am running this command to submit the job:

    spark-submit --class com.travishegner.RowSimTest.RowSimTest RowSimTest-0.0.1-SNAPSHOT.jar

The only difference in the calling command for my real driver program is a "--jars" option to distribute a dependency.

Thanks again for the help!

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Monday, July 13, 2015 12:36 PM
To: user@mahout.apache.org
Cc: Dmitriy Lyubimov
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

You aren’t setting up the Mahout Kryo registrator. Without this I’m surprised it runs at all.
Make sure the Spark settings use these values:

    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer.mb": "300",
    "spark.executor.memory": "4g" // or something larger than default

Not sure if the distributed context is needed too, maybe Dmitriy knows more.

BTW I wouldn’t use Int.max. The calculation will approach O(n^2) with virtually no effective gain in accuracy and may even cause problems.

If none of this helps I can set up yarn on my dev machine. Can you give me the spark-submit CLI and all Spark settings?

On Jul 13, 2015, at 7:51 AM, Hegner, Travis wrote:

So, I've yet to be able to reproduce this with "--master local" or "--master local[8]". It has only occurred on my cloudera/spark/yarn cluster with both "--master yarn-client" and "--master yarn-cluster". I don't have a spark standalone cluster to test against.

I have put up a test program on my github account which contains hard coded test data: https://github.com/travishegner/RowSimTest. My pom.xml is including the mahout libraries into my final jar via shade in order to test against my own version of mahout (actually yours right now Pat!), rather than the one built into the cluster.

With this dataset the exception is sporadic (50% maybe) with the default params for "maxInterestingSimilaritiesPerRow" and "maxObservationsPerRow", but when I pass Int.MaxValue for each of those it seems to occur more regularly, but still succeeds at times. Sometimes, my driver program will throw the exception, but retry the failed task and continue on to complete the program successfully. Other times it will completely fail after too many retries. I can literally run the same jar back-to-back without recompiling and get different results.
I also ruled out a hardware issue by decommissioning the Yarn NodeManager service on all but one of my nodes to isolate it to a single node. I did that again on a separate node with similar results. The frequency of the exception is directly related to the size of the dataset. The smaller I make the dataset, the more often it succeeds, and I have yet to get a successful execution with a large enough subset of my full dataset.

Interestingly enough, if I map the IDS into flipped values (, ) and run it through the cooccurrencesIDSs() method, it never fails (see the commented code block). If I run the reverse mapping through rowSimilarityIDS(), it still fails in the same way.

Can you recommend any other troubleshooting steps to try? Is there any more information that I can provide?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Sunday, July 12, 2015 8:18 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I tried a couple datasets this weekend and could not get the error to reproduce. Could you share some data or the code that creates the IndexedDataset?

I wonder whether the IndexedDataset is created correctly. It will construct from an rdd and two BiDictionaries but that doesn’t mean they have correctly formatted values. It needs a Mahout DRM in the rdd, which means int keys and vector values with two BiDictionaries for key <-> string mappings for column and row. Also the int keys need to be contiguous ints 0..n

On Jul 10, 2015, at 11:40 AM, Pat Ferrel wrote:

The IndexedDataset creates two BiDictionaries (Bi-directional dictionaries) of Int <-> String so if it can be a String the element ids have no other restrictions.
May indeed be a bug. I’ll look at it ASAP. Since it passes the scala tests, any data you can spare might help, but if you are doing a lot of prep, maybe that’s not so easy?

On Jul 10, 2015, at 11:16 AM, Hegner, Travis wrote:

I am actually not using the CLI, I am using the API directly. Also, I am transforming the data into an RDD of (BigDecimal, String), mapping that to (String,String) and creating an IndexedDatasetSpark which I feed into rowSimilarityIDS(). This same process works flawlessly when calling cooccurrencesIDSs(Array(IDS)) on an IDS that was generated from an RDD of (, ).

My string tags do have some special characters, so I have been simply hashing them into an md5 string as a precaution since it shouldn't change the final result. I will try and scan the data for any nulls or other oddities. If I can't find anything obvious, then I'll try to pare it down to a small enough sample that is still affected in order to share.

Are there any normalizing rules that I should be aware of? For example, must all the doc_id's be the same length string?

Thanks,

Travis

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Friday, July 10, 2015 1:34 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Ok. Don’t suppose you could share your data or at least a snippet? Some odd errors can creep in if there is invalid data, like a null doc id or tag. Very little data validation is done, which is something I need to address. I’ll try it on some sample data I have.

BTW you understand that rowSimilarity input is a doc-id, list-of-tags where by default a tab separates the doc-id from the list and a space separates items in the list. Separators can be changed in the code but not the CLI.
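
As an illustration of the input layout described above (doc-id, then a tab, then space-separated tags), here is a minimal sketch in plain Scala. It does not use any Mahout API; the object name RowSimilarityInputSketch and the helper parseLine are hypothetical and exist only for this example. Lines with a missing doc id or no tab are dropped, which is one simple way to guard against the kind of invalid data mentioned in the message.

```scala
object RowSimilarityInputSketch {
  /** Hypothetical helper: split one "docId<TAB>tag1 tag2 ..." line into
    * (docId, tag) pairs. Separators are passed to String.split, so they are
    * treated as regular expressions. Returns an empty Seq for lines that have
    * no row separator or an empty doc id.
    */
  def parseLine(line: String,
                rowSep: String = "\t",
                itemSep: String = " "): Seq[(String, String)] =
    line.split(rowSep, 2) match {
      case Array(docId, tags) if docId.nonEmpty =>
        // Skip empty items produced by repeated item separators
        tags.split(itemSep).filter(_.nonEmpty).map(tag => (docId, tag)).toSeq
      case _ =>
        Seq.empty
    }
}
```

Calling RowSimilarityInputSketch.parseLine("doc1\ttag1 tag2") yields the pairs (doc1, tag1) and (doc1, tag2), the same shape as the (doc, tag) records discussed in this thread.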
On Jul 10, 2015, at 9:54 AM, Hegner, Travis wrote:

Thanks Pat,

With a clean version of your spark-1.3 branch I continue to get the error. You can find the stack trace at the end of the message. As I mentioned in my original message, I've narrowed it down to (k21 < 0), however, I'm not entirely certain it's based on the data condition I described, as I set up a test case with a small amount of data exhibiting the same condition described, and it works OK.

How is it possible that "numInteractionsWithB=0" while "numInteractionsWithAandB=1"? I would think that the latter would always have to be less than or equal to the former.

Thanks!

Travis

java.lang.IllegalArgumentException
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:72)
    at org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(LogLikelihood.java:101)
    at org.apache.mahout.math.cf.SimilarityAnalysis$.logLikelihoodRatio(SimilarityAnalysis.scala:201)
    at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:229)
    at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcVI$sp$1.apply(SimilarityAnalysis.scala:222)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4$$anonfun$apply$1.apply$mcVI$sp(SimilarityAnalysis.scala:222)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:215)
    at org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$4.apply(SimilarityAnalysis.scala:208)
    at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
    at org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1071)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

-----Original Message-----
From: Pat Ferrel [mailto:pat@occamsmachete.com]
Sent: Thursday, July 09, 2015 10:09 PM
To: user@mahout.apache.org
Subject: Re: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

I am using Mahout every day on Spark 1.3.1. Try https://github.com/pferrel/mahout/tree/spark-1.3, which is the one I’m using. Let me know if you still have the problem and include the stack trace. I’ve been using cooccurrence, which is closely related to rowSimilarity.

> Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (,) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?

cooccurrence calculates llr(A’A), and rowSimilarity is doing llr(AA’).
The input you are talking about is A’ so you would be doing llr((A’)’(A’)), which should produce the same results, but let’s get it working. I’ll look at it either tomorrow or this weekend. If you have any stack trace using the above branch, let me know.

BTW what Dmitriy said is correct, IntelliJ is often not able to determine every decoration function available.

On Jul 9, 2015, at 12:02 PM, Hegner, Travis wrote:

FYI, I just tested against the latest spark-1.3 version I found at: https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1653-shell-master

I am getting the exact results described below.

Thanks again!

Travis

-----Original Message-----
From: Hegner, Travis [mailto:THegner@trilliumit.com]
Sent: Thursday, July 09, 2015 10:25 AM
To: 'user@mahout.apache.org'
Subject: RowSimilarity API -- illegal argument exception from org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio()

Hello list,

I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS() job to run. First some info on my environment:

I'm running hadoop with cloudera 5.4.2 with their built-in spark on yarn setup. It's pretty much an OOTB setup, but it has been upgraded many times since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1 commits merged in from what I've read about cloudera's versioning). I have my own fork of mahout which is currently just a mirror of 'github.com:pferrel/spark-1.3'. I'm very comfortable making changes, compiling, and using my version of the library should your suggestions lead me in that direction. I am still pretty new to scala, so I have a hard time wrapping my head around what some of the syntactic sugars actually do, but I'm getting there.

I'm successfully getting my data transformed to an RDD that essentially looks like (, ), creating an IndexedDataSet with that, and feeding that into SimilarityAnalysis.rowSimilarityIDS().
I've been able to narrow the issue down to a specific case:

Let's say I have the following records (among others) in my RDD:

...
(doc1, tag1)
(doc2, tag1)
...

doc1 and doc2 have no other tags, but tag1 may exist on many other documents. The rest of my dataset has many other doc/tag combinations, but I've narrowed down the issue to seemingly only occur in this case. I've been able to trace down that the java.lang.IllegalArgumentException is occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and "numInteractionsWithAandB = 1") when calling LogLikelihood.logLikelihoodRatio() from SimilarityAnalysis.logLikelihoodRatio().

Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity() on the line (163 in my branch):

    val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)

...my IDE (intellij) complains that it cannot resolve "drmA.numNonZeroElementsPerRow", however the library compiles successfully. Tracing the codepath shows that if that value is not being correctly populated, it would have a direct impact on the values used in logLikelihoodRatio(). That said, it seems to only fail in this very particular case.

I should note that I can run SimilarityAnalysis.cooccurrencesIDSs() successfully with a single list of (, ) pairs of my own data.

I have 3 questions given this scenario:

First, am I using the proper branch of code for attempting to run on a spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this was the only branch I could find for it.

Second, is anyone able to shed some light on the above error? Is drmA not a correct type, or does that method no longer apply to that type?

Third, what would be the mathematical implications if I run SimilarityAnalysis.cooccurrencesIDSs() with a list of (,) pairs. Would the results be sound, or does that make absolutely no sense? Would it be beneficial even as only a troubleshooting step?
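
To make the k21 < 0 condition concrete, here is a minimal sketch in plain Scala of how four interaction counts map onto a 2x2 contingency table. It assumes the cells are derived by simple subtraction, which is consistent with the values quoted above (numInteractionsWithB = 0 and numInteractionsWithAandB = 1 giving k21 = -1) but is not copied from the Mahout source. The names LlrCountsSketch, Counts, and counts are hypothetical.

```scala
object LlrCountsSketch {
  /** The four cells of a 2x2 contingency table for a log-likelihood ratio test. */
  final case class Counts(k11: Long, k12: Long, k21: Long, k22: Long) {
    // Every cell is a count, so a negative cell means the inputs are inconsistent
    def isValid: Boolean = k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0
  }

  /** Assumed derivation: each cell is obtained by subtracting the joint count. */
  def counts(withA: Long, withB: Long, withAandB: Long, total: Long): Counts =
    Counts(
      k11 = withAandB,                          // A and B together
      k12 = withA - withAandB,                  // A without B
      k21 = withB - withAandB,                  // B without A
      k22 = total - withA - withB + withAandB)  // neither A nor B
}
```

Under this assumption, LlrCountsSketch.counts(1, 0, 1, 10) has k21 = 0 - 1 = -1, so isValid is false. A joint count larger than either marginal count is the only way k12 or k21 can go negative, which points at the per-row counts rather than at the LLR computation itself.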
Thanks in advance for any help you may be able to provide!

Travis Hegner

________________________________

The information contained in this communication is confidential and is intended only for the use of the named recipient. Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful. If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender.