mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Upgrade to Spark 1.1.0?
Date Tue, 21 Oct 2014 22:28:55 GMT
hm no they don't push different binary releases to maven. I assume they
only push the default one.

On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> ps i remember discussion for packaging binary spark distributions. So
> there's in fact a number of different spark artifact releases. However, i
> am not sure if they are pushing them to mvn repositories. (if they did,
> they might use different maven classifiers for those). If that's the case,
> then one plausible strategy here is to recommend rebuilding mahout with
> dependency to a classifier corresponding to the actual spark binary release
> used.
>
> On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
>
>> if you are using mahout shell or command line drivers (which i dont) it
>> would seem the correct thing to do is for mahout script simply to take
>> spark dependencies from installed $SPARK_HOME rather than from Mahout's
>> assembly. In fact that would be consistent with what other projects are
>> doing in similar situation. it should also probably make things compatible
>> between minor releases of spark.
>>
>> But i think you are right in a sense that the problem is that spark jars
>> are not uniquely encompassed by maven artifact id and version, unlike with
>> most other products. (e.g. if we see mahout-math-0.9.jar we expect there to
>> be one and only one released artifact in existence -- but one's local build
>> may create incompatible variations).
>>
>> On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <pat@occamsmachete.com>
>> wrote:
>>
>>> The problem is not in building Spark it is in building Mahout using the
>>> correct Spark jars. If you are using CDH and hadoop 2 the correct jars are
>>> in the repos.
>>>
>>> For the rest of us, though the process below seems like an error prone
>>> hack to me it does work on Linux and BSD/mac. It should really be addressed
>>> by Spark imo.
>>>
>>> BTW The cache is laid out differently on linux but I don’t think you
>>> need to delete it anyway.
>>>
>>> On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> wrote:
>>>
>>> fwiw i never built spark using maven. Always use sbt assembly.
>>>
>>> On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <pat@occamsmachete.com>
>>> wrote:
>>>
>>> > Ok, the mystery is solved.
>>> >
>>> > The safe sequence from my limited testing is:
>>> > 1) delete ~/.m2/repository/org/apache/spark and
>>> > ~/.m2/repository/org/apache/mahout
>>> > 2) build Spark for your version of Hadoop *but do not use "mvn package
>>> > ..."* use "mvn install ..." This will put a copy of the exact bits you
>>> > need into the maven cache for building mahout against. In my case using
>>> > hadoop 1.2.1 it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean
>>> > install". If you run tests on Spark some failures can safely be ignored
>>> > according to the Spark guys so check before giving up.
>>> > 3) build mahout with "mvn clean install"
>>> >
>>> > This will create mahout from exactly the same bits you will run on your
>>> > cluster. It got rid of a missing anon function for me. The problem
>>> > occurs when you use a different version of Spark on your cluster than you
>>> > used to build Mahout and this is rather hidden by Maven. Maven downloads
>>> > from repos any dependency that is not in the local .m2 cache and so you
>>> > have to make sure your version of Spark is there so Maven won’t download
>>> > one that is incompatible. Unless you really know what you are doing I’d
>>> > build both Spark and Mahout for now.
>>> >
>>> > BTW I will check in the Spark 1.1.0 version of Mahout once I do some
>>> > more testing.
>>> >
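One way to confirm that step 2 of the sequence above really left a Spark jar in the local Maven cache is to compute where Maven's default local repository layout would place it and check that the file exists. The following is a minimal Java sketch, not something posted in this thread: the class name LocalRepoCheck is hypothetical, and the coordinates org.apache.spark:spark-core_2.10:1.1.0 are an assumption for a Scala 2.10 build of Spark 1.1.0. The optional classifier argument covers the case, raised elsewhere in this thread, where differently built binaries might be published under different classifiers.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of Maven, Spark or Mahout. */
final class LocalRepoCheck {

    /**
     * Builds the jar path used by Maven's default local repository layout:
     * repo/group/id/parts/artifactId/version/artifactId-version[-classifier].jar
     */
    static Path jarPath(Path repo, String groupId, String artifactId,
                        String version, String classifier) {
        String suffix = (classifier == null || classifier.isEmpty()) ? "" : "-" + classifier;
        Path dir = repo;
        // Each dot-separated part of the groupId becomes one directory level.
        for (String part : groupId.split("\\.")) {
            dir = dir.resolve(part);
        }
        return dir.resolve(artifactId)
                  .resolve(version)
                  .resolve(artifactId + "-" + version + suffix + ".jar");
    }

    public static void main(String[] args) {
        Path repo = Paths.get(System.getProperty("user.home"), ".m2", "repository");
        // Assumed coordinates; adjust to the Spark and Scala versions you built.
        Path jar = jarPath(repo, "org.apache.spark", "spark-core_2.10", "1.1.0", null);
        System.out.println(jar + (Files.isRegularFile(jar) ? " is present" : " is missing"));
    }
}
```

Presence of the file does not by itself prove it came from your own build rather than a download, so comparing its modification time with the time of your "mvn install" run is a reasonable extra check.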
>>> > On Oct 21, 2014, at 10:26 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> >
>>> > Sorry to hear. I bet you’ll find a way.
>>> >
>>> > The Spark Jira trail leads to two suggestions:
>>> > 1) use spark-submit to execute code with your own entry point (other
>>> > than spark-shell). One theory points to not loading all needed Spark
>>> > classes from calling code (Mahout in our case). I can hand check the jars
>>> > for the anon function I am missing.
>>> > 2) there may be different class names in the running code (created by
>>> > building Spark locally) and the version referenced in the Mahout POM. If
>>> > this turns out to be true it means we can’t rely on building Spark
>>> > locally. Is there a maven target that puts the artifacts of the Spark
>>> > build in the .m2/repository local cache? That would be an easy way to
>>> > test this theory.
>>> >
>>> > Either of these could cause missing classes.
>>> >
>>> >
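Hand checking jars for a missing anonymous-function class, as mentioned in suggestion 1 above, can be scripted with the standard java.util.jar API. This is a minimal sketch under assumptions: JarClassCheck is a hypothetical name, and the class name you pass is whatever appears in your own missing-class error (Scala anonymous functions of this era show up with names like DAGScheduler$$anonfun$abortStage$1, as in the stack trace quoted further down in this thread).

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical helper; checks whether a jar contains a given class file. */
final class JarClassCheck {

    /**
     * Returns true if the jar has an entry for the given binary class name,
     * for example "org.example.Foo$$anonfun$1".
     */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // Class files are stored under the package path with a .class suffix.
        String entry = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarClassCheck <jar file> <class name>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]) ? "found" : "missing");
    }
}
```

Running it against both the jar deployed on the cluster and the jar Mahout was compiled against would show whether the two builds disagree about which classes exist. When passing a name containing $ on a shell command line, quote it so the shell does not expand it.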
>>> > On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >
>>> > no i havent used it with anything but 1.0.1 and 0.9.x .
>>> >
>>> > on a side note, I just have changed my employer. It is one of these big
>>> > guys that make it very difficult to do any contributions. So I am not
>>> > sure how much of anything i will be able to share/contribute.
>>> >
>>> > On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> >
>>> >> But unless you have the time to devote to errors avoid it. I’ve built
>>> >> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
>>> >> missing class errors. The 1.x branch seems to have some kind of
>>> >> peculiar build order dependencies. The errors sometimes don’t show up
>>> >> until runtime, passing all build tests.
>>> >>
>>> >> Dmitriy, have you successfully used any Spark version other than
>>> >> 1.0.1 on a cluster? If so do you recall the exact order and from what
>>> >> sources you built?
>>> >>
>>> >>
>>> >> On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >>
>>> >> You can't use spark client of one version and have the backend of
>>> >> another. You can try to change spark dependency in mahout poms to match
>>> >> your backend (or vice versa, you can change your backend to match what's
>>> >> on the client).
>>> >>
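The InvalidClassException quoted further down in this thread reports two different serialVersionUID values for org.apache.spark.rdd.RDD. When a Serializable class does not declare an explicit serialVersionUID, the JVM derives one from the class's structure, so two differently built versions of the same class can disagree. The value a given classpath produces can be printed with the standard java.io.ObjectStreamClass API. This is a minimal sketch: SerialUidCheck is a hypothetical name, and to inspect RDD you would need the Spark jar in question (and whatever it depends on) on the classpath.

```java
import java.io.ObjectStreamClass;

/** Hypothetical helper; prints the serialVersionUID the local classpath yields. */
final class SerialUidCheck {

    /** Returns the declared or computed serialVersionUID of a Serializable class. */
    static long uidOf(Class<?> cls) {
        // lookup() returns null when the class is not Serializable.
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        if (desc == null) {
            throw new IllegalArgumentException(cls.getName() + " is not Serializable");
        }
        return desc.getSerialVersionUID();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults to a JDK class so the sketch runs without Spark present.
        String name = args.length > 0 ? args[0] : "java.util.ArrayList";
        System.out.println(name + " serialVersionUID = " + uidOf(Class.forName(name)));
    }
}
```

Running it once with the client-side Spark jar and once with the jar installed on the cluster, then comparing the two numbers with those in the exception message, would indicate which side carries which build.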
>>> >> On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <balijamahesh.mca@gmail.com> wrote:
>>> >>
>>> >>> Hi All,
>>> >>>
>>> >>> Here are the errors I get when I run in pseudo-distributed mode,
>>> >>>
>>> >>> Spark 1.0.2 and Mahout latest code (Clone)
>>> >>>
>>> >>> When I run this command from the page
>>> >>> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>>> >>>
>>> >>> val drmX = drmData(::, 0 until 4)
>>> >>>
>>> >>> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class
>>> >>> incompatible: stream classdesc serialVersionUID = 385418487991259089,
>>> >>> local class serialVersionUID = -6766554341038829528
>>> >>>     at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>> >>>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>> >>>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>> >>>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>> >>>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>> >>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>> >>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>> >>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>> >>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>> >>>     at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>> >>>     at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>> >>>     at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>> >>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>> >>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>> >>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>> >>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>> >>>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>> >>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>> >>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>> >>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>>     at java.lang.Thread.run(Thread.java:701)
>>> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>>> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>>> >>> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>>> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
>>> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>>> >>> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>>> >>> org.apache.spark.SparkException: Job aborted due to stage failure:
>>> >>> Task 0.0:0 failed 4 times, most recent failure: Exception failure in
>>> >>> TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException:
>>> >>> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc
>>> >>> serialVersionUID = 385418487991259089, local class serialVersionUID =
>>> >>> -6766554341038829528
>>> >>>     java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>>> >>>     java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>> >>>     java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>> >>>     java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>>> >>>     java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>>> >>>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>>> >>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>> >>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>> >>>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>> >>>     org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>>> >>>     org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>>> >>>     java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>>> >>>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>>> >>>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>>> >>>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>>> >>>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>>> >>>     org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>>> >>>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>>> >>>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>> >>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> >>>     java.lang.Thread.run(Thread.java:701)
>>> >>> Driver stacktrace:
>>> >>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>> >>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>> >>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>> >>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> >>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> >>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>> >>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>> >>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>> >>>     at scala.Option.foreach(Option.scala:236)
>>> >>>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>> >>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>> >>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>> >>>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>> >>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>> >>>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>> >>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>> >>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> >>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> >>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> >>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> >>>
>>> >>> Best,
>>> >>> Mahesh Balija.
>>> >>>
>>> >>> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >>>
>>> >>>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> >>>>
>>> >>>>> Is anyone else nervous about ignoring this issue or relying on
>>> >>>>> non-build (hand run) test driven transitive dependency checking? I hope
>>> >>>>> someone else will chime in.
>>> >>>>>
>>> >>>>> As to running unit tests on a TEST_MASTER I’ll look into it. Can we set
>>> >>>>> up the build machine to do this? I’d feel better about eyeballing deps if
>>> >>>>> we could have a TEST_MASTER automatically run during builds at Apache.
>>> >>>>> Maybe the regular unit tests are OK for building locally ourselves.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> >>>>>>
>>> >>>>>>> Maybe a more fundamental issue is that we don’t know for sure whether
>>> >>>>>>> we have missing classes or not. The job.jar at least used the pom
>>> >>>>>>> dependencies to guarantee every needed class was present. So the
>>> >>>>>>> job.jar seems to solve the problem but may ship some unnecessary
>>> >>>>>>> duplicate code, right?
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>> No, as i wrote spark doesn't work with job jar format. Neither, as it
>>> >>>>>> turns out, does more recent hadoop MR btw.
>>> >>>>>
>>> >>>>> Not speaking literally of the format. Spark understands jars and maven
>>> >>>>> can build one from transitive dependencies.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> Yes, this is A LOT of duplicate code (will take normally MINUTES to
>>> >>>>>> startup tasks with all of it just on copy time). This is absolutely not
>>> >>>>>> the way to go with this.
>>> >>>>>>
>>> >>>>>
>>> >>>>> Lack of guarantee to load seems like a bigger problem than startup
>>> >>>>> time. Clearly we can’t just ignore this.
>>> >>>>>
>>> >>>>
>>> >>>> Nope. given highly iterative nature and dynamic task allocation in this
>>> >>>> environment, one is looking to effects similar to Map Reduce. This is
>>> >>>> not the only reason why I never go to MR anymore, but that's one of the
>>> >>>> main ones.
>>> >>>>
>>> >>>> How about experiment: why don't you create assembly that copies ALL
>>> >>>> transitive dependencies in one folder, and then try to broadcast it from
>>> >>>> single point (front end) to well... let's start with 20 machines. (of
>>> >>>> course we ideally want to get into 10^3 ..10^4 range -- but why bother
>>> >>>> if we can't do it for 20).
>>> >>>>
>>> >>>> Or, heck, let's try to simply parallel-copy it between two machines 20
>>> >>>> times that are not collocated on the same subnet.
>>> >>>>
>>> >>>>
>>> >>>>>>
>>> >>>>>>> There may be any number of bugs waiting for the time we try running
>>> >>>>>>> on a node machine that doesn’t have some class in its classpath.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> No. Assuming any given method is tested on all its execution paths,
>>> >>>>>> there will be no bugs. The bugs of that sort will only appear if the
>>> >>>>>> user is using algebra directly and calls something that is not on the
>>> >>>>>> path, from the closure. In which case our answer to this is the same as
>>> >>>>>> for the solver methodology developers -- use customized SparkConf while
>>> >>>>>> creating context to include stuff you really want.
>>> >>>>>>
>>> >>>>>> Also another right answer to this is that we probably should reasonably
>>> >>>>>> provide the toolset here. For example, all the stats stuff found in R
>>> >>>>>> base and R stat packages so the user is not compelled to go non-native.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>> Huh? this is not true. The one I ran into was found by calling
>>> >>>>> something in math from something in math-scala. It led outside and you
>>> >>>>> can encounter such things even in algebra. In fact you have no idea if
>>> >>>>> these problems exist except for the fact you have used it a lot
>>> >>>>> personally.
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> You ran it with your own code that never existed before.
>>> >>>>
>>> >>>> But there's difference between released Mahout code (which is what you
>>> >>>> are working on) and the user code. Released code must run thru remote
>>> >>>> tests as you suggested and thus guarantee there are no such problems with
>>> >>>> post release code.
>>> >>>>
>>> >>>> For users, we only can provide a way for them to load stuff that they
>>> >>>> decide to use. We don't have apriori knowledge what they will use. It is
>>> >>>> the same thing that spark does, and the same thing that MR does, doesn't
>>> >>>> it?
>>> >>>>
>>> >>>> Of course mahout should drop rigorously the stuff it doesn't load, from
>>> >>>> the scala scope. No argue about that. In fact that's what i suggested as
>>> >>>> #1 solution. But there's nothing much to do here but to go dependency
>>> >>>> cleansing for math and spark code. Part of the reason there's so much is
>>> >>>> because newer modules still bring in everything from mrLegacy.
>>> >>>>
>>> >>>> You are right in saying it is hard to guess what else dependencies are
>>> >>>> in the util/legacy code that are actually used. but that's not a
>>> >>>> justification for brute force "copy them all" approach that virtually
>>> >>>> guarantees ruining one of the foremost legacy issues this work intended
>>> >>>> to address.
>>> >>>>
