predictionio-user mailing list archives

From Vaghawan Ojha <vaghawan...@gmail.com>
Subject Re: Not able to train data
Date Thu, 26 Oct 2017 09:10:27 GMT
Hi Abhimanyu,

I've never tried the classification template, so I'm not sure exactly how
long training should take. But going by your error, the job is not getting
any further than stage 1: "Task 0 in stage 1.0 failed 1 times".

It probably has something to do with OOMs:
https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own

Did you see this?
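
For the local-mode run, one thing that might be worth trying is setting the
heartbeat interval alongside the network timeout. This is only a sketch,
assuming pio train forwards everything after "--" to spark-submit, as your
commands already rely on. The memory and timeout values are illustrative
assumptions, not tuned recommendations. In local mode the executor runs
inside the driver JVM, so --driver-memory is the setting that sizes the heap,
and the Spark configuration docs ask that spark.executor.heartbeatInterval
stay well below spark.network.timeout. The script only prints the command so
it can be reviewed before running:

```shell
#!/bin/sh
# Sketch only: assembles a local-mode `pio train` command and prints it.
# All values below are illustrative assumptions, not tuned recommendations.

DRIVER_MEMORY="120G"       # local mode: the executor shares the driver JVM heap
NETWORK_TIMEOUT="600s"     # spark.network.timeout
HEARTBEAT_INTERVAL="60s"   # spark.executor.heartbeatInterval, kept well below the timeout

# Everything after the bare "--" is handed to spark-submit.
TRAIN_CMD="pio train -- --driver-memory $DRIVER_MEMORY --conf spark.network.timeout=$NETWORK_TIMEOUT --conf spark.executor.heartbeatInterval=$HEARTBEAT_INTERVAL"

# Print rather than execute, so the flags can be checked first.
echo "$TRAIN_CMD"
```

If the driver is still lost with settings like these, that would point more
towards memory pressure (long GC pauses or OOM) than towards the timeout
values themselves.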

On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <
abhimanyunagrath@gmail.com> wrote:

> Hi Vaghawan,
>
> For debugging, I made one change: I reduced the number of features to 1,
> keeping the record count the same at 1 million. The hardware is 240 GB RAM,
> 32 cores and 100 GB swap, and training has now been running for 2 hours.
> Is this expected behavior? On which factors does the training time depend?
>
>
> Regards,
> Abhimanyu
>
>
> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <
> abhimanyunagrath@gmail.com> wrote:
>
>> Hi Vaghawan,
>>
>> I have made the template compatible with the version mentioned above:
>> I changed the versions in engine.json and changed the package names.
>>
>>
>> Regards,
>> Abhimanyu
>>
>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <vaghawan781@gmail.com>
>> wrote:
>>
>>> Hi Abhimanyu,
>>>
>>> I don't think this template works with version 0.11.0. As per the
>>> template :
>>>
>>> update for PredictionIO 0.9.2, including:
>>>
>>> I don't think it supports the latest pio. You would be better off
>>> switching to 0.9.2 if you want to experiment with it.
>>>
>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <
>>> abhimanyunagrath@gmail.com> wrote:
>>>
>>>> Hi Vaghawan ,
>>>>
>>>> I am using v0.11.0-incubating with (ES - v5.2.1 , Hbase - 1.2.6 , Spark
>>>> - 2.1.0).
>>>>
>>>> Regards,
>>>> Abhimanyu
>>>>
>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <vaghawan781@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Abhimanyu,
>>>>>
>>>>> Ok, which version of pio is this? Because the template looks old to
>>>>> me.
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <
>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>
>>>>>> Hi Vaghawan,
>>>>>>
>>>>>> Yes, the Spark master connection string is correct. I am getting the
>>>>>> "executor failed to connect to Spark master" error after 4-5 hours.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <
>>>>>> sachinkamkar@gmail.com> wrote:
>>>>>>
>>>>>>> It should be correct, as the user got the exception after 3-4 hours
>>>>>>> of starting. So looks like something else broke. OOM?
>>>>>>>
>>>>>>> With Regards,
>>>>>>>
>>>>>>>      Sachin
>>>>>>> ⚜KTBFFH⚜
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <
>>>>>>> vaghawan781@gmail.com> wrote:
>>>>>>>
>>>>>>>> "Executor failed to connect with master": are you sure the
>>>>>>>> --master spark://*.*.*.*:7077 is correct?
>>>>>>>>
>>>>>>>> Like the one you copied from the Spark master's web UI? Sometimes
>>>>>>>> having that wrong causes the connection to the Spark master to fail.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <
>>>>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>>
>>>>>>>>> My training dataset has 1184603 records with approx. 6500 features.
>>>>>>>>> I am using an EC2 r4.8xlarge system (240 GB RAM, 32 cores, 200 GB swap).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I tried two ways of training:
>>>>>>>>>
>>>>>>>>>  1. Command:
>>>>>>>>>
>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf
>>>>>>>>> > spark.network.timeout=10000000
>>>>>>>>>
>>>>>>>>>   It throws an exception after 3-4 hours:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>     Driver stacktrace:
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>             at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>             at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>             at scala.Option.foreach(Option.scala:257)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>             at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>             at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>             at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>             at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>             at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>             at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>             at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>             at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>             at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>             at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>             at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>             at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>             at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>             at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>
>>>>>>>>>  2. I started a Spark standalone cluster with 1 master and 3 workers
>>>>>>>>> and executed the command:
>>>>>>>>>
>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>>> > --executor-memory 50G
>>>>>>>>>
>>>>>>>>> After some time I get the error "Executor failed to connect with
>>>>>>>>> master" and training stops.
>>>>>>>>>
>>>>>>>>> I have reduced the feature count from 6500 to 500 and the result is
>>>>>>>>> still the same. Can anyone suggest what I am missing?
>>>>>>>>>
>>>>>>>>> Also, during training I get continuous warnings like:
>>>>>>>>>
>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Abhimanyu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
