predictionio-user mailing list archives

From Abhimanyu Nagrath <abhimanyunagr...@gmail.com>
Subject Re: Not able to train data
Date Thu, 26 Oct 2017 08:12:31 GMT
Hi Vaghawan,

For debugging I made one change: I reduced the number of features to
1, with the record count the same at 1 million and the hardware the same
(240 GB RAM, 32 cores, 100 GB swap), and training has still been running
for 2 hours. Is this expected behavior? On which factors does the
training time depend?
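For rough intuition on scale, a back-of-the-envelope sketch of the original dataset (the 8-bytes-per-double figure assumes dense storage, which may not match the template's actual representation):

```scala
// Back-of-the-envelope sizing for the original dataset:
// ~1,184,603 records x 6,500 features, stored as 8-byte doubles.
object TrainingSizeSketch {
  def main(args: Array[String]): Unit = {
    val records  = 1184603L
    val features = 6500L

    // Raw dense-matrix footprint, before any JVM object overhead.
    val bytes = records * features * 8L
    val gib   = bytes.toDouble / (1L << 30)
    println(f"dense matrix: $gib%.1f GiB") // ~57.4 GiB

    // Each LBFGS gradient pass touches every cell once, so per-iteration
    // cost grows with records * features.
    val cellsPerPass = records * features
    println(s"cells per gradient pass: $cellsPerPass") // ~7.7e9
  }
}
```

This suggests training time scales with both record count and feature count, so cutting features from 6500 to 1 should shrink each gradient pass dramatically; if it does not, the bottleneck is likely elsewhere (data loading, GC, or shuffle).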


Regards,
Abhimanyu


On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <
abhimanyunagrath@gmail.com> wrote:

> Hi Vaghawan,
>
> I have made that template compatible with the version mentioned
> above. Changed versions of engine.json and changed packages name.
>
>
> Regards,
> Abhimanyu
>
> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <vaghawan781@gmail.com>
> wrote:
>
>> Hi Abhimanyu,
>>
>> I don't think this template works with version 0.11.0. As per the
>> template:
>>
>> update for PredictionIO 0.9.2, including:
>>
>> I don't think it supports the latest pio. You should rather switch to
>> 0.9.2 if you want to experiment with it.
>>
>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <
>> abhimanyunagrath@gmail.com> wrote:
>>
>>> Hi Vaghawan ,
>>>
>>> I am using v0.11.0-incubating with (ES v5.2.1, HBase 1.2.6, Spark
>>> 2.1.0).
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <vaghawan781@gmail.com>
>>> wrote:
>>>
>>>> Hi Abhimanyu,
>>>>
>>>> Ok, which version of pio is this? Because the template looks old to me.
>>>>
>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <
>>>> abhimanyunagrath@gmail.com> wrote:
>>>>
>>>>> Hi Vaghawan,
>>>>>
>>>>> Yes, the spark master connection string is correct. I am getting
>>>>> "executor fails to connect to spark master" after 4-5 hrs.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Abhimanyu
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <
>>>>> sachinkamkar@gmail.com> wrote:
>>>>>
>>>>>> It should be correct, as the user got the exception after 3-4 hours
>>>>>> of starting. So it looks like something else broke. OOM?
>>>>>>
>>>>>> With Regards,
>>>>>>
>>>>>>      Sachin
>>>>>> ⚜KTBFFH⚜
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <
>>>>>> vaghawan781@gmail.com> wrote:
>>>>>>
>>>>>>> "Executor failed to connect with master ", are you sure the --master
>>>>>>> spark://*.*.*.*:7077 is correct?
>>>>>>>
>>>>>>> Like the one you copied from the Spark master's web UI? Sometimes
>>>>>>> having that wrong makes the connection to the Spark master fail.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <
>>>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>
>>>>>>>> My training dataset count is 1184603 with approx 6500 features. I
>>>>>>>> am using an ec2 r4.8xlarge system (240 GB RAM, 32 Cores, 200 GB
>>>>>>>> Swap).
>>>>>>>>
>>>>>>>>
>>>>>>>> I tried two ways of training:
>>>>>>>>
>>>>>>>> 1. The command
>>>>>>>>
>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf
>>>>>>>> > spark.network.timeout=10000000
>>>>>>>>
>>>>>>>> It throws an exception after 3-4 hours:
>>>>>>>>
>>>>>>>>
>>>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>> Driver stacktrace:
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>         at scala.Option.foreach(Option.scala:257)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>         at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>         at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>         at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>         at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>         at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>         at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>         at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>         at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>         at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>         at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>         at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>         at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>         at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>         at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>         at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
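A sketch of one common mitigation for a heartbeat timeout like the one above, assuming standard Spark properties (`spark.network.timeout` and `spark.executor.heartbeatInterval` are both real Spark 2.x settings; the values here are illustrative, not tuned):

```shell
# Sketch only: explicit units ("s") avoid ambiguity in timeout values,
# and the heartbeat interval must stay well below the network timeout.
pio train -- --driver-memory 120G --executor-memory 100G \
  --conf spark.network.timeout=800s \
  --conf spark.executor.heartbeatInterval=60s
```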
>>>>>>>>
>>>>>>>> 2. I started a Spark standalone cluster with 1 master and 3
>>>>>>>> workers and executed the command
>>>>>>>>
>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>> > --executor-memory 50G
>>>>>>>>
>>>>>>>> After some time I get the error "Executor failed to connect with
>>>>>>>> master" and training stops.
>>>>>>>>
>>>>>>>> I have changed the feature count from 6500 -> 500 and the
>>>>>>>> condition is still the same. Can anyone suggest what I am missing?
>>>>>>>>
>>>>>>>> In between, training emits continuous warnings like:
>>>>>>>>
>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Abhimanyu
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
