predictionio-user mailing list archives

From Donald Szeto <don...@apache.org>
Subject Re: Not able to train data
Date Thu, 26 Oct 2017 17:40:45 GMT
What is your HBase deployment like? Is it running on a single machine, or
clustered? Is it configured to use the embedded ZooKeeper? A single-machine
deployment with the embedded ZooKeeper is known to be a pretty unstable
combination.
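
If you want to quickly confirm whether ZooKeeper and HBase are reachable from
the machine running pio, something along these lines is usually enough
(assuming ZooKeeper is listening on localhost:2181, as your log suggests):

    # ZooKeeper health check; a healthy server replies "imok"
    echo ruok | nc localhost 2181

    # Checks that PredictionIO can reach its configured storage backends
    pio status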

If the input data that you are testing with already exists as flat files on
HDFS or S3, I'd recommend modifying your template's data source to read
directly from those files for your proof of concept.
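
As a rough sketch (untested, and assuming a simple "label,feature1,feature2,..."
CSV layout, which may not match your files), the template's data source could
build the labeled points straight from HDFS instead of going through the event
store:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Read training rows directly from HDFS/S3, bypassing HBase for the PoC.
    def readTrainingFromFile(sc: SparkContext, path: String): RDD[LabeledPoint] =
      sc.textFile(path).map { line =>
        val fields = line.split(',').map(_.toDouble)
        LabeledPoint(fields.head, Vectors.dense(fields.tail))
      }

    // e.g. readTrainingFromFile(sc, "hdfs:///path/to/training-data.csv")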

On Thu, Oct 26, 2017 at 10:27 AM, Abhimanyu Nagrath <
abhimanyunagrath@gmail.com> wrote:

> Hi Donald,
>
> Checked pio.log and found the following error while training :
>
> 1. ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher [main] - hconnection-0x21325036, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
>
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
>
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
>         at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>         at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:852)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:657)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:409)
>         at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:388)
>         at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:269)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2338)
>         at org.apache.predictionio.data.storage.hbase.StorageClient.<init>(StorageClient.scala:53)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:223)
>         at org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:254)
>         at org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
>         at org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
>         at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
>         at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
>         at org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:215)
>         at org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:284)
>         at org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:269)
>         at org.apache.predictionio.data.storage.Storage$.getLEvents(Storage.scala:417)
>         at org.apache.predictionio.data.api.EventServer$.createEventServer(EventServer.scala:616)
>         at org.apache.predictionio.tools.commands.Management$.eventserver(Management.scala:77)
>         at org.apache.predictionio.tools.console.Pio$.eventserver(Pio.scala:113)
>         at org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:650)
>         at org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:611)
>         at scala.Option.map(Option.scala:146)
>         at org.apache.predictionio.tools.console.Console$.main(Console.scala:611)
>
> /2017-10-25
>
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>         at java.lang.Thread.run(Thread.java:748)
>
>
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
>         at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:29990)
>         at org.apache.hadoop.hbase.client.ScannerCallable.close(ScannerCallable.java:291)
>         at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:160)
>         at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:59)
>         at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
>         at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
>         at org.apache.hadoop.hbase.client.ClientScanner.close(ClientScanner.java:450)
>         at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.restart(TableRecordReaderImpl.java:88)
>         at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:229)
>         at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>         at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> 2. ERROR org.apache.spark.scheduler.TaskSchedulerImpl [dispatcher-event-loop-26] - Lost executor driver on localhost: Executor heartbeat timed out after 181529 ms
>
> 3. 2017-10-25 22:46:19,747 ERROR org.apache.spark.storage.BlockManager [driver-heartbeater] - Failed to report rdd_8_0 to master; giving up.
>
> 4. ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@5621bf01)
>
> Are these due to resource limitations?
>
>
>
> On Thu, Oct 26, 2017 at 9:30 PM, Donald Szeto <donald@apache.org> wrote:
>
>> Hi Abhimanyu,
>>
>> Is there more information in the Spark web UI, or in pio.log on the machine
>> where you run the `pio train` command? Also, sharing your full modifications
>> somewhere on GitHub would be very helpful.
>>
>> Regards,
>> Donald
>>
>> On Thu, Oct 26, 2017 at 2:22 AM Abhimanyu Nagrath <
>> abhimanyunagrath@gmail.com> wrote:
>>
>>> Hi Vaghawan,
>>> Thanks for the reply. Yes, I already tried that, but I'm still getting the
>>> same error.
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 2:40 PM, Vaghawan Ojha <vaghawan781@gmail.com>
>>> wrote:
>>>
>>>> Hi Abhimanyu,
>>>>
>>>> I've never tried the classification template, so I'm not sure exactly how
>>>> much time it would take. But as per your error, your model is not getting
>>>> past stage 1: "Task 0 in stage 1.0 failed 1 times".
>>>>
>>>> Probably something to do with OOMs. Did you see this?
>>>> https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own
>>>>
>>>> On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <
>>>> abhimanyunagrath@gmail.com> wrote:
>>>>
>>>>> Hi Vaghawan,
>>>>>
>>>>> For debugging I made just one change: I reduced the number of features to
>>>>> 1, with the record count staying the same at 1 million, on the same hardware
>>>>> (240 GB RAM, 32 cores and 100 GB swap), and training has still been running
>>>>> for 2 hours. Is this expected behavior? Which factors does the training time
>>>>> depend on?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Abhimanyu
>>>>>
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <
>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>
>>>>>> Hi Vaghawan,
>>>>>>
>>>>>> I have made that template compatible with the version mentioned above. I
>>>>>> changed the versions and engine.json, and changed the package names.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <
>>>>>> vaghawan781@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Abhimanyu,
>>>>>>>
>>>>>>> I don't think this template works with version 0.11.0. As per the
>>>>>>> template:
>>>>>>>
>>>>>>> update for PredictionIO 0.9.2, including:
>>>>>>>
>>>>>>> I don't think it supports the latest pio. You should rather switch to
>>>>>>> 0.9.2 if you want to experiment with it.
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <
>>>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vaghawan ,
>>>>>>>>
>>>>>>>> I am using v0.11.0-incubating with ES v5.2.1, HBase 1.2.6, and
>>>>>>>> Spark 2.1.0.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Abhimanyu
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <
>>>>>>>> vaghawan781@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Abhimanyu,
>>>>>>>>>
>>>>>>>>> OK, which version of pio is this? Because the template looks old to me.
>>>>>>>>>
>>>>>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <
>>>>>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vaghawan,
>>>>>>>>>>
>>>>>>>>>> Yes, the Spark master connection string is correct. I am getting
>>>>>>>>>> "executor failed to connect to Spark master" after 4-5 hrs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Abhimanyu
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <
>>>>>>>>>> sachinkamkar@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> It should be correct, as the user got the exception after 3-4 hours of
>>>>>>>>>>> starting. So it looks like something else broke. OOM?
>>>>>>>>>>>
>>>>>>>>>>> With Regards,
>>>>>>>>>>>
>>>>>>>>>>>      Sachin
>>>>>>>>>>> ⚜KTBFFH⚜
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha
<
>>>>>>>>>>> vaghawan781@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> "Executor failed to connect with master": are you sure the --master
>>>>>>>>>>>> spark://*.*.*.*:7077 is correct?
>>>>>>>>>>>>
>>>>>>>>>>>> Like the one you copied from the Spark master's web UI? Sometimes having
>>>>>>>>>>>> that wrong makes it fail to connect with the Spark master.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu
Nagrath <
>>>>>>>>>>>> abhimanyunagrath@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My training dataset has 1184603 records with approx. 6500 features. I am
>>>>>>>>>>>>> using an EC2 r4.8xlarge system (240 GB RAM, 32 cores, 200 GB swap).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried two ways of training.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Command:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf
>>>>>>>>>>>>> > spark.network.timeout=10000000
>>>>>>>>>>>>>
>>>>>>>>>>>>>    It's throwing an exception after 3-4 hours.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>>>>>     Driver stacktrace:
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>>>>>             at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>>>>>             at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>>>>             at scala.Option.foreach(Option.scala:257)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>>>>>             at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>>>>>             at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>>>>>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>>>>>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>>>>>             at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>>>>>             at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>>>>>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>>>>>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>>>>>             at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>>>>>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>>>>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>>>>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>>>>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>>>>             at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>>>>>             at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>>>>>             at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>>>>>             at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>>>>>             at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>>>>>             at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>>>>>             at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>>>>>             at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>>>>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>>>>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>>>>             at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. I started a Spark standalone cluster with 1 master and 3 workers and
>>>>>>>>>>>>> executed the command
>>>>>>>>>>>>>
>>>>>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>>>>>>> > --executor-memory 50G
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some time I get the error "Executor failed to connect with master"
>>>>>>>>>>>>> and training stops.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have changed the feature count from 6500 to 500 and the situation is
>>>>>>>>>>>>> still the same. Can anyone suggest whether I am missing something?
>>>>>>>>>>>>>
>>>>>>>>>>>>> In between, training keeps producing continuous warnings like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Abhimanyu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
