From: Pat Ferrel <pat@occamsmachete.com>
Subject: Re: Error when importing data
Date: Thu, 3 Aug 2017 08:32:02 -0700
To: user@predictionio.incubator.apache.org

It should be easy to try a smaller batch of data first, since we are just guessing.
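
(A minimal sketch of importing a smaller batch first; it assumes the events file is newline-delimited JSON with one event per line, and the 100,000-line sample size is arbitrary:)

# take the first 100k events as a smaller test batch (hypothetical sample size)
head -n 100000 my_events.json > sample_events.json
pio import --appid 4 --input sample_events.json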


On Aug 2, 2017, at 11:22 PM, Carlos Vidal <carlos.vidal@beeva.com> wrote:

Hello Mahesh, Pat

Thanks for your answers. I will try with a bigger EC2 instance.

Carlos.

2017-08-02 18:42 GMT+02:00 Pat Ferrel <pat@occamsmachete.com>:
Actually memory may be your problem. Mahesh Hegde may be right about trying smaller sets. Since it sounds like you have all services running on one machine, they may be in contention for resources.


On Aug 2, 2017, at 9:35 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

Something is not configured correctly; `pio import` should work with any size of file, but this may be an undersized instance for that much data.

Spark needs memory: it keeps all the data it needs for a particular calculation spread across all cluster machines in memory. That includes derived data, so a total of 32 GB may not be enough. But that is not your current problem.
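
(If memory does turn out to be the limit, a hedged sketch of raising it: this assumes the stock `pio` CLI, which forwards arguments after `--` to spark-submit as `pio train` does; the 8g values are illustrative, not a recommendation:)

pio import --appid 4 --input my_events.json -- --driver-memory 8g --executor-memory 8g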

I would start by verifying that all components are working properly, starting with HDFS, then HBase, then Spark, then Elasticsearch. I see several storage backend errors below.
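
(A minimal sketch of such checks, assuming everything runs locally on default ports and the usual client tools are on the PATH:)

hdfs dfsadmin -report                                # HDFS: datanodes alive, space available
echo "status" | hbase shell                          # HBase: master and region servers responding
jps                                                  # JVMs: HMaster, HRegionServer, NameNode, etc. actually running
curl http://127.0.0.1:9200/_cluster/health?pretty    # Elasticsearch: cluster health should be green or yellow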



On Aug 2, 2017, at 4:52 AM, Carlos Vidal <carlos.vidal@beeva.com> wrote:

Hello,

I have installed the pio + ur AMI in AWS, on an m4.2xlarge instance with 32 GB of RAM and 8 vCPUs.

When I try to import a 20 GB events file for my application, the system crashes. The command I have used is:


pio import --appid 4 --input my_events.json

This command launches a Spark job that needs to perform 800 tasks. When the process reaches task 211, it crashes. This is what I can see in my pio.log file:

2017-08-02 11:16:17,101 WARN  org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation [htable-pool230-t1] - Encountered problems when prefetch hbase:meta table:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Wed Aug 02 11:07:06 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:07 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:07 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:08 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:10 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:07:14 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:07:24 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:07:34 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:07:44 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:07:54 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:08:15 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:08:35 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:08:55 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:09:15 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:09:35 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:09:55 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:10:15 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:10:35 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:10:55 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:11:15 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:11:35 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:11:55 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:12:15 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:12:35 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:12:56 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:13:16 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:13:36 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:13:56 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:14:16 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:14:36 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:14:56 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:15:16 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:15:36 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:15:56 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
Wed Aug 02 11:16:17 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused

at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:129)
at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:714)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1153)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1217)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1105)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1062)
at org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:365)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:507)
at org.apache.hadoop.hbase.client.AsyncProcess.logAndResubmit(AsyncProcess.java:717)
at org.apache.hadoop.hbase.client.AsyncProcess.receiveGlobalFailure(AsyncProcess.java:664)
at org.apache.hadoop.hbase.client.AsyncProcess.access$100(AsyncProcess.java:93)
at org.apache.hadoop.hbase.client.AsyncProcess$1.run(AsyncProcess.java:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:578)
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:868)
at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:29966)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1508)
at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:710)
at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:708)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
... 17 more
2017-08-02 11:21:04,430 ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@66c4a5d2)
2017-08-02 11:21:04,431 ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1501672864431,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
2017-08-02 11:28:47,129 INFO  org.apache.predictionio.tools.commands.Management$ [main] - Inspecting PredictionIO...
2017-08-02 11:28:47,132 INFO  org.apache.predictionio.tools.commands.Management$ [main] - PredictionIO 0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
2017-08-02 11:28:47,132 INFO  org.apache.predictionio.tools.commands.Management$ [main] - Inspecting Apache Spark...
2017-08-02 11:28:47,142 INFO  org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark is installed at /usr/local/spark
2017-08-02 11:28:47,175 INFO  org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark 1.6.3 detected (meets minimum requirement of 1.3.0)
2017-08-02 11:28:47,175 INFO  org.apache.predictionio.tools.commands.Management$ [main] - Inspecting storage backend connections...
2017-08-02 11:28:47,195 INFO  org.apache.predictionio.data.storage.Storage$ [main] - Verifying Meta Data Backend (Source: ELASTICSEARCH)...
2017-08-02 11:28:48,225 INFO  org.apache.predictionio.data.storage.Storage$ [main] - Verifying Model Data Backend (Source: HDFS)...
2017-08-02 11:28:48,447 INFO  org.apache.predictionio.data.storage.Storage$ [main] - Verifying Event Data Backend (Source: HBASE)...
2017-08-02 11:28:48,979 INFO  org.apache.predictionio.data.storage.Storage$ [main] - Test writing to Event Store (App Id 0)...
2017-08-02 11:29:49,026 ERROR org.apache.predictionio.tools.commands.Management$ [main] - Unable to connect to all storage backends successfully.






On the other hand, once this happens, if I run `pio status` this is what I obtain:

aml@ip-10-41-11-227:~$ pio status
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at /usr/local/spark
[INFO] [Management$] Apache Spark 1.6.3 detected (meets minimum requirement of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

Failed after attempts=1, exceptions:
Wed Aug 02 11:45:04 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@43045f9f, java.net.SocketTimeoutException: Call to localhost/127.0.0.1:39562 failed because java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:51462 remote=localhost/127.0.0.1:39562]
(org.apache.hadoop.hbase.client.RetriesExhaustedException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOSTS -> 127.0.0.1, TYPE -> elasticsearch, CLUSTERNAME -> elasticsearch
Source Name: HBASE; Type: hbase; Configuration: TYPE -> hbase
Source Name: HDFS; Type: hdfs; Configuration: TYPE -> hdfs, PATH -> /models
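
(For reference, these dumped keys should map one-to-one onto PIO_STORAGE_SOURCES_* entries in pio-env.sh; a sketch of the equivalent configuration, assuming the standard variable naming:)

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=127.0.0.1
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=elasticsearch
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=/models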

Do you know what the problem is? How can I restart the services once the system fails?
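
(One hedged way to restart everything, assuming the AMI uses the stock PredictionIO layout whose bin/ directory provides pio-start-all and pio-stop-all for Elasticsearch, HBase, and the event server:)

pio-stop-all     # stop the event server, HBase, and Elasticsearch
pio-start-all    # bring the services back up in order
pio status       # verify the storage backends connect again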

Thanks.

Carlos Vidal.



