Return-Path: X-Original-To: apmail-accumulo-user-archive@www.apache.org Delivered-To: apmail-accumulo-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 255CEDF77 for ; Thu, 16 Aug 2012 11:22:38 +0000 (UTC) Received: (qmail 74191 invoked by uid 500); 16 Aug 2012 11:22:37 -0000 Delivered-To: apmail-accumulo-user-archive@accumulo.apache.org Received: (qmail 74053 invoked by uid 500); 16 Aug 2012 11:22:36 -0000 Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@accumulo.apache.org Delivered-To: mailing list user@accumulo.apache.org Received: (qmail 74008 invoked by uid 99); 16 Aug 2012 11:22:35 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 16 Aug 2012 11:22:35 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of klucar@gmail.com designates 74.125.82.169 as permitted sender) Received: from [74.125.82.169] (HELO mail-we0-f169.google.com) (74.125.82.169) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 16 Aug 2012 11:22:28 +0000 Received: by weys10 with SMTP id s10so1907035wey.0 for ; Thu, 16 Aug 2012 04:22:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=references:from:in-reply-to:mime-version:date:message-id:subject:to :content-type; bh=6B0Und/eHXjz9jeLpYsbuyrbQwExkmFaC2scRe/y94U=; b=WOZ1G/MKzRvofcItfdlXwE5h+WdmBqqHoN9WGCqSHl2HIF5j5nN6sqfTjVZa1m5AHS 84LtXPQVAFPvXJfmuCPh3Z3P1mv6leXKJMVjzgsKfhb9n7pld1QXZHL3FgFlC9RIZQld ULzF3/bwj7Zvn3MciMTQqJ6Bzt2rI4zpeZrHegF3PI0TjkCXsnnMIapNNl+HSQw6JvJg TQKEn+JVkXaIThaRy8Hvimmu1iWl83CsUUSqRZCVBpLLC5EUkfg7VNgGTcKbzgGBtZiS t2phLW8VJYpt+/3594FmN2GKm3GYbG9QkvsSItL7jfSkIIPASAvZR0FnCZ1n9TfzVqHl V8Fg== Received: by 10.217.3.1 with SMTP id q1mr464735wes.38.1345116127251; Thu, 16 Aug 2012 04:22:07 -0700 (PDT) References: From: Jim Klucar In-Reply-To: Mime-Version: 1.0 (1.0) Date: Thu, 16 Aug 2012 07:22:03 -0400 Message-ID: <-1712857504089528880@unknownmsgid> Subject: Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection To: "user@accumulo.apache.org" Content-Type: multipart/alternative; boundary=20cf302079f4f24dd304c76040e1 X-Virus-Checked: Checked by ClamAV on apache.org --20cf302079f4f24dd304c76040e1 Content-Type: text/plain; charset=ISO-8859-1 Just shooting from the hip here. Zookeeper maxclientcxns in zoo.cfg should be increased from the default to something like 100. Check the zookeeper log file to see if it is shutting down connections. Check your what your max open files setting is for your OS with 'ulimit -n' and increase it if necessary. Sent from my iPhone On Aug 16, 2012, at 4:00 AM, Arjumand Bonhomme wrote: Hello, I'm fairly new to both Accumulo and Hadoop, so I think my problem may be due to poor configuration on my part, but I'm running out of ideas. I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from cdh3u4) in pseudo-distributed mode. zookeeper version zookeeper-3.3.5 from cdh3u4 I'm using the 1.4.1 release of accumulo with a configuration copied from "conf/examples/512MB/standalone" I've got a Map task that is using an accumulo table as the input. I'm fetching all rows, but just a single column family, that has hundreds or even thousands of different column qualifiers. The table has a SummingCombiner installed for the given the column family. The task runs fine at first, but after ~9-15K records (I print the record count to the console every 1K records), it hangs and the following messages are printed to the console where I'm running the job: 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x1392cc35b460d1c, likely server has closed socket, closing socket connection and attempting reconnect 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket connection 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost sessionTimeout=30000 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection established to localhost/0:0:0:0:0:0:0:1:2181, initiating session 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1392cc35b460d25, negotiated timeout = 30000 12/08/16 02:57:11 INFO mapred.LocalJobRunner: 12/08/16 02:57:14 INFO mapred.LocalJobRunner: 12/08/16 02:57:17 INFO mapred.LocalJobRunner: Sometimes the messages contain a stacktrace like this below: 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing socket connection and attempting reconnect java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198) at sun.nio.ch.IOUtil.read(IOUtil.java:166) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154) 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket connection 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost sessionTimeout=30000 watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/fe80:0:0:0:0:0:0:1%1:2181 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid = 0x1392cc35b460b46, negotiated timeout = 30000 I've poked through the logs in accumulo, and I've noticed that when it hangs, the following is written to the "logger_HOSTNAME.debug.log" file: 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper registration at null 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down I've noticed that if I make the map task print out the record count more frequently (ie every 10 records), it seems to be able get through more records than when I only print every 1K records. My assumption was that this had something to do with more time being spent in the map task, and not fetching data from accumulo. There was at least one occasion where I printed to the console for every record, and in that situation it managed to process 47K records, although I have been unable to repeat that behavior. I've also noticed that if I stop and start accumulo, the map-reduce job will pickup where it left off, but seems to fail quicker. Could someone make some suggestions as to what my problem might be? It would be greatly appreciated. If you need any additional information from me, just let me know. I'd paste my config files, driver setup, and example data into this post, but I think it's probably long enough already. Thanks in advance, -Arjumand --20cf302079f4f24dd304c76040e1 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Just shooting from the hi= p here.

Zookeeper maxclientcxns in zoo.cfg should = be increased from the default to something like 100. Check the zookeeper lo= g file to see if it is shutting down connections.

Check your what your max open files setting is for your= OS with 'ulimit -n' and increase it if necessary.





Sent from my iPhone

On Aug 16, 2012, at 4:00 AM, Arjumand Bonhomme <jumand@gmail.com> wrote:

Hello,

I'm fairly n= ew to both Accumulo and Hadoop, so I think my problem may be due to poor co= nfiguration on my part, but I'm running out of ideas.

I'm running this on a mac laptop, with hadoop (hado= op-0.20.2 from cdh3u4) in pseudo-distributed mode.
zookeeper version=A0zookeeper-3.3.5 from cdh3u4
I'm usin= g the 1.4.1 release of accumulo with a configuration copied from "conf= /examples/512MB/standalone"

I've got a Map task that is using an accumulo table= as the input.
I'm fetching all rows, but just a single colum= n family, that has hundreds or even thousands of different column qualifier= s.
The table has a SummingCombiner installed for the given the column fam= ily.

The task runs fine at first, but after ~9-15K= records (I print the record count to the console every 1K records), it han= gs and the following messages are printed to the console where I'm runn= ing the job:
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read addit= ional data from server sessionid 0x1392cc35b460d1c, likely server has close= d socket, closing socket connection and attempting reconnect
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to s= erver localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 02:57:08 INFO = zookeeper.ClientCnxn: Socket connection established to localhost/fe80:0:0:0= :0:0:0:1%1:2181, initiating session
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to Zo= oKeeper service, session 0x1392cc35b460d1c has expired, closing socket conn= ection
12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread s= hut down
12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connecti= on, connectString=3Dlocalhost sessionTimeout=3D30000 watcher=3Dorg.apache.a= ccumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
12/08/= 16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to server = localhost/0:0:0:0:0:0:0:1:2181
12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection establi= shed to localhost/0:0:0:0:0:0:0:1:2181, initiating session
12/08/= 16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment complete on se= rver localhost/0:0:0:0:0:0:0:1:2181, sessionid =3D 0x1392cc35b460d25, negot= iated timeout =3D 30000
12/08/16 02:57:11 INFO mapred.LocalJobRunner:=A0
12/08/16 02= :57:14 INFO mapred.LocalJobRunner:=A0
12/08/16 02:57:17 INFO mapr= ed.LocalJobRunner:=A0

Sometimes the messages= contain a stacktrace like this below:
12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b4= 60b40 for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, clo= sing socket connection and attempting reconnect
java.io.IOExcepti= on: Connection reset by peer
at su= n.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher= .read(SocketDispatcher.java:21)
at su= n.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.= read(IOUtil.java:166)
at su= n.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at org.apache.z= ookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
at or= g.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to= server localhost/127.0.0.1:2181
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection establi= shed to localhost/127.0.0.1:2181, ini= tiating session
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unab= le to reconnect to ZooKeeper service, session 0x1392cc35b460b40 has expired= , closing socket connection
12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connec= tion, connectString=3Dlocalhost sessionTimeout=3D30000 watcher=3Dorg.apache= .accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection= to server localhost/fe80:0:0:0:0:0:0:1%1:2181
12/08/16 01:57:41 = INFO zookeeper.ClientCnxn: Socket connection established to localhost/fe80:= 0:0:0:0:0:0:1%1:2181, initiating session
12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment com= plete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =3D 0x1392cc= 35b460b46, negotiated timeout =3D 30000


I've poked through the logs in accumulo, and I've notice= d that when it hangs, the following is written to the "logger_HOSTNAME= .debug.log" file:
16 03:29:46,332 [logger.LogService] D= EBUG: event null None Disconnected
16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state o= f current session : Expired
16 03:29:47,248 [logger.LogService] D= EBUG: event null None Expired
16 03:29:47,249 [logger.LogService]= WARN : Logger lost zookeeper registration at null
16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
<= div>16 03:29:47,453 [logger.LogWriter] INFO : Shutting down


I've noticed that if I make the map task= print out the record count more frequently (ie every 10 records), it seems= to be able get through more records than when I only print every 1K record= s. My assumption was that this had something to do with more time being spe= nt in the map task, and not fetching data from accumulo. =A0There was at le= ast one occasion where I printed to the console for every record, and in th= at situation it managed to process 47K records, although I have been unable= to repeat that behavior.

I've also noticed that if I stop and start accumulo= , the map-reduce job will pickup where it left off, but seems to fail quick= er.



Could someone ma= ke some suggestions as to what my problem might be? It would be greatly app= reciated. =A0If you need any additional information from me, just let me kn= ow. =A0I'd paste my config files, driver setup, and example data into t= his post, but I think it's probably long enough already.


Thanks in advance,
-Arjumand

--20cf302079f4f24dd304c76040e1--