accumulo-user mailing list archives

From William Slacum <wilhelm.von.cl...@accumulo.net>
Subject Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Date Thu, 16 Aug 2012 11:24:07 GMT
What does your TServer debug log say? Also, are you writing back out to
Accumulo?

To follow up on what Jim said, you can check the ZooKeeper log to see whether
the maximum number of client connections (maxClientCnxns) is being hit. You
may also want to check what max xceivers (dfs.datanode.max.xcievers) is set
to for HDFS, and check your Accumulo and HDFS logs to see whether that limit
is mentioned.
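
A minimal sketch of that log check, in Python. The two message patterns are assumptions about what ZooKeeper 3.3.x and the HDFS DataNode typically log when maxClientCnxns or dfs.datanode.max.xcievers is exceeded; adjust them to whatever your versions actually print. The function names are hypothetical, not part of any Accumulo or Hadoop tooling.

```python
import re

# ASSUMPTION: these patterns approximate the messages logged when the
# ZooKeeper per-client connection limit or the HDFS xceiver limit is hit.
# Verify them against your own logs before relying on the result.
PATTERNS = {
    "zookeeper_max_connections": re.compile(
        r"Too many connections from \S+ - max is \d+"
    ),
    "hdfs_max_xceivers": re.compile(
        r"exceeds the limit of concurrent xcievers"
    ),
}


def scan_lines(lines):
    """Return {label: [matching lines]} for every pattern that matched."""
    hits = {}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.rstrip("\n"))
    return hits


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)
```

Run scan_file() over the ZooKeeper log, the DataNode log, and the Accumulo logs in turn. An empty result only means these particular messages were not found, not that the limits are set appropriately.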

On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <jumand@gmail.com> wrote:

> Hello,
>
> I'm fairly new to both Accumulo and Hadoop, so I think my problem may be
> due to poor configuration on my part, but I'm running out of ideas.
>
> I'm running this on a Mac laptop, with Hadoop (hadoop-0.20.2 from cdh3u4)
> in pseudo-distributed mode.
> The ZooKeeper version is zookeeper-3.3.5 from cdh3u4.
> I'm using the 1.4.1 release of Accumulo with a configuration copied from
> "conf/examples/512MB/standalone".
>
> I've got a Map task that uses an Accumulo table as its input.
> I'm fetching all rows, but just a single column family, which has hundreds
> or even thousands of different column qualifiers.
> The table has a SummingCombiner installed for that column family.
>
> The task runs fine at first, but after ~9-15K records (I print the record
> count to the console every 1K records), it hangs and the following messages
> are printed to the console where I'm running the job:
> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional
> data from server sessionid 0x1392cc35b460d1c, likely server has closed
> socket, closing socket connection and attempting reconnect
> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to
> server localhost/fe80:0:0:0:0:0:0:1%1:2181
> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection established
> to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
> connection
> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection,
> connectString=localhost sessionTimeout=30000
> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to
> server localhost/0:0:0:0:0:0:0:1:2181
> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection established
> to localhost/0:0:0:0:0:0:0:1:2181, initiating session
> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment
> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid =
> 0x1392cc35b460d25, negotiated timeout = 30000
> 12/08/16 02:57:11 INFO mapred.LocalJobRunner:
> 12/08/16 02:57:14 INFO mapred.LocalJobRunner:
> 12/08/16 02:57:17 INFO mapred.LocalJobRunner:
>
> Sometimes the messages contain a stack trace like the one below:
> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40 for
> server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
> socket connection and attempting reconnect
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:166)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>     at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to
> server localhost/127.0.0.1:2181
> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection established
> to localhost/127.0.0.1:2181, initiating session
> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
> connection
> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection,
> connectString=localhost sessionTimeout=30000
> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to
> server localhost/fe80:0:0:0:0:0:0:1%1:2181
> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection established
> to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
> 0x1392cc35b460b46, negotiated timeout = 30000
>
>
> I've poked through the logs in accumulo, and I've noticed that when it
> hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of
> current session : Expired
> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper
> registration at null
> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>
>
> I've noticed that if I make the map task print out the record count more
> frequently (i.e., every 10 records), it seems to be able to get through
> more records than when I only print every 1K records. My assumption was
> that this had something to do with more time being spent in the map task,
> and less time fetching data from Accumulo.  There was at least one occasion
> where I printed to the console for every record, and in that situation it
> managed to process 47K records, although I have been unable to repeat that
> behavior.
>
> I've also noticed that if I stop and start Accumulo, the map-reduce job
> will pick up where it left off, but seems to fail sooner.
>
>
>
> Could someone make some suggestions as to what my problem might be? It
> would be greatly appreciated.  If you need any additional information from
> me, just let me know.  I'd paste my config files, driver setup, and example
> data into this post, but I think it's probably long enough already.
>
>
> Thanks in advance,
> -Arjumand
>
>
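
The session expirations in the quoted console output can be pulled out with their timestamps, which makes it easier to line them up against the Accumulo and ZooKeeper logs. This is a rough sketch, assuming each log entry sits on a single line (the entries above are wrapped by the mail client) and starts with a yy/MM/dd HH:mm:ss timestamp as shown in the excerpt. The function name is hypothetical.

```python
import re
from datetime import datetime

# ASSUMPTION: one log entry per line, in the format shown in the quoted
# console output, e.g.
# "12/08/16 02:57:08 INFO zookeeper.ClientCnxn: ... session 0x... has expired, ..."
EXPIRED = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*session (0x[0-9a-f]+) has expired"
)


def expired_sessions(lines):
    """Return a list of (timestamp, session_id) for each expiry message."""
    events = []
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            events.append((when, match.group(2)))
    return events
```

With a negotiated timeout of 30000 ms, an expiry means the ZooKeeper server did not hear from that session for roughly 30 seconds, so the interesting log entries are the ones in the half minute before each returned timestamp.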
