accumulo-user mailing list archives

From Arjumand Bonhomme <jum...@gmail.com>
Subject Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Date Thu, 16 Aug 2012 18:36:01 GMT
Jim, William, Adam: thanks for your help with this; I'm running out of
ideas to search for.

I'll try to answer your questions:

1) I had already raised maxClientCnxns in zoo.cfg to 100, and I didn't see
anything in the log indicating that connections were being shut down
because a connection limit had been reached.  However, throughout the
ZooKeeper logs, even before the Hadoop job was run, I did see lots of
lines like the one below.  Up to this point I've assumed it's innocuous
and unrelated to my issue:
INFO  [Thread-371:NIOServerCnxn@1435] - Closed socket connection for client
/127.0.0.1:59158 (no session established for client)
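For reference, this is the connection-limit setting as I have it in
zoo.cfg (just this one line; all my other settings are unchanged from the
example config):

```
# zoo.cfg -- cap on concurrent connections from a single client IP
# (0 disables the cap; the default in this ZooKeeper version is fairly low)
maxClientCnxns=100
```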

2) I had also already set the dfs.datanode.max.xcievers property to 4096
in hdfs-site.xml.  The logs show that this limit is not being reached now.
It had been reached at some point before I raised the value from the
default; I made that change while troubleshooting, prior to posting to
this list, but it had no noticeable effect on the behavior of the Hadoop
job.
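For completeness, the property as it appears in my hdfs-site.xml (the
misspelling "xcievers" is the actual property name in this Hadoop
version):

```
<!-- hdfs-site.xml: upper bound on concurrent DataNode transfer threads -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```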

3) I'm writing out to a sequence file, so Accumulo is only being used for
input.  As a side note, at one point during my troubleshooting I
compacted, cloned, and then took the cloned table offline and tried to use
that instead.  That failed immediately, without processing any records.
From the stacktrace, it appeared the iterator was trying to use one of the
files from the original table (from what I understand about cloning, this
is normal because no changes had been made to the original table) but
reported that the file did not exist.  I was, however, able to find the
file on HDFS, so I gave up on that approach.  Also, in the normal case,
i.e. using the original table online, nothing is writing to the table
while the Hadoop job is running.

4) My original open file limit was the OS default of 256, so I raised it
to 1024 and tried again.  The behavior was the same as before.
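In case it matters, this is roughly how I checked and raised the limit in
the shell that launches the servers (a per-shell change only; making it
permanent differs by OS, and 1024 is just the value I picked):

```shell
# Show the current soft limit on open file descriptors for this shell
ulimit -n
# Raise the soft limit for this shell and its children
ulimit -n 1024
# Confirm the new limit took effect
ulimit -n
```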

I'm including a snippet from the tserver debug log.  It looks like an
expired session might be the root of the problem, but I'm not sure what
would cause that:
16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess
127.0.0.1:4971945 in 0.006s, at=[0 0 0.00 1] ft=0.004s(pt=0.001s
lt=0.002s ct=0.001s)
16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01)
secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728)
totalmem=132,055,040
16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1]
16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE minor
compaction chosen
16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: COMPACTING
!0<;~  total = 120,840 ingestMemory = 120,840
16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem =
2,416 chosenIT = 300.23 load 3,044
16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs
16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor compaction
/accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~
16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE
compactionThreshold = 0.605 maxObserved = 120,840
16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER
compactionThreshold = 0.666
16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read |
4 written |  4,800 entries/sec |  0.005 secs
16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted:
!0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c
16 13:47:51,956 [log.TabletServerLogger] DEBUG:  wrote MinC finish  35:
writeTime:1ms
16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] ->
/default_tablet/F000051i.rf
16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs
!0<;~
16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess
127.0.0.1:5148043 in 0.009s, at=[0 0 0.00 1] ft=0.005s(pt=0.001s
lt=0.003s ct=0.001s)
16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB
(20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0,
Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss
Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0
16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032),
Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0,
Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%,
Evicted/Run=NaN, Duplicate Reads=1
16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs,
wait 0.00 secs
16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00 secs,
wait 0.00 secs
16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0<
(NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
/table_info/A000051j.rf_tmp
16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<< (NORMAL)
[/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
/root_tablet/A000051k.rf_tmp
16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39 read
| 21 written |    276 entries/sec |  0.141 secs
16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56 read
| 51 written |    323 entries/sec |  0.173 secs
16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs
16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC
[/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
/root_tablet/A000051k.rf
16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for extent
!0<;~ as alias 2
16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs
16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC
[/table_info/A000051d.rf, /table_info/F000051h.rf] -->
/table_info/A000051j.rf
16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02)
secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536)
totalmem=132,055,040
16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess
127.0.0.1:51480146 in 0.034s, at=[0 0 0.00 1] ft=0.013s(pt=0.001s
lt=0.008s ct=0.004s)
16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1]
16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01)
secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904)
totalmem=132,055,040
16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
ranges:1)
16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
ranges:1)
16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid
127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected
16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired
16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server lock
(reason = SESSION_EXPIRED), exiting.


I'm also including a snippet from the ZooKeeper log.  Around the same time
that the tserver log shows the loss of the tablet server lock due to an
expired session, the ZooKeeper log mentions the expiration of several
sessions due to timeout:
2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn$StatCommand@1153]
- Stat command output
2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn@1435] - Closed
socket connection for client /127.0.0.1:58633 (no session established for
client)
2012-08-16 13:48:08,371 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
connection from /127.0.0.1:59158
2012-08-16 13:48:08,372 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /
127.0.0.1:59158
2012-08-16 13:48:08,378 - INFO  [Thread-371:NIOServerCnxn$StatCommand@1153]
- Stat command output
2012-08-16 13:48:08,379 - INFO  [Thread-371:NIOServerCnxn@1435] - Closed
socket connection for client /127.0.0.1:59158 (no session established for
client)
2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
connection from /127.0.0.1:59635
2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /
127.0.0.1:59635
2012-08-16 13:48:09,399 - INFO  [Thread-372:NIOServerCnxn$StatCommand@1153]
- Stat command output
2012-08-16 13:48:09,400 - INFO  [Thread-372:NIOServerCnxn@1435] - Closed
socket connection for client /127.0.0.1:59635 (no session established for
client)
2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded
2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded
2012-08-16 13:48:32,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded
2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80005
2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80004
2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80001
2012-08-16 13:48:32,003 - INFO  [SyncThread:0:NIOServerCnxn@1435] - Closed
socket connection for client /127.0.0.1:65339 which had sessionid
0x1393086dfb80005
2012-08-16 13:48:34,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded
2012-08-16 13:48:36,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded
2012-08-16 13:48:38,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded
2012-08-16 13:48:38,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded
2012-08-16 13:48:40,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded
2012-08-16 13:48:44,456 - INFO  [SyncThread:0:NIOServerCnxn@1435] - Closed
socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had
sessionid 0x1393086dfb80004
2012-08-16 13:48:44,458 - INFO  [SyncThread:0:NIOServerCnxn@1435] - Closed
socket connection for client /127.0.0.1:65316 which had sessionid
0x1393086dfb80001
2012-08-16 13:48:44,461 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80002
2012-08-16 13:48:44,482 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80007
2012-08-16 13:48:44,484 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of
session 0x1393086dfb80003 due to java.io.IOException: Connection reset by
peer
2012-08-16 13:48:44,486 - INFO  [SyncThread:0:NIOServerCnxn@1435] - Closed
socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had
sessionid 0x1393086dfb80002
2012-08-16 13:48:44,491 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb8004d
2012-08-16 13:48:44,493 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80006
2012-08-16 13:48:44,495 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003
2012-08-16 13:48:44,495 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
read additional data from client sessionid 0x1393086dfb80007, likely client
has closed socket
2012-08-16 13:48:44,496 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007
2012-08-16 13:48:44,496 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
read additional data from client sessionid 0x1393086dfb8004d, likely client
has closed socket
2012-08-16 13:48:44,497 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d
2012-08-16 13:48:44,498 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
- Processed session termination for sessionid: 0x1393086dfb80003
2012-08-16 13:48:44,500 - INFO  [SyncThread:0:NIOServerCnxn@1435] - Closed
socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had
sessionid 0x1393086dfb80006
2012-08-16 13:48:44,613 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
connection from /0:0:0:0:0:0:0:1%0:59789
2012-08-16 13:48:44,615 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789
2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session
0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired
2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001


Sorry for the huge post of log drivel, but if you'd like to see more I can
post the full logs (here or via pastebin).  I made sure to shut everything
down, wipe the logs, and then start everything back up so the logs would
be as clean as possible.


I looked for a way to increase the session timeout through
AccumuloInputFormat, but there doesn't seem to be one.  I'm sure that's
not the right fix anyway, but I figured I'd mention it.
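If raising the timeout server-side ever turns out to be worth trying, my
understanding (an assumption on my part, not something I've verified in
this setup) is that ZooKeeper clamps the negotiated session timeout
between minSessionTimeout and maxSessionTimeout in zoo.cfg, which default
to 2x and 20x tickTime:

```
# zoo.cfg -- bounds on the session timeout a client may negotiate
# (defaults: minSessionTimeout = 2*tickTime, maxSessionTimeout = 20*tickTime)
tickTime=2000
maxSessionTimeout=60000
```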


Do you guys have any other suggestions?  I'm sure this is due to some
configuration error on my part, but I'm running out of ideas.


Thanks a ton,
-Arjumand


On Thu, Aug 16, 2012 at 7:24 AM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> What does your TServer debug log say? Also, are you writing back out to
> Accumulo?
>
> To follow up what Jim said, you can check the zookeeper log to see if max
> connections is being hit. You may also want to check and see what your max
> xceivers is set to for HDFS and check your Accumulo and HDFS logs to see if
> it is mentioned.
>
>
> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <jumand@gmail.com> wrote:
>
>> Hello,
>>
>> I'm fairly new to both Accumulo and Hadoop, so I think my problem may be
>> due to poor configuration on my part, but I'm running out of ideas.
>>
>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from cdh3u4)
>> in pseudo-distributed mode.
>> zookeeper version zookeeper-3.3.5 from cdh3u4
>> I'm using the 1.4.1 release of accumulo with a configuration copied from
>> "conf/examples/512MB/standalone"
>>
>> I've got a Map task that is using an accumulo table as the input.
>> I'm fetching all rows, but just a single column family, that has hundreds
>> or even thousands of different column qualifiers.
>> The table has a SummingCombiner installed for the given the column family.
>>
>> The task runs fine at first, but after ~9-15K records (I print the record
>> count to the console every 1K records), it hangs and the following messages
>> are printed to the console where I'm running the job:
>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional
>> data from server sessionid 0x1392cc35b460d1c, likely server has closed
>> socket, closing socket connection and attempting reconnect
>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection to
>> server localhost/fe80:0:0:0:0:0:0:1%1:2181
>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection
>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
>> connection
>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client connection,
>> connectString=localhost sessionTimeout=30000
>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection to
>> server localhost/0:0:0:0:0:0:0:1:2181
>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection
>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session
>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment
>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid =
>> 0x1392cc35b460d25, negotiated timeout = 30000
>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner:
>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner:
>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner:
>>
>> Sometimes the messages contain a stacktrace like this below:
>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40
>> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
>> socket connection and attempting reconnect
>> java.io.IOException: Connection reset by peer
>>  at sun.nio.ch.FileDispatcher.read0(Native Method)
>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>> at sun.nio.ch.IOUtil.read(IOUtil.java:166)
>>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
>>  at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection to
>> server localhost/127.0.0.1:2181
>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection
>> established to localhost/127.0.0.1:2181, initiating session
>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
>> connection
>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client connection,
>> connectString=localhost sessionTimeout=30000
>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection to
>> server localhost/fe80:0:0:0:0:0:0:1%1:2181
>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection
>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
>> 0x1392cc35b460b46, negotiated timeout = 30000
>>
>>
>> I've poked through the logs in accumulo, and I've noticed that when it
>> hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of
>> current session : Expired
>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper
>> registration at null
>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>>
>>
>> I've noticed that if I make the map task print out the record count more
>> frequently (ie every 10 records), it seems to be able get through more
>> records than when I only print every 1K records. My assumption was that
>> this had something to do with more time being spent in the map task, and
>> not fetching data from accumulo.  There was at least one occasion where I
>> printed to the console for every record, and in that situation it managed
>> to process 47K records, although I have been unable to repeat that behavior.
>>
>> I've also noticed that if I stop and start accumulo, the map-reduce job
>> will pickup where it left off, but seems to fail quicker.
>>
>>
>>
>> Could someone make some suggestions as to what my problem might be? It
>> would be greatly appreciated.  If you need any additional information from
>> me, just let me know.  I'd paste my config files, driver setup, and example
>> data into this post, but I think it's probably long enough already.
>>
>>
>> Thanks in advance,
>> -Arjumand
>>
>>
>
