accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arjumand Bonhomme <jum...@gmail.com>
Subject Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Date Thu, 16 Aug 2012 19:48:24 GMT
While this isn't my original concern/problem, so it's not terribly
important, but I'm not sure I understand and I'd like to learn as much as
possible.

Why wouldn't it work with the cloned table offline?
I followed the example laid out in
org.apache.accumulo.examples.simple.mapreduce.UniqueColumns
Using the "AccumuloInputFormat.setScanOffline(job.getConfiguration(),
true);" setup.

Can you give me some detail as to why it definitely would not have worked
offline?  Is this method no longer supported.

Thanks!
-Arjumand

On Thu, Aug 16, 2012 at 3:24 PM, John Vines <vines@apache.org> wrote:

> You're work ont he clone table failed because you took it offline. The
> table needs to be online in order to do a job against it.
>
> John
>
>
> On Thu, Aug 16, 2012 at 2:36 PM, Arjumand Bonhomme <jumand@gmail.com>wrote:
>
>> Jim, William, Adam: Thanks for your help with this, I'm running out of
>> ideas to google search for answers.
>>
>> I'm going to try to answer your questions:
>>
>> 1) I had already adjusted the maxclientcxns in zoo.cfg to be 100.  I
>> didn't see anything in log indicating that it was shutting down connections
>> due to reaching a connection limit.  However, throughout the zookeeper
>> logs, even before the hadoop job was run.  I did see lots of lines like
>> these.  Up til this point, I've assumed it's innocuous and unrelated to my
>> issue.
>> INFO  [Thread-371:NIOServerCnxn@1435] - Closed socket connection for
>> client /127.0.0.1:59158 (no session established for client)
>>
>> 2) I had also already adjusted the dfs.datanode.max.xcievers property to
>> 4096 in hdfs-site.xml.  An investigation of the log shows that this limit
>> is not being reached.  It had been being reached at some point before I
>> increased the value from the default.  I had increased the value while
>> troubleshooting, but prior to posting to this list; it didn't appear to
>> have a noticeable affect on the behavior of the hadoop job.
>>
>> 3) I'm writing out to a sequence file, so accumulo is only being used for
>> input.  As a side note, at one point during my troubleshooting, I
>> compacted, cloned, and then took the cloned table offline and tried to use
>> that instead.  That failed immediately, without processing any records.
>>  From the stacktrace, it appeared as though the iterator was trying to use
>> one of the files for the original table (from what I understand about
>> cloning, this is normal b/c no changes had been made to the original table)
>> but said it did not exist.  I was, however, able to find the file on hdfs.
>>  So I just gave up on that.  Also, under the normal case, ie using the
>> original table online, nothing is writing to the table while the hadoop job
>> is running.
>>
>> 4) My original open file limit was the os default of 256.  So I upped it
>> to 1024, and performed another attempt.  The behavior was the same as
>> before.
>>
>> I'm including a snippet from the tserver debug log.  It looks like an
>> expired session might be the root of the problem, but I'm not sure what
>> would cause that:
>> 16 13:47:26,284 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49730 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 16 13:47:26,619 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:4971945 in 0.006s,
at=[0 0 0.00 1] ft=0.004s(pt=0.001s lt=0.002s ct=0.001s)
>> 16 13:47:31,317 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49740 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 16 13:47:36,350 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49750 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 16 13:47:41,377 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49760 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 16 13:47:46,278 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49771 !0 4 entries in 0.02 secs, nbTimes = [17 17 17.00 1]
>> 16 13:47:46,305 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49771 !0 2 entries in 0.01 secs, nbTimes = [14 14 14.00 1]
>> 16 13:47:46,406 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:49782 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 16 13:47:47,487 [tabletserver.TabletServer] DEBUG: gc ParNew=0.07(+0.01)
>> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=111,234,160(+14,477,728)
>> totalmem=132,055,040
>> 16 13:47:51,452 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>> 16 13:47:51,462 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50970 !0 8 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
>> 16 13:47:51,474 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50977 !0 0 entries in 0.01 secs, nbTimes = [2 2 2.00 1]
>> 16 13:47:51,477 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50970 !0 26 entries in 0.01 secs, nbTimes = [11 11 11.00 1]
>> 16 13:47:51,494 [tabletserver.LargestFirstMemoryManager] DEBUG: IDLE
>> minor compaction chosen
>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG:
>> COMPACTING !0<;~  total = 120,840 ingestMemory = 120,840
>> 16 13:47:51,495 [tabletserver.LargestFirstMemoryManager] DEBUG: chosenMem
>> = 2,416 chosenIT = 300.23 load 3,044
>> 16 13:47:51,498 [tabletserver.Tablet] DEBUG: MinC initiate lock 0.00 secs
>> 16 13:47:51,502 [tabletserver.MinorCompactor] DEBUG: Begin minor
>> compaction /accumulo/tables/!0/default_tablet/F000051i.rf_tmp !0<;~
>> 16 13:47:51,525 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>> 16 13:47:51,532 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50970 !0 1 entries in 0.00 secs, nbTimes = [3 3 3.00 1]
>> 16 13:47:51,538 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:50970 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: BEFORE
>> compactionThreshold = 0.605 maxObserved = 120,840
>> 16 13:47:51,750 [tabletserver.LargestFirstMemoryManager] DEBUG: AFTER
>> compactionThreshold = 0.666
>> 16 13:47:51,942 [tabletserver.Compactor] DEBUG: Compaction !0<;~ 24 read
>> | 4 written |  4,800 entries/sec |  0.005 secs
>> 16 13:47:51,950 [tabletserver.Tablet] DEBUG: Logs for memory compacted:
>> !0<;~ 127.0.0.1:11224/770e1b91-351a-45ac-8992-c0fb602ac51c
>>  16 13:47:51,956 [log.TabletServerLogger] DEBUG:  wrote MinC finish  35:
>> writeTime:1ms
>> 16 13:47:51,956 [tabletserver.Tablet] TABLET_HIST: !0<;~ MinC [memory] ->
>> /default_tablet/F000051i.rf
>> 16 13:47:51,957 [tabletserver.Tablet] DEBUG: MinC finish lock 0.00 secs
>> !0<;~
>> 16 13:47:52,650 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:5148043 in 0.009s,
at=[0 0 0.00 1] ft=0.005s(pt=0.001s lt=0.003s ct=0.001s)
>> 16 13:47:53,233 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
>> Total=0.0769043MB (80640), Free=19.923096MB (20890880), Max=20.0MB
>> (20971520), Counts: Blocks=33, Access=43, Hit=10, Miss=33, Evictions=0,
>> Evicted=0, Ratios: Hit Ratio=23.255814611911774%, Miss
>> Ratio=76.74418687820435%, Evicted/Run=NaN, Duplicate Reads=0
>> 16 13:47:53,254 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes:
>> Total=0.0128479MB (13472), Free=6.987152MB (7326560), Max=7.0MB (7340032),
>> Counts: Blocks=20, Access=127, Hit=106, Miss=21, Evictions=0, Evicted=0,
>> Ratios: Hit Ratio=83.4645688533783%, Miss Ratio=16.535432636737823%,
>> Evicted/Run=NaN, Duplicate Reads=1
>> 16 13:47:53,491 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00
>> secs, wait 0.00 secs
>> 16 13:47:53,493 [tabletserver.Tablet] DEBUG: MajC initiate lock 0.00
>> secs, wait 0.00 secs
>> 16 13:47:53,494 [tabletserver.Tablet] DEBUG: Starting MajC !0;~;!0<
>> (NORMAL) [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
>> /table_info/A000051j.rf_tmp
>> 16 13:47:53,495 [tabletserver.Tablet] DEBUG: Starting MajC !0;!0<<
>> (NORMAL) [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
>> /root_tablet/A000051k.rf_tmp
>> 16 13:47:53,683 [tabletserver.Compactor] DEBUG: Compaction !0;!0<< 39
>> read | 21 written |    276 entries/sec |  0.141 secs
>> 16 13:47:53,701 [tabletserver.Compactor] DEBUG: Compaction !0;~;!0< 56
>> read | 51 written |    323 entries/sec |  0.173 secs
>> 16 13:47:53,708 [tabletserver.Tablet] DEBUG: MajC finish lock 0.02 secs
>> 16 13:47:53,708 [tabletserver.Tablet] TABLET_HIST: !0;!0<< MajC
>> [/root_tablet/A000051e.rf, /root_tablet/F000051g.rf] -->
>> /root_tablet/A000051k.rf
>> 16 13:47:53,712 [tabletserver.TabletServer] INFO : Adding 1 logs for
>> extent !0<;~ as alias 2
>> 16 13:47:53,726 [tabletserver.Tablet] DEBUG: MajC finish lock 0.00 secs
>> 16 13:47:53,726 [tabletserver.Tablet] TABLET_HIST: !0;~;!0< MajC
>> [/table_info/A000051d.rf, /table_info/F000051h.rf] -->
>> /table_info/A000051j.rf
>> 16 13:47:54,497 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.02)
>> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=97,183,880(-1,627,536)
>> totalmem=132,055,040
>> 16 13:47:54,694 [tabletserver.TabletServer] DEBUG: UpSess 127.0.0.1:51480146 in 0.034s,
at=[0 0 0.00 1] ft=0.013s(pt=0.001s lt=0.008s ct=0.004s)
>> 16 13:47:56,527 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:53005 !0 0 entries in 0.02 secs, nbTimes = [15 15 15.00 1]
>> 16 13:48:01,503 [tabletserver.TabletServer] DEBUG: gc ParNew=0.09(+0.01)
>> secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=98,415,928(+12,106,904)
>> totalmem=132,055,040
>> 16 13:48:01,576 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:55688 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
>> 127.0.0.1:65352 2 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
>> ranges:1)
>> 16 13:48:01,663 [tabletserver.TabletServer] DEBUG: MultiScanSess
>> 127.0.0.1:65358 4 entries in 0.03 secs (lookup_time:0.03 secs tablets:1
>> ranges:1)
>> 16 13:48:06,620 [tabletserver.TabletServer] DEBUG: ScanSess tid
>> 127.0.0.1:58247 !0 0 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
>> 16 13:48:44,595 [zookeeper.ZooLock] DEBUG: event null None Disconnected
>> 16 13:48:44,927 [zookeeper.ZooLock] DEBUG: event null None Expired
>> 16 13:48:44,943 [tabletserver.TabletServer] FATAL: Lost tablet server
>> lock (reason = SESSION_EXPIRED), exiting.
>>
>>
>> I'm also including a snippet from the zookeeper log as well.  Around the
>> same the tserver logs show the loss of the tablet server lock due to
>> expired session, the zookeeper logs mentions the expiration of several
>> sessions due to timeout.
>> 2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn$StatCommand@1153]
>> - Stat command output
>> 2012-08-16 13:48:07,368 - INFO  [Thread-370:NIOServerCnxn@1435] - Closed
>> socket connection for client /127.0.0.1:58633 (no session established
>> for client)
>> 2012-08-16 13:48:08,371 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
>> connection from /127.0.0.1:59158
>> 2012-08-16 13:48:08,372 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /
>> 127.0.0.1:59158
>> 2012-08-16 13:48:08,378 - INFO  [Thread-371:NIOServerCnxn$StatCommand@1153]
>> - Stat command output
>> 2012-08-16 13:48:08,379 - INFO  [Thread-371:NIOServerCnxn@1435] - Closed
>> socket connection for client /127.0.0.1:59158 (no session established
>> for client)
>> 2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
>> connection from /127.0.0.1:59635
>> 2012-08-16 13:48:09,395 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1237] - Processing stat command from /
>> 127.0.0.1:59635
>> 2012-08-16 13:48:09,399 - INFO  [Thread-372:NIOServerCnxn$StatCommand@1153]
>> - Stat command output
>> 2012-08-16 13:48:09,400 - INFO  [Thread-372:NIOServerCnxn@1435] - Closed
>> socket connection for client /127.0.0.1:59635 (no session established
>> for client)
>> 2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80005, timeout of 30000ms exceeded
>> 2012-08-16 13:48:32,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80004, timeout of 30000ms exceeded
>> 2012-08-16 13:48:32,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80001, timeout of 30000ms exceeded
>> 2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80005
>> 2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80004
>> 2012-08-16 13:48:32,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80001
>> 2012-08-16 13:48:32,003 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>> Closed socket connection for client /127.0.0.1:65339 which had sessionid
>> 0x1393086dfb80005
>> 2012-08-16 13:48:34,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80002, timeout of 30000ms exceeded
>> 2012-08-16 13:48:36,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80007, timeout of 30000ms exceeded
>> 2012-08-16 13:48:38,000 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb8004d, timeout of 30000ms exceeded
>> 2012-08-16 13:48:38,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80006, timeout of 30000ms exceeded
>> 2012-08-16 13:48:40,001 - INFO  [SessionTracker:ZooKeeperServer@316] -
>> Expiring session 0x1393086dfb80003, timeout of 30000ms exceeded
>> 2012-08-16 13:48:44,456 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65332 which had
>> sessionid 0x1393086dfb80004
>> 2012-08-16 13:48:44,458 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>> Closed socket connection for client /127.0.0.1:65316 which had sessionid
>> 0x1393086dfb80001
>> 2012-08-16 13:48:44,461 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80002
>> 2012-08-16 13:48:44,482 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80007
>> 2012-08-16 13:48:44,484 - WARN  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of
>> session 0x1393086dfb80003 due to java.io.IOException: Connection reset by
>> peer
>> 2012-08-16 13:48:44,486 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65317 which had
>> sessionid 0x1393086dfb80002
>> 2012-08-16 13:48:44,491 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb8004d
>> 2012-08-16 13:48:44,493 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80006
>> 2012-08-16 13:48:44,495 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>> client /fe80:0:0:0:0:0:0:1%1:65330 which had sessionid 0x1393086dfb80003
>> 2012-08-16 13:48:44,495 - WARN  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable
>> to read additional data from client sessionid 0x1393086dfb80007, likely
>> client has closed socket
>> 2012-08-16 13:48:44,496 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>> client /0:0:0:0:0:0:0:1%0:65343 which had sessionid 0x1393086dfb80007
>> 2012-08-16 13:48:44,496 - WARN  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable
>> to read additional data from client sessionid 0x1393086dfb8004d, likely
>> client has closed socket
>> 2012-08-16 13:48:44,497 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>> client /127.0.0.1:49770 which had sessionid 0x1393086dfb8004d
>> 2012-08-16 13:48:44,498 - INFO  [ProcessThread:-1:PrepRequestProcessor@399]
>> - Processed session termination for sessionid: 0x1393086dfb80003
>> 2012-08-16 13:48:44,500 - INFO  [SyncThread:0:NIOServerCnxn@1435] -
>> Closed socket connection for client /fe80:0:0:0:0:0:0:1%1:65341 which had
>> sessionid 0x1393086dfb80006
>> 2012-08-16 13:48:44,613 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
>> connection from /0:0:0:0:0:0:0:1%0:59789
>> 2012-08-16 13:48:44,615 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
>> session 0x1393086dfb80001 at /0:0:0:0:0:0:0:1%0:59789
>> 2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1573] - Invalid session
>> 0x1393086dfb80001 for client /0:0:0:0:0:0:0:1%0:59789, probably expired
>> 2012-08-16 13:48:44,616 - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
>> client /0:0:0:0:0:0:0:1%0:59789 which had sessionid 0x1393086dfb80001
>>
>>
>> Sorry for the huge post of log drivel, but if you'd like to see more I
>> can post (or via pastebin) the full logs.  I made sure to shut down
>> everything, wipe out the logs, then start everything back up so the logs
>> would be as clean as possible.
>>
>>
>> I looked for a way to increase the session timeout on the
>> AccumuloInputFormat, but there doesn't seem to be one.  I'm sure that's not
>> the right fix anyway, but figured I'd mention it.
>>
>>
>> Do you guys have any other suggestions.  I'm sure this is due to some
>> configuration error on my part, but I'm running out of idea.
>>
>>
>> Thanks a ton,
>> -Arjumand
>>
>>
>>
>> On Thu, Aug 16, 2012 at 7:24 AM, William Slacum <
>> wilhelm.von.cloud@accumulo.net> wrote:
>>
>>> What does your TServer debug log say? Also, are you writing back out to
>>> Accumulo?
>>>
>>> To follow up what Jim said, you can check the zookeeper log to see if
>>> max connections is being hit. You may also want to check and see what your
>>> max xceivers is set to for HDFS and check your Accumulo and HDFS logs to
>>> see if it is mentioned.
>>>
>>>
>>> On Thu, Aug 16, 2012 at 3:59 AM, Arjumand Bonhomme <jumand@gmail.com>wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm fairly new to both Accumulo and Hadoop, so I think my problem may
>>>> be due to poor configuration on my part, but I'm running out of ideas.
>>>>
>>>> I'm running this on a mac laptop, with hadoop (hadoop-0.20.2 from
>>>> cdh3u4) in pseudo-distributed mode.
>>>> zookeeper version zookeeper-3.3.5 from cdh3u4
>>>> I'm using the 1.4.1 release of accumulo with a configuration copied
>>>> from "conf/examples/512MB/standalone"
>>>>
>>>> I've got a Map task that is using an accumulo table as the input.
>>>> I'm fetching all rows, but just a single column family, that has
>>>> hundreds or even thousands of different column qualifiers.
>>>> The table has a SummingCombiner installed for the given the column
>>>> family.
>>>>
>>>> The task runs fine at first, but after ~9-15K records (I print the
>>>> record count to the console every 1K records), it hangs and the following
>>>> messages are printed to the console where I'm running the job:
>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to read additional
>>>> data from server sessionid 0x1392cc35b460d1c, likely server has closed
>>>> socket, closing socket connection and attempting reconnect
>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Opening socket connection
>>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181
>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Socket connection
>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>>> ZooKeeper service, session 0x1392cc35b460d1c has expired, closing socket
>>>> connection
>>>> 12/08/16 02:57:08 INFO zookeeper.ClientCnxn: EventThread shut down
>>>> 12/08/16 02:57:10 INFO zookeeper.ZooKeeper: Initiating client
>>>> connection, connectString=localhost sessionTimeout=30000
>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@32f5c51c
>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Opening socket connection
>>>> to server localhost/0:0:0:0:0:0:0:1:2181
>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Socket connection
>>>> established to localhost/0:0:0:0:0:0:0:1:2181, initiating session
>>>> 12/08/16 02:57:10 INFO zookeeper.ClientCnxn: Session establishment
>>>> complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid =
>>>> 0x1392cc35b460d25, negotiated timeout = 30000
>>>> 12/08/16 02:57:11 INFO mapred.LocalJobRunner:
>>>> 12/08/16 02:57:14 INFO mapred.LocalJobRunner:
>>>> 12/08/16 02:57:17 INFO mapred.LocalJobRunner:
>>>>
>>>> Sometimes the messages contain a stacktrace like this below:
>>>> 12/08/16 01:57:40 WARN zookeeper.ClientCnxn: Session 0x1392cc35b460b40
>>>> for server localhost/fe80:0:0:0:0:0:0:1%1:2181, unexpected error, closing
>>>> socket connection and attempting reconnect
>>>> java.io.IOException: Connection reset by peer
>>>>  at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>>>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:166)
>>>>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
>>>> at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
>>>>  at
>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Opening socket connection
>>>> to server localhost/127.0.0.1:2181
>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Socket connection
>>>> established to localhost/127.0.0.1:2181, initiating session
>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>>> ZooKeeper service, session 0x1392cc35b460b40 has expired, closing socket
>>>> connection
>>>> 12/08/16 01:57:40 INFO zookeeper.ClientCnxn: EventThread shut down
>>>> 12/08/16 01:57:41 INFO zookeeper.ZooKeeper: Initiating client
>>>> connection, connectString=localhost sessionTimeout=30000
>>>> watcher=org.apache.accumulo.core.zookeeper.ZooSession$AccumuloWatcher@684a26e8
>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Opening socket connection
>>>> to server localhost/fe80:0:0:0:0:0:0:1%1:2181
>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Socket connection
>>>> established to localhost/fe80:0:0:0:0:0:0:1%1:2181, initiating session
>>>> 12/08/16 01:57:41 INFO zookeeper.ClientCnxn: Session establishment
>>>> complete on server localhost/fe80:0:0:0:0:0:0:1%1:2181, sessionid =
>>>> 0x1392cc35b460b46, negotiated timeout = 30000
>>>>
>>>>
>>>> I've poked through the logs in accumulo, and I've noticed that when it
>>>> hangs, the following is written to the "logger_HOSTNAME.debug.log" file:
>>>> 16 03:29:46,332 [logger.LogService] DEBUG: event null None Disconnected
>>>> 16 03:29:47,248 [zookeeper.ZooSession] DEBUG: Session expired, state of
>>>> current session : Expired
>>>> 16 03:29:47,248 [logger.LogService] DEBUG: event null None Expired
>>>> 16 03:29:47,249 [logger.LogService] WARN : Logger lost zookeeper
>>>> registration at null
>>>> 16 03:29:47,452 [logger.LogService] INFO : Logger shutting down
>>>> 16 03:29:47,453 [logger.LogWriter] INFO : Shutting down
>>>>
>>>>
>>>> I've noticed that if I make the map task print out the record count
>>>> more frequently (ie every 10 records), it seems to be able get through more
>>>> records than when I only print every 1K records. My assumption was that
>>>> this had something to do with more time being spent in the map task, and
>>>> not fetching data from accumulo.  There was at least one occasion where I
>>>> printed to the console for every record, and in that situation it managed
>>>> to process 47K records, although I have been unable to repeat that behavior.
>>>>
>>>> I've also noticed that if I stop and start accumulo, the map-reduce job
>>>> will pickup where it left off, but seems to fail quicker.
>>>>
>>>>
>>>>
>>>> Could someone make some suggestions as to what my problem might be? It
>>>> would be greatly appreciated.  If you need any additional information from
>>>> me, just let me know.  I'd paste my config files, driver setup, and example
>>>> data into this post, but I think it's probably long enough already.
>>>>
>>>>
>>>> Thanks in advance,
>>>> -Arjumand
>>>>
>>>>
>>>
>>
>

Mime
View raw message