accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Scanning with many singleton ranges?
Date Fri, 03 Apr 2015 05:52:19 GMT
Thanks! I didn't even have to ask :)

Dylan Hutchison wrote:
> Moving discussion to JIRA: ACCUMULO-3710
> <https://issues.apache.org/jira/browse/ACCUMULO-3710>
> ~Dylan
>
>
> On Fri, Apr 3, 2015 at 12:09 AM, Dylan Hutchison <dhutchis@mit.edu> wrote:
>
>     Yes, definitely OOME.  My friend Eric crashed Accumulo again and we
>     saw this in tserver_localhost.out:
>
>     #
>     # java.lang.OutOfMemoryError: Java heap space
>     # -XX:OnOutOfMemoryError="kill -9 %p"
>     #   Executing /bin/sh -c "kill -9 12833"...
>
>
>     On Thu, Apr 2, 2015 at 11:49 PM, Dylan Hutchison <dhutchis@mit.edu> wrote:
>
>         I think it is an OOME.  Here's the debug log file, showing a
>         clear descent from 189MB of free memory to 52kB before I
>         manually restarted the tserver 4 minutes later.  Looks like I
>         lost the .err files for now; I would need to reproduce the crash
>         to get them again.
>
>         2015-03-26 08:34:01,242 [tserver.TabletServer] DEBUG: gc
>         ParNew=26.24(+0.01) secs ConcurrentMarkSweep=0.13(+0.00)
>         secs *freemem=189,300,488(-330,224) *totalmem=259,522,560
>         2015-03-26 08:34:01,549 [tserver.TabletServer] DEBUG: ScanSess
>         tid 127.0.0.1:55823 6r 374,161 entries
>         in 2.98 secs, nbTimes = [1 69 3.27 375]
>         2015-03-26 08:34:01,842 [Audit   ] INFO : operation: permitted;
>         user: root; client: 127.0.0.1:55823;
>         2015-03-26 08:34:01,842 [Audit   ] INFO : operation: permitted;
>         user: root; client: 127.0.0.1:55823;
>         2015-03-26 08:34:01,844 [tserver.TabletServer] DEBUG: ScanSess
>         tid 127.0.0.1:55823 !0 5 entries in
>         0.00 secs, nbTimes = [1 1 1.00 1]
>         2015-03-26 08:34:03,034 [tserver.TabletServer] DEBUG: Got
>         getScans message from user: !SYSTEM
>         2015-03-26 08:34:03,091 [tserver.TabletServer] DEBUG:
>         MultiScanSess 127.0.0.1:38998 2 entries
>         in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
>         2015-03-26 08:34:04,507 [tserver.TabletServer] DEBUG: gc
>         ParNew=26.38(+0.14) secs ConcurrentMarkSweep=0.99(+0.86) secs
>         *freemem=44,246,264(-145,384,448) *totalmem=259,522,560
>         2015-03-26 08:34:05,963 [tserver.TabletServer] DEBUG: ScanSess
>         tid 127.0.0.1:55017 !0 0 entries in
>         0.00 secs, nbTimes = [2 2 2.00 1]
>         2015-03-26 08:34:05,966 [tserver.TabletServer] DEBUG: gc
>         ParNew=26.38(+0.00) secs ConcurrentMarkSweep=2.25(+1.26) secs
>         *freemem=6,657,016(-182,973,696) *totalmem=259,522,560
>         2015-03-26 08:34:07,549 [tserver.TabletServer] DEBUG: gc
>         ParNew=26.38(+0.00) secs ConcurrentMarkSweep=3.73(+1.48) secs
>         *freemem=439,152(-189,191,560) *totalmem=259,522,560
>         2015-03-26 08:34:08,284 [tserver.TabletServer] DEBUG: Got
>         getScans message from user: !SYSTEM
>         *2015-03-26 08:34:10,469 [tserver.TabletServer] WARN : Running
>         low on memory*
>         2015-03-26 08:34:10,470 [tserver.TabletServer] DEBUG: gc
>         ParNew=26.38(+0.00) secs ConcurrentMarkSweep=6.63(+2.90) secs
>         *freemem=52,816(-189,577,896) *totalmem=259,522,560
>         2015-03-26 08:34:14,623 [tserver.TabletServer] DEBUG: Got
>         getScans message from user: !SYSTEM
>         2015-03-26 08:34:17,382 [tserver.TabletServer] DEBUG: ScanSess
>         tid 127.0.0.1:55017 !0 0 entries in
>         5.04 secs, nbTimes = [4,972 4,972 4,972.00 1]
>         2015-03-26 08:34:24,674 [tserver.TabletServer] DEBUG: Got
>         getScans message from user: !SYSTEM
>         2015-03-26 08:34:35,716 [cache.LruBlockCache] DEBUG: Cache
>         Stats: Sizes: Total=23.286858MB (24418040), Free=6.7131424MB
>         (7039240), Max=30.0MB (31457280), Counts: Blocks=7750,
>         Access=125628, Hit=102578, Miss=23050, Evictions=25,
>         Evicted=15299, Ratios: Hit Ratio=81.65218234062195%, Miss
>         Ratio=18.34782063961029%, Evicted/Run=611.9600219726562,
>         Duplicate Reads=1
>         *2015-03-26 08:38:37,256 [server.Accumulo] INFO : tserver starting*
>
>
>
>
>
>         On Thu, Apr 2, 2015 at 6:34 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>             That seems perfectly reasonable, IMO. I'm surprised to
>             hear the tserver crashed.
>
>             Taking a quick glance at the code, it looks like this would
>             be a good place to do some optimization in the
>             BatchScanner's impl (TabletServerBatchReaderImpl). The
>             BatchScanner will bin the ranges to the tablets and the
>             servers hosting those tablets. Normally, this would be
>             spread out, but, in your single-server case, all 1M rows
>             would go to a single TabletServer in one RPC call.
>
>             I'm guessing a good optimization here would be to check the
>             size of a batch of Ranges for a single tabletserver, and
>             when above a certain threshold, split the batch in half and
>             try to reprocess each half (the recursion would naturally
>             keep splitting until we get down to some high-watermark).
>
>             Point being, if your client VM constructed the Ranges
>             without issue, the BatchScanner impl should be smart enough
>             to not knock over a TabletServer.
>
>             What was the reason the tserver died? OOME? Was there
>             anything at the end of the log files or in the .out/.err files?
>
>             - Josh
>
>
>             Dylan Hutchison wrote:
>
>                 A friend of mine has a use case where he wants to scan
>                 ~1M individual rows, scattered across a ~15GB table.
>                 He performed the following:
>
>                 1. Gather a List of Range objects, each one a singleton
>                    range spanning an entire row.
>                 2. Create a BatchScanner with one read thread.
>                 3. Set the ranges via BatchScanner.setRanges()
>                 4. Start iterating through the scanner.
>
>                 Performing these steps crashed the TabletServer for my
>                 friend (I haven't had time to verify it myself yet).
>                 We're using a single-node standalone 1.6.1 Accumulo
>                 instance.
>
>                 Is this a bad way to use Accumulo?  I advised my friend
>                 to batch the reads into groups of ~10k ranges and see
>                 if that helps.  I wanted to check with the community
>                 and see if we're doing something weird.  If this usage
>                 should have worked, I can try to put together a test
>                 case reproducing it that creates a table with many
>                 entries and then scans with many ranges.
>
>                 Thanks,
>                 Dylan Hutchison
>
>
>
>
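
The workaround Dylan mentions in the original message (batching the reads into groups of ~10k ranges) can be sketched on the client side roughly as follows. This is only an illustration under assumptions: the class name RangeBatches and the method partition are hypothetical and not part of Accumulo, plain integers stand in for Range objects so the sketch runs without a cluster, and the 10,000 figure is simply the number suggested in the thread, not a tuned value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an Accumulo class): cuts a large list of
// lookups into fixed-size groups so each scan carries a bounded request.
final class RangeBatches {
    private RangeBatches() {}

    /** Splits items into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in data: integers instead of singleton Range objects.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            rows.add(i);
        }
        List<List<Integer>> batches = partition(rows, 10_000);
        // In real use (assumption), each batch would be handed to
        // BatchScanner.setRanges(...) and the scanner fully iterated
        // before moving on to the next batch.
        for (List<Integer> batch : batches) {
            System.out.println("would scan " + batch.size() + " ranges");
        }
    }
}
```

Whether ~10k is a safe group size depends on the tserver heap and the size of each Range, so the number should be treated as a starting point to measure rather than a recommendation.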

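Josh's suggestion in the thread (check the size of a per-server batch and, when it is above a threshold, split it in half and reprocess each half) can be illustrated with a small recursive sketch. This is not the actual TabletServerBatchReaderImpl code and not necessarily what ACCUMULO-3710 ended up doing; the class name BatchSplitter, the method split, and the maxBatchSize parameter are all hypothetical, and generic list elements stand in for Range objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Accumulo code) of "split in half until each
// piece is at or below a threshold", as described in the thread.
final class BatchSplitter {
    private BatchSplitter() {}

    /** Returns sub-batches of at most maxBatchSize items, in original order. */
    static <T> List<List<T>> split(List<T> batch, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        splitInto(batch, maxBatchSize, out);
        return out;
    }

    private static <T> void splitInto(List<T> batch, int maxBatchSize,
                                      List<List<T>> out) {
        if (batch.isEmpty()) {
            return; // nothing to send
        }
        if (batch.size() <= maxBatchSize) {
            // Small enough: this piece would become one request.
            out.add(new ArrayList<>(batch));
            return;
        }
        // Too large: halve and reprocess each half. Because the size
        // exceeds maxBatchSize >= 1, both halves are non-empty, so the
        // recursion always terminates.
        int mid = batch.size() / 2;
        splitInto(batch.subList(0, mid), maxBatchSize, out);
        splitInto(batch.subList(mid, batch.size()), maxBatchSize, out);
    }
}
```

For example, calling split on 1,000,000 items with maxBatchSize set to 10,000 would yield 128 sub-batches of 7,812 or 7,813 items each. Halving does not fill each batch to the limit the way fixed-size chunking does, but it needs no bookkeeping beyond the threshold itself.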