drill-issues mailing list archives

From "John Omernik (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4723) GC Memory Allocation Issues
Date Tue, 14 Jun 2016 19:09:27 GMT
John Omernik created DRILL-4723:
-----------------------------------

             Summary: GC Memory Allocation Issues
                 Key: DRILL-4723
                 URL: https://issues.apache.org/jira/browse/DRILL-4723
             Project: Apache Drill
          Issue Type: Bug
          Components:  Server
    Affects Versions: 1.6.0
         Environment: 5 Drill bits, MapR 1.6 release. 84GB of Direct Memory, 24GB of Heap
            Reporter: John Omernik
             Fix For: Future


This issue is reposted from the Drill User List. More logs are available on request in the comments
(please specify which logs you want).

High level: the Drill cluster gets into a bad state when this occurs.


This is what I have thus far... I can provide more complete logs on a one-on-one basis.

The cluster was completely mine, with fresh logs. I ran a CTAS query on a large table that has
over 100 fields. This query works well in other cases; however, I was experimenting with the block
size, both in MapR FS and in Drill's Parquet writer. I had successfully tested 512m for each, but this case
was different. Here are the facts in this setup (a rough sketch of how the block size and CTAS might be submitted follows the list):

- No compression in MapR FS; using standard Parquet Snappy compression
- MapR FS block size: 1024m
- Parquet block size: 1024m
- The query ends up disappearing from the profiles
- The UI page listing bits shows only 4 bits even though 5 are running (the node 03 process is running,
but it no longer appears as a bit)
- Error (copied below) from the REST API
- No output in stdout or stderr on node 3. Only two nodes actually had "Parquet Writing"
logs; the other three showed no issues/errors in stdout.
- I have the standard log files, gclogs, the profile.json (before it disappeared), and the physical
plan. Only the components that looked possibly at issue are included here.
- The node 3 GC log shows a string of "Full GC (Allocation Failure)" pauses that take 4 seconds or
more (when I've seen this in other cases, that time can balloon to 8 seconds or more)
- The node 3 output log shows some very long RPC calls. Perhaps the GCs block RPC
communication and create a snowball effect?
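
For reference, here is a minimal sketch (not the actual load_day.py) of how the larger Parquet block size and the CTAS might be submitted through Drill's REST endpoint. The table and path names are made up, authentication is omitted, and the MapR FS chunk size would be set separately on the filesystem.

#!/usr/bin/env python2
# Sketch only: submit the Parquet block-size option and a CTAS to Drill's REST API.
# The endpoint is the one from the error output below; table and path names are
# illustrative, not the real ones.
import json
import requests

DRILL_URL = "https://drillprod.marathonprod.zeta:20000"

def run_query(sql):
    """POST a single SQL statement to Drill's /query.json endpoint."""
    resp = requests.post(DRILL_URL + "/query.json",
                         data=json.dumps({"queryType": "SQL", "query": sql}),
                         headers={"Content-Type": "application/json"},
                         verify=False)  # unverified TLS; source of the InsecureRequestWarning below
    resp.raise_for_status()
    return resp.json()

# store.parquet.block-size takes bytes; 1073741824 bytes = 1024m, the size under test.
# ALTER SYSTEM is used here so the setting does not depend on REST session handling.
run_query("ALTER SYSTEM SET `store.parquet.block-size` = 1073741824")
run_query("CREATE TABLE dfs.tmp.`big_table_parquet` AS SELECT * FROM dfs.`/data/big_table`")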

Other logs can be provided upon request if people are interested; I just didn't want to flood
the whole list with all of them.

Thanks!

John





REST error:
./load_day.py 2016-05-09
Drill Rest Endpoint: https://drillprod.marathonprod.zeta:20000
Day: 2016-05-09
/usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
  InsecureRequestWarning)
Authentication successful
Error encountered: 500
{
  "errorMessage" : "SYSTEM ERROR: ForemanException: One more more nodes lost connectivity
during query.  Identified nodes were [atl1ctuzeta03:20001].\n\n\n[Error Id: d7dd0120-f7c0-44ef-ac54-29c746b26354
on atl1ctuzeta01:20001"
}

Possible issue in Node3 Log:

2016-06-14 17:25:27,860 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:90] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:90: State to report: RUNNING
2016-06-14 17:25:27,871 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:70] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:70: State change requested AWAITING_ALLOCATION --> RUNNING
2016-06-14 17:25:27,871 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:70] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:70: State to report: RUNNING
2016-06-14 17:43:41,869 [BitServer-4] WARN  o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500ms.  Actual duration was 4192ms.
2016-06-14 17:45:36,720 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:0: State change requested RUNNING --> CANCELLATION_REQUESTED
2016-06-14 17:45:45,740 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:0: State to report: CANCELLATION_REQUESTED
2016-06-14 17:46:15,318 [BitServer-3] WARN  o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500ms.  Actual duration was 55328ms.
2016-06-14 17:46:36,057 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:5: State change requested RUNNING --> CANCELLATION_REQUESTED
2016-06-14 17:46:44,620 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:5: State to report: CANCELLATION_REQUESTED
2016-06-14 17:47:01,393 [BitServer-3] WARN  o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500ms.  Actual duration was 29781ms.
2016-06-14 17:47:09,463 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:10: State change requested RUNNING --> CANCELLATION_REQUESTED
2016-06-14 17:47:26,967 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:10: State to report: CANCELLATION_REQUESTED
2016-06-14 17:47:55,593 [BitServer-3] WARN  o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500ms.  Actual duration was 46130ms.
2016-06-14 17:48:04,497 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:15: State change requested RUNNING --> CANCELLATION_REQUESTED
2016-06-14 17:48:12,742 [BitServer-4] WARN  o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500ms.  Actual duration was 4236ms.
2016-06-14 17:48:42,328 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:15: State to report: CANCELLATION_REQUESTED
2016-06-14 17:49:36,351 [BitServer-4] WARN  o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500ms.  Actual duration was 4260ms.
2016-06-14 17:49:36,351 [BitServer-3] WARN  o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500ms.  Actual duration was 91854ms.
2016-06-14 17:50:35,273 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:20: State change requested RUNNING --> CANCELLATION_REQUESTED
2016-06-14 17:50:39,322 [289fc208-7266-6a81-73a1-709efff6c412:frag:1:15] INFO  o.a.d.e.w.fragment.FragmentExecutor - 289fc208-7266-6a81-73a1-709efff6c412:1:15: State change requested CANCELLATION_REQUESTED --> FAILED
2016-06-14 17:50:51,546 [CONTROL-rpc-event-queue] INFO  o.a.d.e.w.f.FragmentStatusReporter - 289fc208-7266-6a81-73a1-709efff6c412:1:20: State to report: CANCELLATION_REQUESTED
2016-06-14 17:51:36,905 [BitServer-4] WARN  o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500ms.  Actual duration was 4426ms.
2016-06-14 17:52:30,805 [BitServer-3] WARN  o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 6 took longer than 500ms.  Actual duration was 98767ms.
2016-06-14 17:52:47,042 [BitServer-4] WARN  o.a.d.exec.rpc.control.ControlClient - Message of mode RESPONSE of rpc type 1 took longer than 500ms.  Actual duration was 12041ms.
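
For anyone digging through a full drillbit.log, here is a minimal sketch (assuming the standard drillbit.log line format shown above) that pulls out the slow control-RPC warnings so their timestamps can be lined up against the GC pauses below:

#!/usr/bin/env python2
# Sketch only: list the slow control-RPC warnings from a drillbit.log so their
# timestamps can be compared with the Full GC pauses in the gclog.
import re
import sys

# Matches lines like:
# 2016-06-14 17:43:41,869 [BitServer-4] WARN  o.a.d.exec.rpc.control.ControlClient - ... Actual duration was 4192ms.
RPC_WARN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[(\S+)\] WARN .* "
                      r"Actual duration was (\d+)ms\.")

def slow_rpcs(log_path, threshold_ms=1000):
    """Yield (timestamp, thread, duration_ms) for RPC warnings at or above threshold_ms."""
    with open(log_path) as f:
        for line in f:
            m = RPC_WARN.match(line)
            if m and int(m.group(3)) >= threshold_ms:
                yield m.group(1), m.group(2), int(m.group(3))

if __name__ == "__main__":
    for ts, thread, ms in slow_rpcs(sys.argv[1]):
        print("%s  %-24s %7d ms" % (ts, thread, ms))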

Possible issues in Node3 gclog:

1819.137: [Full GC (Allocation Failure)  23G->23G(24G), 4.0657064 secs]

1823.205: [GC concurrent-mark-abort]

1823.221: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G), 0.0382934 secs]

1823.271: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 23G->23G(24G), 0.0250215 secs]

1823.296: [GC concurrent-root-region-scan-start]

1823.296: [GC concurrent-root-region-scan-end, 0.0000105 secs]

1823.296: [GC concurrent-mark-start]

1823.308: [Full GC (Allocation Failure)  23G->23G(24G), 4.3719713 secs]

1827.694: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G), 0.0312666 secs]

1827.727: [GC concurrent-mark-abort]

1827.735: [GC pause (G1 Evacuation Pause) (young) 23G->23G(24G), 0.0212359 secs]

1827.766: [Full GC (Allocation Failure)  23G->23G(24G), 3.9308980 secs]

1831.710: [GC pause (G1 Evacuation Pause) (young)-- 23G->23G(24G), 0.0298229 secs]

1831.750: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 23G->23G(24G), 0.0268410 secs]

1831.777: [GC concurrent-root-region-scan-start]

1831.777: [GC concurrent-root-region-scan-end, 0.0000212 secs]

1831.777: [GC concurrent-mark-start]

1831.789: [Full GC (Allocation Failure)  23G->23G(24G), 3.9250410 secs]
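
And a matching sketch for the gclog: it lists the Full GC (Allocation Failure) entries and their pause times. Note that gclog timestamps are seconds since JVM start, so lining them up with the wall-clock timestamps in drillbit.log needs the JVM start time, which is not handled here.

#!/usr/bin/env python2
# Sketch only: list the Full GC (Allocation Failure) pauses from a gclog. The
# "23G->23G(24G)" pattern with multi-second pauses means each collection reclaimed
# essentially nothing, i.e. the 24GB heap was effectively full.
import re
import sys

# Matches lines like:
# 1819.137: [Full GC (Allocation Failure)  23G->23G(24G), 4.0657064 secs]
FULL_GC = re.compile(r"^([\d.]+): \[Full GC \(Allocation Failure\)\s+"
                     r"(\S+)->(\S+)\((\S+)\), ([\d.]+) secs\]")

def full_gc_pauses(gclog_path):
    """Yield (uptime_secs, before, after, capacity, pause_secs) for each Full GC line."""
    with open(gclog_path) as f:
        for line in f:
            m = FULL_GC.match(line)
            if m:
                yield (float(m.group(1)), m.group(2), m.group(3),
                       m.group(4), float(m.group(5)))

if __name__ == "__main__":
    for uptime, before, after, cap, pause in full_gc_pauses(sys.argv[1]):
        print("t+%9.3fs  %s -> %s of %s  pause %.2fs" % (uptime, before, after, cap, pause))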





