drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Westin" <chriswesti...@gmail.com>
Subject Re: Review Request 31435: DRILL-2245-core: core changes for improved drillbit stability
Date Wed, 04 Mar 2015 00:21:48 GMT


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/test/java/org/apache/drill/exec/TestWithZookeeper.java, line
31
> > <https://reviews.apache.org/r/31435/diff/1/?file=876411#file876411line31>
> >
> >     why commented?  I see no downside to having an not-yet used logger in a class.

It produces a warning. I'm trying to eliminate all warnings. It's easy enough to uncomment
it the first time someone wants to use a logger in this file.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java, line 61
> > <https://reviews.apache.org/r/31435/diff/1/?file=876404#file876404line61>
> >
> >     This is a stylistic change.  I think you need to propose this on the list and
we come to a group consensus.  As it is, this is inconsistent with the whole rest of the code
base.  Please discuss on the dev list before changing this everywhere.

Done. It looks like the ayes have it.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java,
line 313
> > <https://reviews.apache.org/r/31435/diff/1/?file=876399#file876399line313>
> >
> >     No, but that is on purpose.  We keep them forever for log analysis

Once this code is put into production, and a lot of queries are issued constantly, this is
going to grow quickly. We need to have a plan for the future to manage this. Submitted https://issues.apache.org/jira/browse/DRILL-2362
.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/server/Drillbit.java, line 215
> > <https://reviews.apache.org/r/31435/diff/1/?file=876384#file876384line215>
> >
> >     Did you reorder for a sepcific reason?  Ideally we should unregister from the
coordination service before shutting things down.  In fact, we should probably also pause
before shutting things down after unregistering

I checked master, and the order has not changed. And we do unregister first, at the top of
close(). We also pause after unregistering and before shutting things down. I see no change
from master, other than the introduction of the new AutoCloseables.close() instead of try-catch
blocks.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line
867
> > <https://reviews.apache.org/r/31435/diff/1/?file=876396#file876396line867>
> >
> >     Shouldn't this be protected?

FragmentSubmitFailures doesn't have any derived classes, so I don't see how that helps. It
is used to communicate between the various FragmentSubmitListeners that are created, and setupNonRootFragments(),
so they all need to see it. We could make it private, but then we just have to add a getter
and an incrementer, so protection doesn't really add anything.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/WorkManager.java, line 343
> > <https://reviews.apache.org/r/31435/diff/1/?file=876394#file876394line343>
> >
> >     Do you make sure that Foremans no longer leak exceptions?  This was here to
ensure that we get an error rather than the thread terminating and the error showing up in
standard err.

Since we catch Exception in Foreman.run(), it looks like you can't get out without reporting
the error (as the Foreman moves to the FAILURE state).


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line
223
> > <https://reviews.apache.org/r/31435/diff/1/?file=876396#file876396line223>
> >
> >     is this guaranteed to be called or can something above prematurely end this
method via exception?

The listener deregistrations are fine, because they're just list removals. But to be safe,
I added a try-finally around the response send, just in case that takes some kind of IO or
RPC exception.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line
659
> > <https://reviews.apache.org/r/31435/diff/1/?file=876396#file876396line659>
> >
> >     Canceled should be a terminal state from the user's perspective.  Not sure that
happens here or if you manage this some other way.

Any notion of "cancelled" that the user sees is independent of this. And that doesn't appear
to be available at on the client at this time -- this was why I was unable to complete any
automated testing. I discovered when I tried to use it that QueryState never gets back to
the client, and none of our tests seem to check it. They only look for last chunk, or some
expected number of rows. I expect to do some work on this in DRILL-1802, and have added a
TODO(DRILL-1802) to the code in the state machine.

Meanwhile, the code wasn't consistent wrt the principle you're proposing, namely, that CANCELLED
is a terminal state. Some code treated it that way, and some didn't. The relevant issue is
that Foreman.cancel() immediately effects a state transition to CANCELLED, and then sends
cancellation requests to all remote fragments. When those all acknowledge this, then Foreman
moves to COMPLETED, at which point it may clean up resources associated with the query. There
were risks with the previous inconsistent handling, which immediately cleaned up, because
then cancellation acknowledgements would arrive and be delivered to apparently broken state
(via QueryManager.statusUpdate() inherited from FragmentStatusListener) because resources
may or may not have been cleaned up (depending on timing), possibly causing more problems.
Submitted DRILL-2370 for this, and added a TODO(DRILL-2370) in QueryManager.statusUpdate(),
even though this is not likely to be the site of the work that needs to be don
 e.


> On Feb. 26, 2015, 9:56 p.m., Jacques Nadeau wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/fragment/FragmentExecutor.java,
line 132
> > <https://reviews.apache.org/r/31435/diff/1/?file=876401#file876401line132>
> >
> >     We need to make sure we have everything closed before returning success.  You've
removed that functionality by eating (only logging) the exceptions.  If a query leaks anything,
I want the query to fail so it shows up in all the tests.  I'm okay disabling this functionality
in production mode but definitely want it to be the case in normal development and testing.
 If you disagree, we should discuss further because I see this a show stopper for this patch.

It's odd that you marked this line, when that was the *only* one that threw exceptions if
there were problems cleaning up. There were a few other calls to closeOutResources(false),
which wouldn't report problems. But if you want to know "If a query leaks anything,...." we
should do that consistently, even for the failure cases. Otherwise, if queries fail, but silently
leak resources, then the server eventually runs out of resources anyway.

I think it's great to report such things, but we should do them always or never. Not just
on success. Reporting during test and development is fine. I've altered closeOutResources()
so that it uses AssertionUtil.IsAssertionsEnabled() to determine whether we should report
or not, rather than leaving it up to the caller.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31435/#review74282
-----------------------------------------------------------


On March 3, 2015, 4:21 p.m., Chris Westin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31435/
> -----------------------------------------------------------
> 
> (Updated March 3, 2015, 4:21 p.m.)
> 
> 
> Review request for drill and Jacques Nadeau.
> 
> 
> Bugs: DRILL-2245
>     https://issues.apache.org/jira/browse/DRILL-2245
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> DRILL-2245-core: Clean up query setup and execution kickoff in Foreman/WorkManager in
order to ensure consistent handling, and avoid hangs and races, with the goal of improving
Drillbit robustness.
> 
>     I did my best to keep these clean when I split them up, but this core commit
>     may depend on some minor changes in the hygiene commit that is also
>     associated with this bug, so either both should be applied, or neither.
>     The core commit should be applied first.
> 
>     AutoCloseables
>     - created org.apache.drill.common.AutoCloseables to handle closing these
>       quietly
> 
>     BaseTestQuery, and derivatives
>     - factored out pieces into QueryTestUtil so they can be reused
> 
>     Drillbit
>     - uses AutoCloseables for the WorkManager and for the storeProvider
>     - allow start() to take a RemoteServiceSet
>     - private, final, formatting
> 
>     Foreman
>        - does not need to implement Comparable
>     - does not need to implement Closeable
>     - thread blocking fixes
>     - add resultSent flag
>     - add code to log plan fragments with endpoint assignments
>     - added finals, cleaned up formatting
>     - do queue management in acquireQuerySemaphore; local tests pass
>     - rename getContext() to getQueryContext()
>     - retain DrillbitContext
>     - a couple of exception injections for testing
>     - minor formatting
>     - TODOs
> 
>     FragmentExecutor
>     - eliminated CancelableQuery
>     - common subexpression elimination
>     - cleaned up
> 
>     QueryContext
>     - removed unnecessary functions (with some outside classes tweaked for this)
>     - finals, formatting
> 
>     QueryManager
>     - merge in QueryStatus
>       - affects Foreman, ../batch/ControlHandlerImpl,
>         and ../../server/rest/ProfileResources
>     - made some methods private
>     - removed unused imports
>     - add finals and formatting
>     - variable renaming to improve readability
>     - formatting
>     - comments
>     - TODOs
> 
>     QueryStatus
>     - getAsInfo() private
>     - member renaming
>     - member access changes
>     - formatting
>     - TODOs
> 
>     QueryTestUtil, BaseTestQuery, TestDrillbitResilience
>     - make maxWidth a parameter to server startup
> 
>     SelfCleaningRunnable
>     - created org.apache.drill.common.SelfCleaningRunnable
> 
>     SingleRowListener
>     - created org.apache.drill.SingleRowListener results listener
>     - use in TestDrillbitResilience
>     
>     TestDrillbitResilience
>     - created org.apache.drill.exec.server.TestDrillbitResilience to test drillb
>       resilience in the face of exceptions and failures during queries
> 
>     TestWithZookeeper
>     - factor out work into ZookeeperHelper so that it can be reused by
>       TestDrillbitResilience
> 
>     UserBitShared
>     - get rid of unused UNKNOWN_QUERY
> 
>     WorkEventBus
>     - rename methods, affects Foreman and ControlHandlerImpl
>     - remove unused WorkerBee reference
>     - most members final
>     - formatting
> 
>     WorkManager
>     - Closeable to AutoCloseable
>     - removed unused incomingFragments Set
>     - eliminated unnecessary eventThread and pendingTasks by posting Runnables
>       directly to executor
>     - use SelfCleaningRunnable for Foreman management
>     - FragmentExecutor management uses SelfCleaningRunnable
>     - runningFragments to be a ConcurrentHashMap; TestTpchDistributed passes
>     - other improvements due to bee no longer needed in various places
>     - most members final
>     - minor formatting
>     - comments
>     - TODOs
> 
>     (*) Created exception injection classes to simulate exceptions for testing
>     - ExceptionInjection
>     - ExceptionInjector
>     - ExceptionInjectionUtil
>     - TestExceptionInjection
> 
> 
> Diffs
> -----
> 
>   common/src/main/java/org/apache/drill/common/AutoCloseables.java PRE-CREATION 
>   common/src/main/java/org/apache/drill/common/SelfCleaningRunnable.java PRE-CREATION

>   exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java 5efcce8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java 04b955b

>   exec/java-exec/src/main/java/org/apache/drill/exec/coord/local/LocalClusterCoordinator.java
035c1aa 
>   exec/java-exec/src/main/java/org/apache/drill/exec/rpc/control/WorkEventBus.java d6b8637

>   exec/java-exec/src/main/java/org/apache/drill/exec/server/Drillbit.java 67342c4 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/DrillbitContext.java 83a89df

>   exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java
aa0a5ad 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java
ae04bad 
>   exec/java-exec/src/main/java/org/apache/drill/exec/testing/ExceptionInjection.java
PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/testing/ExceptionInjector.java PRE-CREATION

>   exec/java-exec/src/main/java/org/apache/drill/exec/testing/InjectionSite.java PRE-CREATION

>   exec/java-exec/src/main/java/org/apache/drill/exec/testing/SimulatedExceptions.java
PRE-CREATION 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/CancelableQuery.java 5b11943

>   exec/java-exec/src/main/java/org/apache/drill/exec/work/StatusProvider.java 6086f74

>   exec/java-exec/src/main/java/org/apache/drill/exec/work/WorkManager.java 99c6ab8 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/batch/ControlHandlerImpl.java
3228da9 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 378e81a

>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/FragmentData.java 52fd0a9

>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/FragmentStatusListener.java
6a719d2 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 2de3592

>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryStatus.java 4e18da6

>   exec/java-exec/src/main/java/org/apache/drill/exec/work/fragment/FragmentExecutor.java
7ccb64e 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/fragment/NonRootFragmentManager.java
3671804 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/fragment/RootFragmentManager.java
54fc8c4 
>   exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java cf99577 
>   exec/java-exec/src/test/java/org/apache/drill/PlanTestBase.java c52545d 
>   exec/java-exec/src/test/java/org/apache/drill/QueryTestUtil.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/SingleRowListener.java PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/TestBugFixes.java 51f3121 
>   exec/java-exec/src/test/java/org/apache/drill/TestBuilder.java 978e565 
>   exec/java-exec/src/test/java/org/apache/drill/exec/DrillSystemTestBase.java 75ba3a9

>   exec/java-exec/src/test/java/org/apache/drill/exec/TestWithZookeeper.java bb69c9a 
>   exec/java-exec/src/test/java/org/apache/drill/exec/ZookeeperHelper.java PRE-CREATION

>   exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestOptiqPlans.java
4aaaa78 
>   exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java
PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/testing/ExceptionInjectionUtil.java
PRE-CREATION 
>   exec/java-exec/src/test/java/org/apache/drill/exec/testing/TestExceptionInjection.java
PRE-CREATION 
>   protocol/src/main/protobuf/UserBitShared.proto 8f05c45 
> 
> Diff: https://reviews.apache.org/r/31435/diff/
> 
> 
> Testing
> -------
> 
> mvn install
> Functional - Passing - New
> Advanced - TPCH SF100 - Parquet
> 
> 
> Thanks,
> 
> Chris Westin
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message