db-derby-dev mailing list archives

From "A B (JIRA)" <derby-...@db.apache.org>
Subject [jira] Updated: (DERBY-1777) Regression: query works in 10.1.2.1 but fails with NullPointerException in 10.2.1.1
Date Mon, 11 Sep 2006 22:43:25 GMT
     [ http://issues.apache.org/jira/browse/DERBY-1777?page=all ]

A B updated DERBY-1777:
-----------------------

    Attachment: d1777_v1.patch

I ran the ViewerInit program attached to this issue and I hit two different NPE's.  For more
details, see "Details" section below.

The short story is that I was able to determine the cause of the NPEs and have a patch, d1777_v1.patch,
to resolve them.  There is, however, another issue that prevents the ViewerInit program from
running to completion (more on that below).  Nonetheless, d1777_v1.patch is at least a step
in the right direction as it corrects the two compile-time NPEs described below.

I ran derbyall on Red Hat Linux with ibm142 and sane jars, and I saw the following failure
in jdbcapi/secureUsers1.sql:

33a34,37
> do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
> YPBINDPROC_DOMAIN: Domain not bound
> do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
> YPBINDPROC_DOMAIN: Domain not bound
Test Failed.

When I ran the test standalone it passed against all frameworks, so I'm not sure what happened.
But in any event this does not appear to be related to my changes.

So I'm posting d1777_v1.patch for review.  Despite my efforts I haven't been able to come
up with a test case that can go into derbyall, but I'm still trying.  In the meantime, any
comments/feedback would be appreciated.

--------
Details
--------

The first NPE came from BinaryRelationalOperatorNode.getScopedOperand() and was caused by
the fact that, when scoping a predicate for pushing, Derby couldn't find the target result
column to which the scoped predicate was supposed to point.  I confirmed this by running in
SANE mode, where instead of an NPE I saw the following ASSERT FAILURE:

ERROR XJ001: Java exception: 'ASSERT FAILED Failed to locate scope target result column when
trying to scope operand 'ENTITY_TO_PORT.PORT_ID'.: org.apache.derby.shared.common.sanity.AssertFailure'.

An example query that leads to this NPE/assertion failure is as follows:

  SELECT DISTINCT

     ZONE.ZONE_ID ZONE_ID,
     PORT.PORT_ID PORT_ID,
     ENTITY_TO_PORT.TYPE,
     ENTITY_TO_PORT.PREFIX_ID,
     ENTITY_TO_PORT.ENTITY_ID,
     ENTITY_TO_PORT.DISPLAY_NAME,
     ENTITY_TO_PORT.PORT_DISPLAY_NAME,
     PORT2ZONE.MEMBER_NAME,
     PORT2ZONE.ZONE_MEMBER_ID,
     PORT.PORT_NUMBER

  FROM

     T_RES_ZONE ZONE
       left outer join
           T_VIEW_PORT2ZONE PORT2ZONE
       on
           ZONE.ZONE_ID = PORT2ZONE.ZONE_ID
     left outer join
           T_RES_PORT PORT
       on
           PORT2ZONE.PORT_ID = PORT.PORT_ID
     left outer join
           T_VIEW_ENTITY_TO_PORT ENTITY_TO_PORT
       on
           PORT2ZONE.PORT_ID = ENTITY_TO_PORT.PORT_ID
           and PORT2ZONE.ZONE_ID = ENTITY_TO_PORT.ZONE_ID,
     T_RES_FABRIC FABRIC

  WHERE

     PORT2ZONE.ZONE_ID = ZONE.ZONE_ID
     and ZONE.FABRIC_WWN = FABRIC.FABRIC_WWN
     and FABRIC.FABRIC_ID = ?

When scoping predicates for this query, we run into a situation where the target result column
corresponds to a subquery that has been flattened.  Since the process of flattening a query
leads to the creation of "redundant" result columns, we have to correctly handle the redundant
result columns in order to find the scope target column.  That said, the logic for redundant
result columns is in ColumnReference.getSourceResultSet(int[]):

        rcExpr = rc.getExpression();
        colNum[0] = getColumnNumber();

        while ((rcExpr != null) && (rcExpr instanceof ColumnReference))
        {
            colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
            rc = ((ColumnReference)rcExpr).getSource();

            /* If "rc" is redundant then that means ...
            ...
        }

The thing to note here is that the logic for handling redundant rc's is inside the "while"
loop.  This leads to an edge case that the above code won't catch: namely, if the original
"rc" as it exists BEFORE we enter the "while" loop is redundant, we'll only execute the redundancy
logic IF rcExpr is an instance of ColumnReference.  But there's no guarantee that rcExpr will
actually be a ColumnReference--and if it's not, we'll incorrectly skip the logic for handling
the redundant rc.  That in turn means we'll be unable to find the actual source result set,
and thus the method will return null, leading to the above-mentioned assertion failure/NPE.

To fix this, I made a small change to ensure that the redundancy logic always gets executed
if rc is redundant:

-        while ((rcExpr != null) && (rcExpr instanceof ColumnReference))
+        /* We have to make sure we enter this loop if rc is redundant,
+         * so that we can navigate down to the actual source result
+         * set (DERBY-1777). If rc *is* redundant, then rcExpr is not
+         * guaranteed to be a ColumnReference, so we have to check
+         * for that case inside the loop.
+         */
+        while ((rcExpr != null) &&
+            (rc.isRedundant() || (rcExpr instanceof ColumnReference)))
         {
-            colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
-            rc = ((ColumnReference)rcExpr).getSource();
+            if (rcExpr instanceof ColumnReference)
+            {
+                colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
+                rc = ((ColumnReference)rcExpr).getSource();
+            }

             /* If "rc" is redundant then that means ...
             ...
         }
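
To make the edge case concrete, here is a minimal sketch that compares only the two loop
guards.  It does not use Derby's classes: Col, ColRef and Other are hypothetical stand-ins
for ResultColumn, ColumnReference and "any other expression node", and the redundancy
handling inside the loop (elided above) is not modeled at all.

```java
public class LoopGuardSketch {
    /** Hypothetical stand-in for an expression node in the query tree. */
    interface Expr {}

    /** Hypothetical expression that is NOT a column reference. */
    static final class Other implements Expr {}

    /** Hypothetical stand-in for ColumnReference. */
    static final class ColRef implements Expr {
        final Col source;
        ColRef(Col source) { this.source = source; }
    }

    /** Hypothetical stand-in for ResultColumn. */
    static final class Col {
        final Expr expr;
        final boolean redundant;
        Col(Expr expr, boolean redundant) {
            this.expr = expr;
            this.redundant = redundant;
        }
    }

    /** Guard before the patch: enter only if rcExpr is a column reference. */
    static boolean oldGuardEntersLoop(Col rc) {
        Expr rcExpr = rc.expr;
        return (rcExpr != null) && (rcExpr instanceof ColRef);
    }

    /** Guard after the patch: a redundant rc is enough to enter the loop. */
    static boolean newGuardEntersLoop(Col rc) {
        Expr rcExpr = rc.expr;
        return (rcExpr != null) && (rc.redundant || (rcExpr instanceof ColRef));
    }

    public static void main(String[] args) {
        // A redundant rc whose expression is not a column reference.
        Col rc = new Col(new Other(), true);
        System.out.println("old guard enters loop: " + oldGuardEntersLoop(rc));
        System.out.println("new guard enters loop: " + newGuardEntersLoop(rc));
    }
}
```

With the old guard the loop body, and with it the redundancy handling, is never reached for
such an rc; with the new guard it is.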
 
Once this change was made, the first NPE went away and the ViewerInit program ran a little
longer, then failed with a second NPE.  As it turns out, the second NPE is intermittent and
very time-sensitive.  When it happens, the failure occurs because the "outerCost" field that
is passed to a query subtree from OptimizerImpl.costPermutation() is null:

        /*
        ** Get the cost of the outer plan so far.  This gives us the current
        ** estimated rows, ordering, etc.
        */
        CostEstimate outerCost;
        if (joinPosition == 0)
        {
            outerCost = outermostCostEstimate;
        }
        else
        {
            /*
            ** NOTE: This is somewhat problematic.  We assume here that the
            ** outer cost from the best access path for the outer table
            ** is OK to use even when costing the sort avoidance path for
            ** the inner table.  This is probably OK, since all we use
            ** from the outer cost is the row count.
            */
            outerCost =
                optimizableList.getOptimizable(
                    proposedJoinOrder[joinPosition - 1]).
                        getBestAccessPath().getCostEstimate();
        }

At this point we expect outerCost to be non-null, but it turns out that there's a bug elsewhere
in the code that leads to a null outerCost here, which is then passed down the tree:

        /* Cost the optimizable at the current join position */
        optimizable.optimizeIt(this,
                               predicateList,
                               outerCost,
                               currentRowOrdering);

Any attempts to access outerCost further down the tree will then result in an NPE.

An example of a query that (intermittently) shows this NPE (against the "Aperi" database):

  SELECT DISTINCT

     ''server:'' || CAST(HOST2PORT.HOST_ID as CHAR) ENTITY_KEY,
     PORT2ZONE.ZONE_ID ZONE_ID

  FROM

     T_VIEW_VHOST2PORT HOST2PORT,
     T_VIEW_PORT2ZONE PORT2ZONE

  WHERE

     HOST2PORT.HOST_ID = ?
     and HOST2PORT.PORT_ID = PORT2ZONE.PORT_ID


The actual bug is in the getNextPermutation() method of the same class (OptimizerImpl):

    // If we were in the middle of a join order when this
    // happened, then reset the join order before jumping.
    // The call to rewindJoinOrder() here will put joinPosition
    // back to 0.  But that said, we'll then end up incrementing 
    // joinPosition before we start looking for the next join
    // order (see below), which means we need to set it to -1
    // here so that it gets incremented to "0" and then
    // processing can continue as normal from there.  Note:
    // we don't need to set reloadBestPlan to true here
    // because we only get here if we have *not* found a
    // best plan yet.
    if (joinPosition > 0)
    {
        rewindJoinOrder();
        joinPosition = -1;
    }

The problem with this code is that it only rewinds the join order if joinPosition is GREATER
than 0--but a joinPosition that is EQUAL to zero indicates that we're "in the middle of a
join order", as well, and thus we need to rewind in that case, too.  If we don't rewind, we
can end up with an invalid join order and that indirectly leads to the NPE mentioned above.

As a brief example, assume we have an optimizable list with two Optimizables in it, O1 and
O2.  Let's also assume that we've just finished optimizing the first one.  So the current
join order will be [O1, -].

Then a timeout occurs, so we enter the block of code in which the above "if" statement sits.
At that point joinPosition will be "0" because we just found the best cost for the first
optimizable and we haven't incremented joinPosition yet.  We'll then "jump" to what we think
is going to be the best join order, which we call "firstLookOrder" (see the code for more
details).  Let's assume firstLookOrder is [O2,O1].  Now, because joinPosition is 0 we won't
enter the above "if" block and thus we will NOT rewind the join order.  So we'll then
increment joinPosition to "1" and we'll choose the optimizable at firstLookOrder[joinPosition]
as the next one in the current join order.  firstLookOrder[1] returns optimizable "O1", which
means that, since we didn't "rewind" the join order, our new current join order becomes [O1,
O1]--which is not a valid join order.
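
The walk-through above can be replayed with a small simulation.  This is only a sketch under
simplifying assumptions, not the OptimizerImpl code: optimizables are plain ints (1 for O1,
2 for O2, -1 for an empty slot), the jump() helper and its rewindAtZero flag are made up for
illustration, and Arrays.fill() stands in for rewindJoinOrder().

```java
import java.util.Arrays;

public class JoinOrderJumpSketch {
    /**
     * Simulates the "jump" to firstLookOrder after a timeout.
     * rewindAtZero == false models "if (joinPosition > 0)";
     * rewindAtZero == true models "if (joinPosition >= 0)".
     */
    static int[] jump(int[] currentOrder, int joinPosition,
                      int[] firstLookOrder, boolean rewindAtZero) {
        int[] order = currentOrder.clone();
        boolean rewind = rewindAtZero ? (joinPosition >= 0) : (joinPosition > 0);
        if (rewind) {
            Arrays.fill(order, -1); // stand-in for rewindJoinOrder()
            joinPosition = -1;      // incremented to 0 on the next step
        }
        // Increment joinPosition, then place firstLookOrder[joinPosition].
        while (joinPosition < order.length - 1) {
            joinPosition++;
            order[joinPosition] = firstLookOrder[joinPosition];
        }
        return order;
    }

    public static void main(String[] args) {
        int[] current = {1, -1};  // [O1, -]
        int[] firstLook = {2, 1}; // [O2, O1]
        System.out.println(Arrays.toString(jump(current, 0, firstLook, false)));
        System.out.println(Arrays.toString(jump(current, 0, firstLook, true)));
    }
}
```

Without the rewind at position 0 the result is [1, 1], i.e. [O1, O1]; with it the result is
[2, 1], i.e. [O2, O1].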

The reason this leads to an NPE is that whenever an optimizable is placed, the best cost estimate
for that optimizable is set to null.  Thus when we place O1 at position "1" we set its best
access path's cost estimate to null.  Then later, when we get to the costPermutation() code
shown above, we take the best cost of the optimizable at position "0" and use that as the
"outerCost" for the optimizable at position "1".  But in this case those two optimizables are
the SAME--namely, O1.  So we effectively nulled out O1's best cost, then we used that very
same (null) cost as the "outerCost" for optimizing O1.  When that outerCost is eventually
referenced later, we end up with the NPE.
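
The aliasing problem can be shown in a few lines.  Again this is a sketch with made-up names
(Opt, place, outerCostFor, outerCostAfterPlacing), assuming only what is described above:
placing an optimizable nulls its best cost, and the outer cost for position 1 is read from
whatever sits at position 0.

```java
public class NullOuterCostSketch {
    /** Hypothetical stand-in for an Optimizable and its best-path cost. */
    static final class Opt {
        final String name;
        Double bestCost; // null models "no cost estimate yet"
        Opt(String name, Double bestCost) {
            this.name = name;
            this.bestCost = bestCost;
        }
    }

    /** Placing an optimizable resets its best cost to null. */
    static void place(Opt[] joinOrder, int position, Opt opt) {
        opt.bestCost = null;
        joinOrder[position] = opt;
    }

    /** The outer cost is the best cost of the previous join position. */
    static Double outerCostFor(Opt[] joinOrder, int joinPosition) {
        return joinOrder[joinPosition - 1].bestCost;
    }

    /** Starts from [first, -], places second at position 1, returns its outer cost. */
    static Double outerCostAfterPlacing(Opt first, Opt second) {
        Opt[] joinOrder = {first, null};
        place(joinOrder, 1, second);
        return outerCostFor(joinOrder, 1);
    }

    public static void main(String[] args) {
        // Invalid order [O1, O1]: placing O1 again wipes the cost we then read.
        Opt o1 = new Opt("O1", 10.0);
        System.out.println("[O1, O1] outer cost: " + outerCostAfterPlacing(o1, o1));

        // Valid order [O2, O1]: O2's cost survives placing O1.
        Opt o2 = new Opt("O2", 5.0);
        Opt freshO1 = new Opt("O1", null);
        System.out.println("[O2, O1] outer cost: " + outerCostAfterPlacing(o2, freshO1));
    }
}
```

The cost values 10.0 and 5.0 are arbitrary; the point is only that the first case yields a
null outer cost and the second does not.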

All of that said, note the above "if" statement is only executed in situations where we have
an optimizer timeout at a very particular point during optimization.  This is why the NPE
is intermittent, and it also explains why it won't reproduce if optimizer timeout is disabled.

The fix for this NPE is a one-line change (plus relevant comment updates):

-    if (joinPosition > 0)
+    /* If we already assigned at least one position in the
+     * join order when this happened (i.e. if joinPosition
+     * is greater than *or equal* to zero; DERBY-1777), then 
+     * reset the join order before jumping.  The call to
+     * rewindJoinOrder() here will put joinPosition back
+     * to 0.  But that said, we'll then end up incrementing
+     * joinPosition before we start looking for the next
+     * join order (see below), which means we need to set
+     * it to -1 here so that it gets incremented to "0" and
+     * then processing can continue as normal from there.  
+     * Note: we don't need to set reloadBestPlan to true
+     * here because we only get here if we have *not* found
+     * a best plan yet.
+     */
+    if (joinPosition >= 0)

I'm attaching a patch, d1777_v1.patch, that makes these two changes to resolve the NPE's discussed
here.  Note, though, that I still need to add an appropriate test case to derbyall.  This
test case will only be for the first NPE; the second NPE is timing-dependent and will not
reproduce with "noTimeout" set to true, so I don't think we'll have a test case for that one.

Also note: with d1777_v1.patch applied, the repro program attached to this Jira still will
not run without error (sigh).  The two NPE's disappear and the test program gets to the "L2"
queries, but at that point the queries take a very (very) long time to compile and then fail
with an ASSERT failure at execution time:

org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED 
sourceResultSetNumber expected to be >= 0 for SWITCH.SWITCH_ID

That is (of course) with SANE jars; I don't know what that translates into for INSANE jars
because I haven't had the time to re-run the queries with insane jars.  It could end up being
an execution-time (as opposed to a compile-time) NPE but I don't know that for sure.  I'm still
investigating.

> Regression: query works in 10.1.2.1 but fails with NullPointerException in 10.2.1.1
> -----------------------------------------------------------------------------------
>
>                 Key: DERBY-1777
>                 URL: http://issues.apache.org/jira/browse/DERBY-1777
>             Project: Derby
>          Issue Type: Bug
>         Environment: WinXP SP2 dualcore 2.8 GHz 2 GBmemory
>            Reporter: Prasenjit Sarkar
>         Assigned To: A B
>             Fix For: 10.2.1.0
>
>         Attachments: Aperi.zip, d1777_v1.patch, Derby1777.zip
>
>
> However, here's a query that works in 10.1.2.1 but not in 10.2.1.1  -- database can be assumed to be the same in Derby - 1205
> SELECT DISTINCT 
> ZONE.ZONE_ID ZONE_ID, 
> PORT.PORT_ID PORT_ID, 
> ENTITY_TO_PORT.TYPE, 
> ENTITY_TO_PORT.PREFIX_ID, 
> ENTITY_TO_PORT.ENTITY_ID, 
> ENTITY_TO_PORT.DISPLAY_NAME, 
> ENTITY_TO_PORT.PORT_DISPLAY_NAME, 
> PORT2ZONE.MEMBER_NAME, 
> PORT2ZONE.ZONE_MEMBER_ID, 
> PORT.PORT_NUMBER 
> FROM 
> T_RES_ZONE ZONE left outer join T_VIEW_PORT2ZONE PORT2ZONE on 
> ZONE.ZONE_ID = PORT2ZONE.ZONE_ID left outer join T_RES_PORT PORT on 
> PORT2ZONE.PORT_ID = PORT.PORT_ID left outer join T_VIEW_ENTITY_TO_PORT ENTITY_TO_PORT on
> PORT2ZONE.PORT_ID = ENTITY_TO_PORT.PORT_ID and 
> PORT2ZONE.ZONE_ID = ENTITY_TO_PORT.ZONE_ID, T_RES_FABRIC FABRIC 
> WHERE PORT2ZONE.ZONE_ID = ZONE.ZONE_ID and 
> ZONE.FABRIC_WWN = FABRIC.FABRIC_WWN and 
> FABRIC.FABRIC_ID = 1 
> Same db as before. 
> In 10.2.1.1 it gives the following error (should this be a new issue?) 
> java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: java.lang.NullPointerExceptionXJ001.U
> at org.apache.derby.client.am.SQLExceptionFactory.getSQLException(Unknown Source)
> at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
> at org.apache.derby.client.am.Connection.prepareStatement(Unknown Source)
> at org.eclipse.aperi.server.guireq.topology.views.ViewerSanL1.init(ViewerSanL1.java:1828)
> at org.eclipse.aperi.server.guireq.topology.views.ViewerInit.init(ViewerInit.java:41)
> at org.eclipse.aperi.server.guireq.topology.views.ViewerInit.main(ViewerInit.java:69)
> Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: java.lang.NullPointerExceptionXJ001.U
> at org.apache.derby.client.am.Statement.completeSqlca(Unknown Source)
> at org.apache.derby.client.net.NetStatementReply.parsePrepareError(Unknown Source)
> at org.apache.derby.client.net.NetStatementReply.parsePRPSQLSTTreply(Unknown Source)
> at org.apache.derby.client.net.NetStatementReply.readPrepareDescribeOutput(Unknown Source)
> at org.apache.derby.client.net.StatementReply.readPrepareDescribeOutput(Unknown Source)
> at org.apache.derby.client.net.NetStatement.readPrepareDescribeOutput_(Unknown Source)
> at org.apache.derby.client.am.Statement.readPrepareDescribeOutput(Unknown Source)
> at org.apache.derby.client.am.PreparedStatement.readPrepareDescribeInputOutput(Unknown Source)
> at org.apache.derby.client.am.PreparedStatement.flowPrepareDescribeInputOutput(Unknown Source)
> at org.apache.derby.client.am.PreparedStatement.prepare(Unknown Source)
> at org.apache.derby.client.am.Connection.prepareStatementX(Unknown Source)
> ... 4 more

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
