Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: <derby-dev@db.apache.org>
Received-SPF: pass (nike.apache.org: domain of rick.hillegas@oracle.com
 designates 141.146.126.69 as permitted sender)
Message-ID: <525D30AA.2050403@oracle.com>
Date: Tue, 15 Oct 2013 05:10:18 -0700
From: Rick Hillegas <rick.hillegas@oracle.com>
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.7; en-US;
 rv:1.9.2.18) Gecko/20110616 Thunderbird/3.1.11
MIME-Version: 1.0
To: derby-dev@db.apache.org
Subject: Re: Store api question: how to ask for RowLocations
References: <524B0663.5040707@oracle.com> <524B4A9D.1080109@gmail.com>
 <524C20E0.5010605@oracle.com> <524F0AED.70006@oracle.com>
In-Reply-To: <524F0AED.70006@oracle.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi Mike,

There's a revised patch, 
derby-3155-03-af-backingStoreHashtableWithRowLocation.diff, waiting for 
your review when you have the cycles.

Thanks,
-Rick

On 10/4/13 11:37 AM, Rick Hillegas wrote:
> Hi Mike,
>
> I have attached a patch to DERBY-3155 which introduces 
> BackingStoreHashtables which include RowLocation information: 
> derby-3155-03-ae-backingStoreHashtableWithRowLocation.diff. I would 
> appreciate your feedback.
>
> Thanks,
> -Rick
>
> On 10/2/13 6:34 AM, Rick Hillegas wrote:
>> Thanks for the quick response, Mike. Some more discussion inline...
>>
>> On 10/1/13 3:20 PM, mike matrigali wrote:
>>> I don't have any good answers here, but maybe some places to look - and
>>> some questions.
>>>
>>> Are you going to ever need RowLocations of rows in an index?  If so 
>>> this
>>> is going to be very new territory and Derby has never done that.  
>>> For a btree the RowLocation would be just the actual row as the 
>>> location is
>>> definined by the key - there is no other quick way given regular row
>>> level locking as the row is free to move from page to page and slot to
>>> slot.  The abstraction of RowLocation was designed to handle this as
>>> we wanted to be able to support a btree base table if necessary, but
>>> no actual implementation was ever done.
>> The MERGE statement shouldn't need the RowLocations of index rows. 
>> MERGE is only interested in the base rows.
>>>
>>> Usually when you see a reference in the code about a RowLocation being
>>> at column "N +1" it is usually an index where the code assumes the
>>> RowLocation at the end of the row is the Rowlocation of the 
>>> associated row in the heap.  So might
>>> be confusing if what you are looking for is the RowLocation of the
>>> current row.  In the case of indexes this row location is actually
>>> stored as the N+1 column so makes sense returning it in the row.
>> Thanks. I can see that avoiding that pattern will reduce confusion.
>>>
>>> You might look at current interfaces that use the 
>>> RowLocationRetRowSource.  I don't think any of these solve your 
>>> current problem but may give insight in how it was handled in the 
>>> past.  This looks like at least one approach in the
>>> past to allow caller access to RowLocations from bulk type scans.  I 
>>> think it is mostly used currently to scan table once and then build 
>>> indexes.  In this case it is left up to the caller to maintain the
>>> separate information about each row.
>> Thanks, I'll take a look at that.
>>>
>>> Is there some write up on the algo needed for merge so that I could
>>> understand the requirements of the interface.   I have not read up
>>> on this project so if it is already documented just point me there.
>> The issue is DERBY-3155. There's a functional spec attached to that 
>> issue. The implementation is evolving as I feel my way forward. A 
>> high level description of the approach I'm trying right now is 
>> described in a 2013-08-20 comment on that issue. In a nutshell, this 
>> is it:
>>
>> o First run a left join to determine the list of rows which need to 
>> be touched.
>> o As the left join is processed, figure out which (if any) MERGE 
>> action applies to each row. Each MERGE action will have its own 
>> temporary table for buffering these rows.
>> o Then use the temporary tables to drive the corresponding MERGE 
>> actions.
>>
>> The RowLocations are needed for the DELETE and UPDATE actions.
>>>
>>> At the interface level a key question is if the generic openScan 
>>> interface needs to change, once that happens a lot of the other 
>>> interfaces need to change also.  There are a lot of interfaces that
>>> were added for better scan performance for a specific need so maybe
>>> this is just another one.
>>>
>>> A clean interface that comes to mind would be to create a new class for
>>> row return that is more than just Object[].  In this case it is likely
>>> 2 fields: Object[] and RowLocation.  Then probably a new type of 
>>> create hash table that create one loaded with these new types of rows.
>>> And then alter the interfaces to build
>>> this extra overhead if necessary.  I like this approach rather than 
>>> adding the "fake" filed onto the end of the row as it avoids bugs
>>> that incorrectly treat the field as a real field for such things as
>>> hashing, sorting, duplicate key determination, ...
>> Thanks, I like that approach.
>>>
>>> It is my understanding that hash tables are one of the key 
>>> performance features of the system
>>> currently so would be nice to not add overhead to the main line path
>>> for this feature if possible.
>> Agreed. That has been my approach so far.
>>
>> Thanks,
>> -Rick
>>>
>>> On 10/1/2013 10:29 AM, Rick Hillegas wrote:
>>>> I need some advice about how to design an api for requesting that the
>>>> Store include RowLocations in the rows that it scans and hands back to
>>>> the language layer.
>>>>
>>>> The immediate problem that I'm working on involves implementing the
>>>> MERGE statement (DERBY-3155). Part of the implementation involves
>>>> cooking up a left join between two tables. I need to get back
>>>> RowLocations for the right table of that join. In a particular problem
>>>> case which I'm examining, the optimizer picks a HashJoin strategy for
>>>> the left join. That turns into a HashLeftOuterJoinResultSet at 
>>>> execution
>>>> time. And that, in turn, involves having the Store create and fill a
>>>> BackingStoreHashTableFromScan.
>>>>
>>>> The BackingStoreHashTableFromScan is created with a scanColumnList (a
>>>> FormatableBitSet) which specifies some actual columns in the row as 
>>>> well
>>>> as a trailing column position which is meant to represent the
>>>> RowLocation. That trailing column position is represented as 1 plus 
>>>> the
>>>> actual row length. BackingStoreHashTableFromScan doesn't know what to
>>>> make of that column position and silently ignores it. So clearly 
>>>> either
>>>> that's the wrong api for asking for RowLocations or
>>>> BackingStoreHashTableFromScan needs to be taught some new tricks.
>>>>
>>>> So the question is this: what's the right way to ask
>>>> BackingStoreHashTableFromScan to build a hash table whose rows contain
>>>> some set of real column positions plus a trailing RowLocation 
>>>> column? I
>>>> may stumble into other situations where I need to ask a scan to put
>>>> RowLocations into the rows it returns. So it would be good to have a
>>>> general pattern here for requesting this special column.
>>>>
>>>> Thanks,
>>>> -Rick
>>>>
>>>>
>>>
>>>
>>
>>
>
>