db-derby-dev mailing list archives

From Rick Hillegas <rick.hille...@oracle.com>
Subject Re: Store api question: how to ask for RowLocations
Date Fri, 04 Oct 2013 18:37:33 GMT
Hi Mike,

I have attached a patch to DERBY-3155 which introduces 
BackingStoreHashtables which include RowLocation information: 
derby-3155-03-ae-backingStoreHashtableWithRowLocation.diff. I would 
appreciate your feedback.

Thanks,
-Rick

On 10/2/13 6:34 AM, Rick Hillegas wrote:
> Thanks for the quick response, Mike. Some more discussion inline...
>
> On 10/1/13 3:20 PM, mike matrigali wrote:
>> I don't have any good answers here, but maybe some places to look - and
>> some questions.
>>
>> Are you going to ever need RowLocations of rows in an index?  If so this
>> is going to be very new territory and Derby has never done that.  For 
>> a btree the RowLocation would be just the actual row as the location is
>> defined by the key - there is no other quick way given regular row
>> level locking as the row is free to move from page to page and slot to
>> slot.  The abstraction of RowLocation was designed to handle this as
>> we wanted to be able to support a btree base table if necessary, but
>> no actual implementation was ever done.
> The MERGE statement shouldn't need the RowLocations of index rows. 
> MERGE is only interested in the base rows.
>>
>> When you see a reference in the code to a RowLocation being at column
>> "N+1", it is usually an index, where the code assumes the RowLocation
>> at the end of the row is the RowLocation of the associated row in the
>> heap.  So it might be confusing if what you are looking for is the
>> RowLocation of the current row.  In the case of indexes this row
>> location is actually stored as the N+1 column, so it makes sense to
>> return it in the row.
> Thanks. I can see that avoiding that pattern will reduce confusion.
>>
>> You might look at current interfaces that use the
>> RowLocationRetRowSource.  I don't think any of these solve your
>> current problem, but they may give insight into how it was handled in
>> the past.  It looks like at least one past approach to giving the
>> caller access to RowLocations from bulk type scans.  I think it is
>> mostly used currently to scan a table once and then build indexes.  In
>> this case it is left up to the caller to maintain the separate
>> information about each row.
> Thanks, I'll take a look at that.
>>
>> Is there a write-up of the algorithm needed for MERGE so that I can
>> understand the requirements of the interface?  I have not read up
>> on this project, so if it is already documented just point me there.
> The issue is DERBY-3155. There's a functional spec attached to that
> issue. The implementation is evolving as I feel my way forward. The
> approach I'm trying right now is described at a high level in a
> 2013-08-20 comment on that issue. In a nutshell, this is it:
>
> o First run a left join to determine the list of rows which need to be 
> touched.
> o As the left join is processed, figure out which (if any) MERGE 
> action applies to each row. Each MERGE action will have its own 
> temporary table for buffering these rows.
> o Then use the temporary tables to drive the corresponding MERGE actions.
>
> The RowLocations are needed for the DELETE and UPDATE actions.
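
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DERBY-3155 implementation: the class and method names are hypothetical, a plain Long stands in for a RowLocation, in-memory lists stand in for the per-action temporary tables, and the rule that a negative value selects DELETE is invented purely to have two matched actions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeBufferingSketch {

    // One buffered row. targetLocation is a hypothetical stand-in for a
    // RowLocation; it is null for rows that will be inserted.
    static final class Buffered {
        final int key;
        final int value;
        final Long targetLocation;

        Buffered(int key, int value, Long targetLocation) {
            this.key = key;
            this.value = value;
            this.targetLocation = targetLocation;
        }
    }

    // One buffer per MERGE action (lists stand in for temporary tables).
    static final class Buffers {
        final List<Buffered> inserts = new ArrayList<>();
        final List<Buffered> updates = new ArrayList<>();
        final List<Buffered> deletes = new ArrayList<>();
    }

    // Steps 1 and 2: left join each source row {key, value} to the target
    // and route it to the buffer of the action that applies.
    static Buffers route(int[][] sourceRows, Map<Integer, Long> targetLocations) {
        Buffers buffers = new Buffers();
        for (int[] row : sourceRows) {
            int key = row[0];
            int value = row[1];
            // Left join: null means no target row matched this source row.
            Long location = targetLocations.get(key);
            if (location == null) {
                buffers.inserts.add(new Buffered(key, value, null));
            } else if (value < 0) {
                // Invented rule for this sketch: negative value means DELETE.
                buffers.deletes.add(new Buffered(key, value, location));
            } else {
                buffers.updates.add(new Buffered(key, value, location));
            }
        }
        return buffers;
    }

    public static void main(String[] args) {
        Map<Integer, Long> target = new HashMap<>();
        target.put(1, 100L);
        target.put(2, 200L);
        Buffers buffers = route(new int[][] {{1, 10}, {2, -1}, {3, 30}}, target);
        // Step 3 (not shown): each buffer would drive its own action.
        System.out.println("inserts=" + buffers.inserts.size()
            + " updates=" + buffers.updates.size()
            + " deletes=" + buffers.deletes.size());
    }
}
```

Only the UPDATE and DELETE buffers carry a location in this sketch, which mirrors why the RowLocations are needed for those two actions and not for INSERT.
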
>>
>> At the interface level a key question is whether the generic openScan
>> interface needs to change; once that happens a lot of the other
>> interfaces need to change also.  There are a lot of interfaces that
>> were added for better scan performance for a specific need so maybe
>> this is just another one.
>>
>> A clean interface that comes to mind would be to create a new class for
>> row return that is more than just Object[].  In this case it is likely
>> 2 fields: Object[] and RowLocation.  Then probably a new type of
>> create-hash-table call that creates one loaded with these new types of
>> rows.  And then alter the interfaces to build this extra information
>> only when it is needed.  I like this approach rather than adding the
>> "fake" field onto the end of the row as it avoids bugs that incorrectly
>> treat the field as a real field for such things as hashing, sorting,
>> duplicate key determination, ...
> Thanks, I like that approach.
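
A minimal sketch of that two-field idea, assuming nothing about the real Store classes: Location, LocatedRow and LocatedRowHashTable are hypothetical names made up for illustration, Location is a stand-in rather than Derby's RowLocation, and a java.util.HashMap stands in for the backing-store hash table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocatedRowDemo {

    // Hypothetical stand-in for a row location; NOT Derby's RowLocation.
    static final class Location {
        final long page;
        final int slot;

        Location(long page, int slot) {
            this.page = page;
            this.slot = slot;
        }
    }

    // The row's real columns plus its location, held in a separate field so
    // the location is never mistaken for an extra trailing column.
    static final class LocatedRow {
        final Object[] columns;
        final Location location;

        LocatedRow(Object[] columns, Location location) {
            this.columns = columns;
            this.location = location;
        }
    }

    // Hypothetical hash table of located rows, keyed on one real column.
    static final class LocatedRowHashTable {
        private final int keyColumn;
        private final Map<Object, List<LocatedRow>> buckets = new HashMap<>();

        LocatedRowHashTable(int keyColumn) {
            this.keyColumn = keyColumn;
        }

        // Hashing looks only at a real column, never at the location.
        void put(LocatedRow row) {
            buckets.computeIfAbsent(row.columns[keyColumn], k -> new ArrayList<>()).add(row);
        }

        // Returns every row whose key column equals the probe key.
        List<LocatedRow> get(Object key) {
            List<LocatedRow> rows = buckets.get(key);
            return rows == null ? Collections.<LocatedRow>emptyList() : rows;
        }
    }

    public static void main(String[] args) {
        LocatedRowHashTable table = new LocatedRowHashTable(0);
        table.put(new LocatedRow(new Object[] {42, "widget"}, new Location(7L, 3)));
        for (LocatedRow row : table.get(42)) {
            System.out.println(row.columns.length + " columns at page "
                + row.location.page + ", slot " + row.location.slot);
        }
    }
}
```

Because the location lives outside the Object[], code that hashes, sorts or compares the columns sees only real fields, which is the bug-avoidance argument made above.
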
>>
>> It is my understanding that hash tables are currently one of the key
>> performance features of the system, so it would be nice not to add
>> overhead to the main line path for this feature if possible.
> Agreed. That has been my approach so far.
>
> Thanks,
> -Rick
>>
>> On 10/1/2013 10:29 AM, Rick Hillegas wrote:
>>> I need some advice about how to design an api for requesting that the
>>> Store include RowLocations in the rows that it scans and hands back to
>>> the language layer.
>>>
>>> The immediate problem that I'm working on involves implementing the
>>> MERGE statement (DERBY-3155). Part of the implementation involves
>>> cooking up a left join between two tables. I need to get back
>>> RowLocations for the right table of that join. In a particular problem
>>> case which I'm examining, the optimizer picks a HashJoin strategy for
>>> the left join. That turns into a HashLeftOuterJoinResultSet at
>>> execution time. And that, in turn, involves having the Store create
>>> and fill a BackingStoreHashTableFromScan.
>>>
>>> The BackingStoreHashTableFromScan is created with a scanColumnList (a
>>> FormatableBitSet) which specifies some actual columns in the row as
>>> well as a trailing column position which is meant to represent the
>>> RowLocation. That trailing column position is represented as 1 plus the
>>> actual row length. BackingStoreHashTableFromScan doesn't know what to
>>> make of that column position and silently ignores it. So clearly either
>>> that's the wrong api for asking for RowLocations or
>>> BackingStoreHashTableFromScan needs to be taught some new tricks.
>>>
>>> So the question is this: what's the right way to ask
>>> BackingStoreHashTableFromScan to build a hash table whose rows contain
>>> some set of real column positions plus a trailing RowLocation column? I
>>> may stumble into other situations where I need to ask a scan to put
>>> RowLocations into the rows it returns. So it would be good to have a
>>> general pattern here for requesting this special column.
>>>
>>> Thanks,
>>> -Rick
>>>
>>>
>>
>>
>
>

