db-derby-user mailing list archives

From Rick Hillegas <Richard.Hille...@Sun.COM>
Subject Re: Question about TableFunctions in Derby
Date Mon, 24 Aug 2009 16:54:34 GMT
Chris Goodacre wrote:
> Rick, your response is exactly as I would have expected.  I assumed the filtering of
data based on the join would have to occur "above" the individual queries (though admittedly,
I'm no expert in RDBMS theory).  
>
> So how do we / I proceed at this point?   I'm new to this project/list.  Shall I enter
this as an enhancement request in Jira?
Thanks Chris. That would be a great next step. Then we can continue 
discussion on that JIRA and on the developer's list. I can attach a 
functional spec to the issue, hopefully later this week. I'd like to 
spend some time on this for the next release and I think that my current 
management would be OK with that. However, I need to warn you that my 
company is in the process of being acquired and I don't know what the 
new management's priorities will be.
>    Do you want/need any help with the actual enhancement?
>   
Yes, please. Can you volunteer for any of the following:

1) Review the functional spec.

2) Test-drive snapshots of the implementation as they become available.

3) Write some of the code--with of course lots of support from me and 
other Derby developers.

Thanks,
-Rick
> -chris   
>
>
>
> ----- Original Message ----
> From: Rick Hillegas <Richard.Hillegas@Sun.COM>
> To: Derby Discussion <derby-user@db.apache.org>
> Sent: Monday, August 24, 2009 10:29:47 AM
> Subject: Re: Question about TableFunctions in Derby
>
> Hi Chris,
>
> A response inline...
>
> Chris Goodacre wrote:
>   
>> Rick,
>>
>> I think a String[] would be just fine. While a String[] is probably slower than a BitSet,
I did a quick test looking at different comparisons, and matching 205 strings out of 1,646,085
(even using equalsIgnoreCase) only takes about 39 ms on my machine.
>>
>> I think we've completed the first example.  I have one other that I'd like to look
at:
>>
>> select h.price from TABLE ( legacy_realty_data() ) h, chicago_zipcodes z
>> where h.zipcode = z.zipcode
>>
>>
>> In this case, I would get a single String in the referencedColumns array.  What (if anything)
would be in the ScanQualifier[]?
>>  
>>     
> I think that the ScanQualifier[] argument would be null in the first 
> increment of this work. The optimizer has a strong bias to put the table 
> function in the outermost slot of the join order. In order to limit the 
> table function scan with the "h.zipcode = z.zipcode" clause, the 
> optimizer would need some better heuristics for figuring out when to put 
> the table function in an inner slot. That could be a follow-on effort.
>
> Hope this seems sensible and useful to you,
> -Rick
>   
>> chris
>>
>>
>>
>> ----- Original Message ----
>> From: Rick Hillegas <Richard.Hillegas@Sun.COM>
>> To: Derby Discussion <derby-user@db.apache.org>
>> Sent: Friday, August 21, 2009 2:07:01 PM
>> Subject: Re: Question about TableFunctions in Derby
>>
>> Hi Chris,
>>
>> Thanks for the quick response. Some comments inline...
>>
>> Chris Goodacre wrote:
>>  
>>     
>>> OK - I think I get the basics of your approach: rather than alter the signature
of the table function (which is arbitrary and bound as part of the create function statement),
you are (optionally) asking the developer to implement a hook that Derby calls prior to actually
requesting data (the initScan method). So far, so good. I like the approach.
>>>
>>> I have a couple of areas I want to make sure I understand:
>>>
>>> (a) The FormatableBitSet would be a column-ordered based set of flags indicating
whether the column was included in the select statement, correct?   The column index would
presumably be based on the order of occurrence of the column in the returns table () subclause
of the create function statement.   This introduces a dependency into my code, but I don't
have enough experience to know whether I just got lucky not needing the dependency yet.  Maybe
other common implementations of the table function/result set already need to know what order
the columns were defined in the table function.
>>>  
>>>    
>>>       
>> Would it be better if referencedColumns were a String[]? That's fine too. There's
only one subtlety with that approach: what you would get would be an array of SQL identifiers.
That means that the strings would be uppercased unless you bothered to double-quote the column
names at CREATE FUNCTION time.
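A small sketch of what that subtlety means in practice (class and method names here are invented for illustration): matching the incoming identifiers case-insensitively tolerates both quoted and unquoted declarations.

```java
import java.util.Arrays;

class IdentifierMatch {
    // SQL identifiers come back uppercased unless double-quoted at CREATE FUNCTION
    // time, so a column declared unquoted as zipcode would arrive as "ZIPCODE".
    static boolean isReferenced(String[] referencedColumns, String name) {
        return Arrays.stream(referencedColumns).anyMatch(c -> c.equalsIgnoreCase(name));
    }
}
```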
>>
>> Would some other type be better for referencedColumns?
>>  
>>     
>>> (b) I get steps 1 & 2, but in step 3, are you saying that would be an alteration
of the current create function statement (like either of the examples below), or a separate
step entirely, or neither?
>>>  
>>>    
>>>       
>> I was just being pedantic. The CREATE FUNCTION statement would not change. It would
look exactly like it does today.
>>
>> Hope we're converging on something useful.
>>
>> Thanks,
>> -Rick
>>  
>>     
>>>     e.g. 1:  create function legacy_realty_data()
>>>                returns table
>>>                (
>>>                  postal_code varchar(10),
>>>                  price numeric(10,2)
>>>                )
>>>                language java
>>>                parameter style DERBY_SCANNABLE_JDBC_RESULT_SET -- flags implementation of this DerbyScan interface
>>>                no sql
>>>                external name 'com.cg.example.LegacyHousingDataFunction.read'
>>>
>>>     e.g. 2:  create function legacy_realty_data()
>>>                returns table
>>>                (
>>>                  postal_code varchar(10),
>>>                  price numeric(10,2)
>>>                )
>>>                language java
>>>                parameter style DERBY_JDBC_RESULT_SET
>>>                no sql
>>>                external name 'com.cg.example.LegacyHousingDataFunction.read'
>>>                returns DerbyScan
>>>
>>> Do I understand what you're proposing?  
>>> -chris
>>>
>>> ----- Original Message ----
>>> From: Rick Hillegas <Richard.Hillegas@Sun.COM>
>>> To: Derby Discussion <derby-user@db.apache.org>
>>> Sent: Friday, August 21, 2009 10:12:42 AM
>>> Subject: Re: Question about TableFunctions in Derby
>>>
>>> Hi Chris,
>>>
>>> Thanks for following up on this. How does something like the following sound?
I'm going to phrase this in terms of existing classes. I suspect that we'd want to use some
simpler classes if we implemented this--for instance, I'm not happy with ScanQualifier: as
your example points out, ScanQualifier pulls in the Derby type system, which is a lot of machinery
you don't need to worry about. But here's a sketch:
>>>
>>> 1) Derby would expose some interface which your ResultSet would implement:
>>>
>>> public interface DerbyScan
>>> {
>>>     /**
>>>      * Setup that is called before Derby calls next() to get the first row.
>>>      *
>>>      * @param referencedColumns Columns which Derby may retrieve. These are column positions as declared at CREATE FUNCTION time.
>>>      * @param restriction Array of simple comparisons to constant values. Each comparison applies to a single column.
>>>      */
>>>     public void initScan( FormatableBitSet referencedColumns, ScanQualifier[] restriction ) throws SQLException;
>>> }
>>>
>>> 2) You would code something like this:
>>>
>>> public class MyResultSet implements ResultSet, DerbyScan { ... }
>>>
>>> 3) And your CREATE FUNCTION statement would bind a function to a method like
the following, which you would code also:
>>>
>>> public static MyResultSet legacyRealtyData() throws SQLException { ... }
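The three steps in Rick's sketch can be put together roughly as follows. This is only a compilable illustration of the proposed shape, not Derby's actual API: java.util.BitSet stands in for FormatableBitSet, Object[] for ScanQualifier[], and the full java.sql.ResultSet boilerplate is elided.

```java
import java.sql.SQLException;
import java.util.BitSet;

// Step 1: the hook Derby would expose (stand-in types, as noted above).
interface DerbyScan {
    void initScan(BitSet referencedColumns, Object[] restriction) throws SQLException;
}

// Step 2: the user's ResultSet also implements the scan hook. A real version
// would additionally implement java.sql.ResultSet.
class MyResultSet implements DerbyScan {
    private BitSet referenced;

    public void initScan(BitSet referencedColumns, Object[] restriction) {
        // Remember which columns Derby may ask for, before next() is ever called.
        this.referenced = referencedColumns;
    }

    boolean isColumnReferenced(int pos) {
        return referenced != null && referenced.get(pos);
    }
}

// Step 3: the static method that the CREATE FUNCTION statement binds to.
class LegacyHousingDataFunction {
    public static MyResultSet legacyRealtyData() throws SQLException {
        return new MyResultSet();
    }
}
```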
>>>
>>>
>>> Is this headed in the right direction?
>>>
>>> Thanks,
>>> -Rick
>>>
>>> Chris Goodacre wrote:
>>>  
>>>    
>>>       
>>>> Rick,
>>>> Sorry it's taken me so long to reply on this.  I just today got back to this
in earnest.   I'll try to walk through an example, imagining that I have an array of ScanQualifiers
that gets passed to my table function's method, just to make sure I understand this.
>>>>
>>>> public static ResultSet read(ScanQualifier[] qualifiers) {
>>>>    // ... impl
>>>> }
>>>>
>>>> So, if I were to go back to my original example:
>>>>
>>>> select house_number, street, city from table (legacy_realty_data()) where price < 500000
>>>>
>>>> a) I think that an array with only a single ScanQualifier object would be
passed to my read(...) method.
>>>> b) I can see where the operator for the ScanQualifier object would be some
negative number
>>>> c) The column id would reference the column # (basically) of the price column
from the table definition of my CREATE FUNCTION statement.
>>>> d) The result of getOrderable() on the ScanQualifier object would return me a DataValueDescriptor.
>>>> e) I could interrogate the DataValueDescriptor to get the value (500000) in a type/manner that I could use to pass on to my legacy system.
>>>>
>>>> I could use this information to restrict the number of rows that come back.
  That's good.
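Steps (a) through (e) above can be sketched as follows. The interfaces here are hypothetical stand-ins mirroring the relevant slice of Derby's ScanQualifier and DataValueDescriptor (which aren't on the classpath for this sketch, and whose real signatures differ, e.g. in checked exceptions).

```java
// Stand-in for org.apache.derby.iapi.types.DataValueDescriptor.
interface DataValueDescriptor {
    int getInt();
}

// Stand-in for org.apache.derby.iapi.sql.execute.ScanQualifier.
interface ScanQualifier {
    int getColumnId();                   // column # from the RETURNS TABLE clause
    DataValueDescriptor getOrderable();  // the constant value being compared against
}

class QualifierWalk {
    // Extract "column position -> constant" pairs to forward to the legacy system.
    static java.util.Map<Integer, Integer> constants(ScanQualifier[] qualifiers) {
        java.util.Map<Integer, Integer> out = new java.util.HashMap<>();
        for (ScanQualifier q : qualifiers) {
            out.put(q.getColumnId(), q.getOrderable().getInt());
        }
        return out;
    }
}
```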
>>>>
>>>> It would still be nice if I could restrict the number of columns I'm requesting
up front.   It's expensive to go back and forth to this system, so I would rather make one
read (all relevant rows, all relevant columns) and take the chance that the user only uses
some of the rows from the result set.
>>>>
>>>> Would it be possible to use a ScanQualifier (or something like it) to inform
the table procedure methods which specific (non-calculated) columns are in the query?  -chris
>>>>
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: Rick Hillegas <Richard.Hillegas@Sun.COM>
>>>> To: Derby Discussion <derby-user@db.apache.org>
>>>> Sent: Monday, July 20, 2009 3:08:31 PM
>>>> Subject: Re: Question about TableFunctions in Derby
>>>>
>>>> Hi Chris,
>>>>
>>>> Reducing the number of column probes may be possible without any changes
to Derby: When your ResultSet is asked to get a column, it can remember that request. On later
rows, your ResultSet can ask the external data source for all of the column positions it has
remembered so far. In the query you gave, this would play out like this:
>>>>
>>>> 1) On the first row, your ResultSet would make 3 calls to the external data
source, one for each column. But the ResultSet would remember which columns were requested.
>>>>
>>>> 2) For each of the remaining N-1 rows, your ResultSet would call the external
data source only once, asking the external data source for all three columns in a single batch.
That batch could then be cached and the individual columns could be returned to Derby when
Derby called the getXXX() methods.
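The remember-and-batch strategy described above might look roughly like this (all names invented; LegacySource stands in for the proprietary messaging interface to the external system):

```java
import java.util.*;

class ColumnBatcher {
    interface LegacySource {
        Object fetchColumn(int row, int col);                           // one probe
        Map<Integer, Object> fetchColumns(int row, Set<Integer> cols);  // batched probe
    }

    private final LegacySource source;
    private final Set<Integer> remembered = new TreeSet<>();
    private Map<Integer, Object> rowCache = new HashMap<>();
    private int currentRow = -1;
    int probes = 0;  // instrumentation for the sketch

    ColumnBatcher(LegacySource source) { this.source = source; }

    // Called from ResultSet.next(): pre-fetch every remembered column in one batch.
    void nextRow() {
        currentRow++;
        if (!remembered.isEmpty()) {
            rowCache = new HashMap<>(source.fetchColumns(currentRow, remembered));
            probes++;
        } else {
            rowCache = new HashMap<>();
        }
    }

    // Called from ResultSet.getXXX(col): serve from cache, else probe and remember.
    Object getColumn(int col) {
        if (!rowCache.containsKey(col)) {
            rowCache.put(col, source.fetchColumn(currentRow, col));
            remembered.add(col);
            probes++;
        }
        return rowCache.get(col);
    }
}
```

On the first row this makes one probe per requested column; on every later row it makes a single batched probe, exactly as in steps 1) and 2) above.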
>>>>
>>>> Positioning and restricting the rows themselves (the WHERE clause fragments)
is trickier. It probably requires help from Derby, as you suggest. We could design some interface
by which Derby would pass the ResultSet a list of org.apache.derby.iapi.sql.execute.ScanQualifier.
Your ResultSet could then forward those directives to the external data source.
>>>>
>>>> What do you think?
>>>> -Rick
>>>>
>>>>
>>>> Chris Goodacre wrote:
>>>>      
>>>>      
>>>>         
>>>>> Rick, thanks for your suggestions.   Perhaps I am being obtuse, but when
you say ...
>>>>>
>>>>> "Since you have only asked for 3 columns, that's all that Derby will
>>>>> request from the ResultSet instantiated by your table function. That
>>>>> is, Derby is only going to call ResultSet.getXXX() on the house_number,
>>>>> street, and city columns. That should behave efficiently provided that
>>>>> your ResultSet is smart enough to only fault-in columns for which a
>>>>> getXXX() is called."
>>>>>
>>>>> Does that mean that I make a separate request to the legacy system each
time getXXX() is called - i.e. lazily initialize each column in the result set?    I think
this has to be the only way to do it, since I don't know which columns will be requested at
the time the read() method of my tablefunction is invoked.  Making (in this case) 3 calls
to the legacy system to get 1 column for N rows is certainly better than making 1 call to
the legacy system to get 1000 columns for N rows and then throwing away 997*N values/cells,
but still not quite as nice as I'd like.  If I were making a wish - I'd wish for some sort
of parsed representation of the query get passed to the read method (or to some other method
- similar to, or even as part of, the query optimization interface).    Ideally, this structured
representation would have the list of columns belonging to the table function from the select
list, and the where clause components specific to the table
>>>>> function only (i.e. mytablefunction.price > 50000 but NOT mytablefunction.price < myrealtable.value).
>>>>> In the absence of this, when the VTIResultSet class passes the ActivationHolder
to the Derby class which invokes the read() method reflectively, why can't that class (it knows
it is dealing with a Derby table function, it knows the class name, and it has access to the
result set descriptor, if not the where clause) pass the activation context along to the user's
table function class? I would happily implement an interface in this class (though I'm not sure
why read() has to be static) to get this information prior to ResultSet instantiation.
>>>>>
>>>>> -chris
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message ----
>>>>> From: Rick Hillegas <Richard.Hillegas@Sun.COM>
>>>>> To: Derby Discussion <derby-user@db.apache.org>
>>>>> Sent: Monday, July 20, 2009 10:55:33 AM
>>>>> Subject: Re: Question about TableFunctions in Derby
>>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Some comments inline...
>>>>>
>>>>> Chris Goodacre wrote:
>>>>>            
>>>>>        
>>>>>           
>>>>>> I've read the Derby developer's guide and Rick Hillegas's informative
white paper (http://developers.sun.com/javadb/reference/whitepapers/sampleTableFunctions/doc/TableFunctionsWhitePaper.html)
on Table Functions, but am still struggling with the following issue:
>>>>>>
>>>>>> I am trying to create an RDB abstraction for a large CICS/VSAM-based
legacy system and blend it with our newer, RDB-based tier.  This seems like a good application
of TableFunctions.  The VSAM data is made available to me via an IP-based proprietary messaging
interface.  There are lots of different files here, but due to some historical forces, most
of the data I'm interested in resides in 4 VSAM files.
>>>>>>
>>>>>> Unfortunately, each of those VSAM files has over 1,000 fields in it.
>>>>>>
>>>>>> Now eventually, it might be possible to fully model a single VSAM
file into (for the sake of argument) 50 tables; each table/row representing a small slice
of a single VSAM record.
>>>>>>
>>>>>> In the meantime, for both this proof-of-concept and as a migration
path to our existing clients, I'd like to represent each VSAM file as a table (subject to
the 1024 column SQL limitation per table).  This will be a highly-denormalized and decidedly
non-relational view of the data, but it will be easy to demonstrate and immediately recognizable
to our customers.
>>>>>>
>>>>>> However, I can't seem to get around the problem of data granularity.
 For example, if my customer executes:
>>>>>>
>>>>>> select house_number, street, city from table (legacy_realty_data())
where price < 500000
>>>>>>                    
>>>>>>          
>>>>>>             
>>>>> Since you have only asked for 3 columns, that's all that Derby will request
from the ResultSet instantiated by your table function. That is, Derby is only going to call
ResultSet.getXXX() on the house_number, street, and city columns. That should behave efficiently
provided that your ResultSet is smart enough to only fault-in columns for which a getXXX()
is called.
>>>>>
>>>>> The WHERE clause is a little trickier. You are right, Derby will read
all rows from the ResultSet and throw away the rows which don't satisfy the WHERE clause.
What you want to do is push the qualification through the table function to the external data
source. I don't see any way to do this other than adding some more arguments to your table
function. For instance, if you could push the qualification through to the external data source,
then you could get efficient behavior from something like the following:
>>>>>
>>>>> select house_number, street, city
>>>>> from table( legacy_realty_data( 500000 ) ) s;
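The extra-argument workaround could be sketched like this. Everything here (the Listing class, fetchFromLegacy, the sample data) is invented for illustration; a real table function would wrap this list in a java.sql.ResultSet, and the filter stands in for pushing "price < maxPrice" through the messaging interface to the VSAM system.

```java
import java.util.*;
import java.util.stream.*;

class LegacyRealtyData {
    static final class Listing {
        final String houseNumber, street, city;
        final int price;
        Listing(String h, String s, String c, int p) {
            houseNumber = h; street = s; city = c; price = p;
        }
    }

    // The qualification arrives as an explicit argument, so rows can be
    // filtered at the external source instead of being discarded by Derby.
    static List<Listing> fetchFromLegacy(int maxPrice) {
        List<Listing> all = Arrays.asList(
            new Listing("12", "Oak St", "Springfield", 450_000),
            new Listing("98", "Elm Ave", "Springfield", 750_000));
        return all.stream()
                  .filter(l -> l.price < maxPrice)
                  .collect(Collectors.toList());
    }
}
```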
>>>>>
>>>>> Hope this helps,
>>>>> -Rick
>>>>>
>>>>>            
>>>>>        
>>>>>           
>>>>>> I don't appear to have any visibility into the actual query inside
my legacy_realty_data TableFunction, so I have to go get all 1000 fields for however many
listings are present where price < 500000 even though only three columns will be requested.
Am I missing something?  Aside from having the user repeat the columns as parameters to the
table function (which looks awkward, to say the least), I can't see a way around this based
on my limited knowledge of Derby.
>>>>>>
>>>>>> Is there a way to only retrieve the columns that the user is querying
for?
>>>>>>
>>>>>> Looking forward to your help/advice.
>>>>>>
>>>>>> -chris
>>>>>>
>>>>>>                    
>>>>>>          
>>>>>>             


