Return-Path: Delivered-To: apmail-db-derby-user-archive@www.apache.org Received: (qmail 62171 invoked from network); 14 Nov 2005 16:13:20 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 14 Nov 2005 16:13:20 -0000 Received: (qmail 13134 invoked by uid 500); 14 Nov 2005 16:13:19 -0000 Delivered-To: apmail-db-derby-user-archive@db.apache.org Received: (qmail 13107 invoked by uid 500); 14 Nov 2005 16:13:19 -0000 Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: List-Post: List-Id: Reply-To: "Derby Discussion" Delivered-To: mailing list derby-user@db.apache.org Received: (qmail 13096 invoked by uid 99); 14 Nov 2005 16:13:18 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 14 Nov 2005 08:13:18 -0800 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (asf.osuosl.org: domain of kh@araxis.com designates 195.10.109.13 as permitted sender) Received: from [195.10.109.13] (HELO smtp.araxis.com) (195.10.109.13) by apache.org (qpsmtpd/0.29) with SMTP; Mon, 14 Nov 2005 08:13:10 -0800 Received: (qmail 88950 invoked from network); 14 Nov 2005 16:12:55 -0000 Received: from unknown (HELO ?127.0.0.1?) (10.8.0.134) by smtp.araxis.com with SMTP; 14 Nov 2005 16:12:55 -0000 Message-ID: <4378B785.3070900@araxis.com> Date: Mon, 14 Nov 2005 16:12:53 +0000 From: Kevin Hore User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Derby Discussion Subject: Re: Poor query optimizer choices is making Derby unusable for large tables References: <4374B1E7.2050506@araxis.com> <437503FE.8010705@Sourcery.Org> In-Reply-To: <437503FE.8010705@Sourcery.Org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N Hi Satheesh Satheesh Bandaram wrote: > Hi Kevin, > > Kevin Hore wrote: > > >>i) Does anyone have any plans to fix this problem? > > > Can you file an enhancement request for this? I think Derby could > improve it's handling of OR/IN clauses. Many databases don't optimize OR > clauses as much as possible, though some do better than others. It would > be great if Derby could internally process this as two different scans > (one for 'CONTACT' and another for 'ADD') and then combine the results. > Some databases can do this. However, the workaround suggested by Jeff L. > does this, though you have to rewrite the query. I've commented on the re-write solution elsewhere. Regarding an enhancement, I think DERBY-47 pretty much covers the problem, but I'll add a comment to that describing my circumstances. > > Satheesh > > >>ii) In the meantime, are there any work-arounds? I�d appreciate any >>suggestions that would decrease the execution time of my second query >>below (the one with with two search terms). Likewise, any general >>strategies for avoiding this problem with IN clauses would be >>appreciated. >> >> >>----PROBLEM DESCRIPTION---- >> >>Consider the table: >> >>CREATE TABLE tblSearchDictionary >>( >>ObjectId int NOT NULL, >>ObjectType int NOT NULL, >>Word VARCHAR(64) NOT NULL, >>WordLocation int NOT NULL, >>CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation) >>); >> >>This table has an index on each of the four columns, it also has the >>unique index across all four columns as defined above: >> >>CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary >>(ObjectId); >>CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary >>(ObjectType); >>CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word); >>CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary >>(WordLocation); >> >>The table contains about 260,000 rows. >> >>The following query selects all rows that match instances of string in >>the Word column. It sums the WordLocation column having grouped by the >>ObjectId column. >> >>SELECT ObjectId, SUM(WordLocation) AS Score >>FROM tblSearchDictionary >>WHERE Word = 'CONTACT' >>GROUP BY ObjectId; >> >>On my machine this will usually complete in an acceptable time of >>around 200ms. >> >>Now consider the following query which adds a second search term on >>the same column. >> >>SELECT ObjectId, SUM(WordLocation) AS Score >>FROM tblSearchDictionary >>WHERE Word = 'CONTACT' OR Word = 'ADD' >>GROUP BY ObjectId; >> >>This second query usually takes around 10000ms on my machine. My >>understanding from the Derby optimizer docs and DERBY-47 is that this >>is because Derby is re-writing the query along the following lines, >>and then choosing to do a table scan: >> >>SELECT ObjectId, SUM(WordLocation) AS Score >>FROM tblSearchDictionary >>WHERE >> Word IN ('CONTACT', 'ADD') >> AND Word >= 'ADD' >> AND Word <= 'CONTACT' >>GROUP BY ObjectId; >> >>The plan for the first query indicates that the tblSearchDictionaryWord >>index is used to perform an index scan. However, the plan for the second >>query indicates that the majority of the additional time is taken >>performing a table scan over the entire table, instead of making use of >>the indexes available. Our application uses IN quite frequently, so >>this optimizer behaviour would seem to present a significant problem. >> >>---QUERY PLAN FOR FIRST QUERY---- >> >>Statement Name: >> null >>Statement Text: >> SELECT >> ObjectId, >> SUM(WordLocation) AS Score >>FROM tblSearchDictionary >>WHERE >> Word = 'CONTACT' >>GROUP BY >> ObjectId >> >>Parse Time: 0 >>Bind Time: 0 >>Optimize Time: 16 >>Generate Time: 0 >>Compile Time: 16 >>Execute Time: 0 >>Begin Compilation Timestamp : 2005-11-11 12:28:52.765 >>End Compilation Timestamp : 2005-11-11 12:28:52.781 >>Begin Execution Timestamp : 2005-11-11 13:08:15.406 >>End Execution Timestamp : 2005-11-11 13:08:15.406 >>Statement Execution Plan Text: >>Project-Restrict ResultSet (5): >>Number of opens = 1 >>Rows seen = 93 >>Rows filtered = 0 >>restriction = false >>projection = true >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> restriction time (milliseconds) = 0 >> projection time (milliseconds) = 0 >> optimizer estimated row count: 1.00 >> optimizer estimated cost: 226.00 >> >>Source result set: >> Grouped Aggregate ResultSet: >> Number of opens = 1 >> Rows input = 113 >> Has distinct aggregate = false >> In sorted order = false >> Sort information: >> Number of rows input=113 >> Number of rows output=93 >> Sort type=internal >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> optimizer estimated row count: 1.00 >> optimizer estimated cost: 226.00 >> >> Source result set: >> Project-Restrict ResultSet (4): >> Number of opens = 1 >> Rows seen = 113 >> Rows filtered = 0 >> restriction = false >> projection = true >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> restriction time (milliseconds) = 0 >> projection time (milliseconds) = 0 >> optimizer estimated row count: 118.00 >> optimizer estimated cost: 226.00 >> >> Source result set: >> Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY: >> Number of opens = 1 >> Rows seen = 113 >> Columns accessed from heap = {0, 3} >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> optimizer estimated row count: 118.00 >> optimizer estimated cost: 226.00 >> >> Index Scan ResultSet for TBLSEARCHDICTIONARY using index >>TBLSEARCHDICTIONARYWORD at read committed isolation level using share >>row locking chosen by the optimizer >> Number of opens = 1 >> Rows seen = 113 >> Rows filtered = 0 >> Fetch Size = 1 >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> next time in milliseconds/row = 0 >> >> scan information: >> Bit set of columns fetched=All >> Number of columns fetched=2 >> Number of deleted rows visited=0 >> Number of pages visited=4 >> Number of rows qualified=113 >> Number of rows visited=114 >> Scan type=btree >> Tree height=3 >> start position: >> >= on first 1 column(s). >> Ordered null semantics on the following columns: >>0 >> stop position: >> > on first 1 column(s). >> Ordered null semantics on the following columns: >>0 >> qualifiers: >>None >> optimizer estimated row count: 118.00 >> optimizer estimated cost: 226.00 >> >> >>---QUERY PLAN FOR SECOND QUERY---- >> >>Statement Name: >> null >>Statement Text: >> SELECT >> ObjectId, >> SUM(WordLocation) AS Score >>FROM tblSearchDictionary >>WHERE >> Word = 'CONTACT' OR Word = 'ADD' >>GROUP BY >> ObjectId >> >>Parse Time: 0 >>Bind Time: 0 >>Optimize Time: 0 >>Generate Time: 15 >>Compile Time: 15 >>Execute Time: 4250 >>Begin Compilation Timestamp : 2005-11-11 13:16:17.578 >>End Compilation Timestamp : 2005-11-11 13:16:17.593 >>Begin Execution Timestamp : 2005-11-11 13:16:17.593 >>End Execution Timestamp : 2005-11-11 13:16:27.437 >>Statement Execution Plan Text: >>Project-Restrict ResultSet (5): >>Number of opens = 1 >>Rows seen = 100 >>Rows filtered = 0 >>restriction = false >>projection = true >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 4250 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> restriction time (milliseconds) = 0 >> projection time (milliseconds) = 0 >> optimizer estimated row count: 1.00 >> optimizer estimated cost: 82959.49 >> >>Source result set: >> Grouped Aggregate ResultSet: >> Number of opens = 1 >> Rows input = 712 >> Has distinct aggregate = false >> In sorted order = false >> Sort information: >> Number of rows input=712 >> Number of rows output=593 >> Sort type=internal >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 4250 >> next time (milliseconds) = 0 >> close time (milliseconds) = 0 >> optimizer estimated row count: 1.00 >> optimizer estimated cost: 82959.49 >> >> Source result set: >> Project-Restrict ResultSet (4): >> Number of opens = 1 >> Rows seen = 712 >> Rows filtered = 0 >> restriction = false >> projection = true >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 4219 >> close time (milliseconds) = 15 >> restriction time (milliseconds) = 0 >> projection time (milliseconds) = 0 >> optimizer estimated row count: 19200.45 >> optimizer estimated cost: 82959.49 >> >> Source result set: >> Project-Restrict ResultSet (3): >> Number of opens = 1 >> Rows seen = 40806 >> Rows filtered = 40094 >> restriction = true >> projection = false >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 4219 >> close time (milliseconds) = 15 >> restriction time (milliseconds) = 124 >> projection time (milliseconds) = 0 >> optimizer estimated row count: 19200.45 >> optimizer estimated cost: 82959.49 >> >> Source result set: >> Table Scan ResultSet for TBLSEARCHDICTIONARY at read >>committed >>isolation level using share row locking chosen by the optimizer >> Number of opens = 1 >> Rows seen = 40806 >> Rows filtered = 0 >> Fetch Size = 1 >> constructor time (milliseconds) = 0 >> open time (milliseconds) = 0 >> next time (milliseconds) = 4001 >> close time (milliseconds) = 15 >> next time in milliseconds/row = 0 >> >> scan information: >> Bit set of columns fetched={0, 2, 3} >> Number of columns fetched=3 >> Number of pages visited=2978 >> Number of rows qualified=40806 >> Number of rows visited=256001 >> Scan type=heap >> start position: >>null stop position: >>null qualifiers: >>Column[0][0] Id: 2 >>Operator: < >>Ordered nulls: false >>Unknown return value: true >>Negate comparison result: true >>Column[0][1] Id: 2 >>Operator: <= >>Ordered nulls: false >>Unknown return value: false >>Negate comparison result: false >> >> optimizer estimated row count: 19200.45 >> optimizer estimated cost: 82959.49 >> >>---------- >> >>Thanks in advance for any help! >> >>Kind regards, >> >> >>Kevin Hore >> >> >> > >