Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Derby Discussion" <derby-user@db.apache.org>
Received-SPF: pass (nike.apache.org: local policy)
Reply-To: <msegel@segel.com>
From: <derby@segel.com>
Sender: "Michael Segel" <msegel@segel.com>
To: "'Derby Discussion'" <derby-user@db.apache.org>
Cc: <mikem_app@sbcglobal.net>
Subject: RE: URGENT!!! JDBC SQL query taking long time for large IN clause
Date: Tue, 7 Apr 2009 16:09:55 -0500
Organization: MSCC
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <49DBA3D4.8010803@sbcglobal.net>
Thread-Index: Acm3tHMc0+f9HaslQnKsfHjf0b4dAwAC1maw
Message-Id: <20090407211427.C781D5DD84@dbrack01.segel.com>


> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
> Sent: Tuesday, April 07, 2009 2:05 PM
> To: Derby Discussion
> Subject: Re: URGENT!!! JDBC SQL query taking long time for large IN clause
> 
> It is impossible to say what the performance of the query can be without
> knowing the exact values in the IN list.  But it is
> possible to get some idea assuming some worst case behavior, and from
> that I am going to guess you will never come close to 100ms with an
> uncached database, on hardware using some sort of standard disk based
> hard drive.
> 
> I do think the query may go faster with index and query tweaking, but
> against an uncached db, with non-clustered unique values in that IN list,
> it is never going to hit 100ms.  Adding up just what is posted it looks
> like this is a 1.2 gig db.
> 
Drop the unnecessary indexes and you'll see the database size shrink fast.
Also note that he's running this on a Windows XP laptop. Depending on the
model of the laptop, he will have not only CPU issues but also disk I/O
issues (a 5400 rpm IDE drive, for example).

However, it is possible for the OP to get better performance, even if 100ms
is not realistic. (BTW, where did 100ms come from? I'm sorry, but this really
sounds like a class project...)

> You posted the space for the tables and indexes.  The interesting ones
> are the big ones.  You have 5 tables or indexes over 1000 pages big.  If
> in the worst case your 1000 value IN list happens to be on 1000
> different pages then Derby is going to need to do at least 1000 i/o's to
> get to them - I usually use a back-of-envelope figure of at most 100 i/o's
> per second (even if your disk's specs claim a higher rate, this I/O is not
> going to get streamed as fast as possible by this query; it is going to ask
> for a page, process it, do some join work, then later ask for another page, ...)
> :
> > CATEGORY_MASTER                0     103  0    0  4096  0
> > SQL090406091302600             1      55  0    0  4096  0
> > SQL090406091302601             1     160  0    1  4096  0
> > SQL090406091302730             1       1  0    1  4096  0
> > OBJECT_MASTER                  0   10497  0    0  4096  0
> > SQL090406091302760             1    5340  0    1  4096  0
> > SQL090406091302761             1   16708  0  410  4096  0
> > OBJECT_CATEGORY_MAPPING        0  150794  0    0  4096  0
> > OBJECT_CATEGORY_MAPPING_INDEX  1  112177  0   57  4096  0
> 
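For what it's worth, the back-of-envelope numbers quoted above can be checked with a few lines of Java. This is only a sketch: the page counts are copied from the listing, the class and method names are made up for illustration, and I am assuming that the second column being 1 marks an index and that the fifth column is the page size in bytes.

```java
// Hypothetical helper; names are illustrative, not from any Derby API.
final class SpaceEstimate {
    // Page counts copied from the quoted listing.
    // ASSUMPTION: rows whose second column is 0 are tables, 1 are indexes.
    static final long[] TABLE_PAGES = {103, 10497, 150794};
    static final long[] INDEX_PAGES = {55, 160, 1, 5340, 16708, 112177};
    // ASSUMPTION: the 4096 column is the page size in bytes.
    static final int PAGE_SIZE = 4096;

    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static long bytes(long pages) {
        return pages * PAGE_SIZE;
    }

    // Worst case: every lookup is a separate random read.
    static double worstCaseSeconds(int randomReads, int readsPerSecond) {
        return (double) randomReads / readsPerSecond;
    }

    public static void main(String[] args) {
        long tablePages = sum(TABLE_PAGES);
        long indexPages = sum(INDEX_PAGES);
        long allPages = tablePages + indexPages;
        System.out.println("all pages:   " + allPages + " = " + bytes(allPages) + " bytes");
        System.out.println("index pages: " + indexPages + " = " + bytes(indexPages) + " bytes");
        // 1000 IN-list values on 1000 different pages at 100 i/o's per second.
        System.out.println("worst case:  " + worstCaseSeconds(1000, 100) + " s");
    }
}
```

The totals come to 295835 pages, or 1,211,740,160 bytes, which lines up with the 1.2 gig figure. Under the index assumption, 134441 of those pages (550,670,336 bytes) are indexes. And 1000 random reads at 100 per second is 10 seconds, two orders of magnitude away from 100ms.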

Mike,

I think that a lot of this information is a bit skewed. Outside of the
primary index, the indexes he created include the varchar field. I'm not sure
why he did this, unless he was under the impression that he'd only have to
hit the index and not the underlying table. While there is some potential
merit to this, I think that there are things he can do to improve performance
(hence my post about reworking the query itself and using a temp table).
Drop those indexes and you'll see a big change in database size.

> There was work done in 10.3 on IN-LISTS, making them perform more like
> unions.  See DERBY-47.  So if you have a choice of releases I would
> suggest you move to 10.4 and post query plan and results against that.
> The basic idea of that change was to allow the
> system to do 1 probe into an index for each value in the IN-LIST.  Before
> this change Derby could only sort the values in the IN list and then
> limit an index scan to the lowest and highest values in the IN list.
> So for instance for OBJECT_CATEGORY_MAPPING_INDEX, worst case it might
> have to scan 112177 pages to find the 1000 rows, whereas worst case for
> probing would be 1000 pages (plus btree parent index pages, but those
> are much more likely cached).  The problem is that there is definitely
> overhead for probing one at a time; scans go much faster - so there is
> a crossover point - i.e. I would guess it would likely be better to scan
> all 112177 pages than do 100,000 probes.
> 
I believe that it was already recommended that he do just that.
There are two ways he could use the temp table. As a sub-select statement,
or as part of the table join.
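
To make the two temp-table variants concrete, here is a rough JDBC sketch. Treat it as an illustration only: the temp table name SESSION.external_ids is made up, I am assuming object_id is an INT, and the other table and column names are taken from the query quoted further down. Note that the join variant counts an id once per row in the temp table, so duplicate ids inflate the counts, while the sub-select variant behaves like the IN list.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch; class and temp table names are illustrative.
final class TempTableSketch {
    // Derby temp tables live in the SESSION schema and must be NOT LOGGED.
    // ASSUMPTION: object_id is an INT column.
    static final String DECLARE_TEMP =
        "DECLARE GLOBAL TEMPORARY TABLE SESSION.external_ids "
        + "(object_id INT NOT NULL) ON COMMIT PRESERVE ROWS NOT LOGGED";

    static final String INSERT_ID =
        "INSERT INTO SESSION.external_ids VALUES (?)";

    // Variant 1: the temp table as part of the table join.
    static final String JOIN_QUERY =
        "SELECT cm.category_name, COUNT(cm.category_name) AS category_count "
        + "FROM SESSION.external_ids e "
        + "JOIN object_master om ON om.object_id = e.object_id "
        + "JOIN object_category_mapping ocm ON ocm.object_id = om.object_id "
        + "JOIN category_master cm ON cm.category_id = ocm.category_id "
        + "GROUP BY cm.category_name ORDER BY category_count DESC";

    // Variant 2: the temp table as a sub-select replacing the literal IN list.
    static final String SUBSELECT_QUERY =
        "SELECT cm.category_name, COUNT(cm.category_name) AS category_count "
        + "FROM object_master om "
        + "JOIN object_category_mapping ocm ON ocm.object_id = om.object_id "
        + "JOIN category_master cm ON cm.category_id = ocm.category_id "
        + "WHERE om.object_id IN (SELECT object_id FROM SESSION.external_ids) "
        + "GROUP BY cm.category_name ORDER BY category_count DESC";

    // Declare the temp table (once per connection) and batch-load the ids.
    static void loadIds(Connection conn, int[] ids) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(DECLARE_TEMP);
        }
        try (PreparedStatement ps = conn.prepareStatement(INSERT_ID)) {
            for (int id : ids) {
                ps.setInt(1, id);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    // The query text never changes, so it is short and cheap to prepare.
    static void printCounts(Connection conn, String query) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(query);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
            }
        }
    }
}
```

Either way the SQL stays a fixed, short string no matter how many ids arrive, which should also avoid preparing a statement with a thousand literal values in it.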

I think this would bypass the whole use of the IN list. I'm still not 100%
sure why there's 100+ values coming from an outside source. Based on his
query below it looks like the object_ids in the IN clause are not unique...
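
If the ids really do arrive with repeats, and if the repeats are not meant to be counted more than once (that is an assumption on my part, I don't know the requirement), they could be collapsed on the Java side before they ever reach the database. A minimal sketch, with a made-up class name:

```java
import java.util.Arrays;

// Hypothetical helper; not part of the OP's code.
final class ExternalIds {
    // Drop repeated ids and sort what is left, so each id is looked up once.
    static int[] distinctSorted(int[] ids) {
        return Arrays.stream(ids).distinct().sorted().toArray();
    }

    public static void main(String[] args) {
        // A few values in the style of the quoted query.
        int[] raw = {1001, 1001, 1001, 1001, 1002, 1001, 9999};
        System.out.println(Arrays.toString(distinctSorted(raw)));
    }
}
```

For the sample values this leaves three ids (1001, 1002, 9999) out of seven.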

It's kind of hard trying to help someone when you don't know the whole
problem....

-Mike

> arindam.bhattacharjee wrote:
> > Hello Knut,
> >
> > Thanks for your quick response. This is a sample database which I have
> > created just for testing out the performance and has been written to only
> > once in one go. I tried temp tables but that is just too slow. The IN clause
> > has values which come from another source and I can't modify that.
> >
> > However, I will try out what you state below. But still, I wanted to get
> > your pulse about whether Derby can respond in sub 100 millisec time with the
> > table sizes you see above?
> >
> > I find that:
> >
> > select category_master.category_name, count(category_master.category_name)
> > as category_count
> > from
> > 	(
> > 		select internal.object_id
> > 		from
> > 		(
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1002) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all
> > 			values(1001) union all .......
> > 			values(9999)
> > 		) as internal(object_id)
> >
> > 	) as external_ids,
> > 	object_master,
> > 	category_master,
> > 	object_category_mapping
> > where
> > 	external_ids.object_id = object_master.object_id and
> > 	external_ids.object_id = object_category_mapping.object_id and
> > 	object_master.object_id = object_category_mapping.object_id and
> > 	category_master.category_id = object_category_mapping.category_id
> > group by
> > 	category_master.category_name
> > order by
> > 	category_count desc
> >
> > is much faster; unfortunately connection.prepareStatement() is taking way too
> > much memory (both stack and heap - I have a constraint of 256 MB MAX memory
> > for my JVM), which goes beyond my application's resources. Is there a way I
> > can precompile some SQLs which are very expensive to parse during execution?
> >
> > Best regards,
> >
> > Arindam.
> >
> >
> > Knut Anders Hatlen wrote:
> >> "arindam.bhattacharjee" <mr.arindam.bhattacharjee@gmail.com> writes:
> >>
> >>> Hello,
> >>>
> >>> I would like my query below to return within 100 millisecs. Please help
> >>> me. The values for the IN clause come from outside, hence I cannot really
> >>> change the IN clause to a join on an existing table.
> >> Hi Arindam,
> >>
> >> Does the query run faster if you compress all the tables involved, or if
> >> you drop and recreate all the indexes? If so, it is likely that the
> >> index cardinality statistics are out of date, which may make the
> >> optimizer pick a bad execution plan. Currently, index cardinality
> >> statistics are only updated at index creation time, when tables are
> >> compressed, and when columns are dropped. A more automatic solution is
> >> being worked on. For more details, see:
> >>
> >> https://issues.apache.org/jira/browse/DERBY-269
> >> https://issues.apache.org/jira/browse/DERBY-3788
> >> http://db.apache.org/derby/docs/10.4/tuning/ctunstats18908.html
> >>
> >> You may be experiencing some other problem, but this is a problem that
> >> keeps coming up, so I think it's worth checking.
> >>
> >> Hope this helps,
> >>
> >> --
> >> Knut Anders
> >>
> >>
> >
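
Knut's suggestion quoted above (compress the tables so the index cardinality statistics get refreshed) could be tried from JDBC along these lines. This is a sketch under assumptions: the table names are the three from the posted listing, the class name is made up, and the schema has to be passed in by the caller (APP is Derby's default when no user is given, but I don't know what the OP uses).

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical sketch; class name is illustrative.
final class RefreshStatsSketch {
    // Derby system procedure: (schema name, table name, sequential flag).
    static final String COMPRESS_CALL =
        "CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE(?, ?, ?)";

    // Table names taken from the space listing posted in this thread.
    static final String[] TABLES = {
        "CATEGORY_MASTER", "OBJECT_MASTER", "OBJECT_CATEGORY_MAPPING"
    };

    // Compress each table in turn; the caller supplies the schema name.
    static void compressAll(Connection conn, String schema) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(COMPRESS_CALL)) {
            for (String table : TABLES) {
                cs.setString(1, schema);
                cs.setString(2, table);
                // Non-zero = sequential mode: slower, but uses less memory and disk.
                cs.setShort(3, (short) 1);
                cs.execute();
            }
        }
    }
}
```

Timing the query before and after a call such as compressAll(conn, "APP") would show whether stale statistics are part of the problem.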