db-derby-user mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: URGENT!!! JDBC SQL query taking long time for large IN clause
Date Wed, 08 Apr 2009 16:49:44 GMT
Arindam Bhattacharjee wrote:
> Few inputs/observations:
> 
> a) This is not a school project. This is a project which makes money for 
> us today. We have a functioning product running today, developed by us, 
> which doesn't use any RDBMS to achieve what I am trying to achieve 
> with the tables. For a 200,000-entry object master we get 45-60 millis 
> performance. The reason we are trying to use an RDBMS is to ensure that 
> we can scale even higher and that updates are easier and more maintainable. 
> Hence all these questions.
> 
> b) I tried removing the "preserve rows" from the declare temp table 
> clause - and that didn't help me either.
> 
> c) The database which I have created is just a sample database which 
> vaguely represents the load we will see if we implement our existing 
> feature using the Derby RDBMS. It doesn't contain REAL data.
> 
For you and others, one way you could help the Derby community help you
is to post complete test cases if possible.  Issues like this get more
attention when complete programs are submitted that build the db, load
fake data and include the exact problem queries.  Put it in the form of
a JUnit test and it may end up in the nightly suite; then it is likely
that no future release of Derby will break your query.
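
Something like the following untested sketch could serve as a starting
point.  It assumes the table and column names used elsewhere in this
thread (object_master(object_id, object_name), etc.), a throwaway
embedded database named reprodb, and Derby plus JUnit on the classpath;
scale ROWS up and fill in the remaining tables to match the real load:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import junit.framework.TestCase;

public class InListRepro extends TestCase {

    public void testInListQuery() throws Exception {
        // Load the embedded driver explicitly (not needed on Java 6 / JDBC 4).
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
        // ";create=true" builds the database on first use.
        Connection conn =
            DriverManager.getConnection("jdbc:derby:reprodb;create=true");
        Statement s = conn.createStatement();

        // Schema adapted loosely from this thread; replace with the real DDL,
        // and create the other tables the same way.
        s.executeUpdate("CREATE TABLE object_master "
                + "(object_id INT PRIMARY KEY, object_name VARCHAR(100))");

        // Load fake data in one batch per table.
        final int ROWS = 10000;   // bump up to reproduce the real load
        conn.setAutoCommit(false);
        PreparedStatement ins =
            conn.prepareStatement("INSERT INTO object_master VALUES (?, ?)");
        for (int i = 0; i < ROWS; i++) {
            ins.setInt(1, i);
            ins.setString(2, "object-" + i);
            ins.addBatch();
        }
        ins.executeBatch();
        conn.commit();

        // Run the problem query here and report how long it takes;
        // a trivial IN query stands in for it in this sketch.
        long start = System.currentTimeMillis();
        ResultSet rs = s.executeQuery(
            "SELECT COUNT(*) FROM object_master WHERE object_id IN (1, 2, 3)");
        while (rs.next()) { /* drain the result */ }
        System.out.println("query took "
            + (System.currentTimeMillis() - start) + " ms");
        conn.close();
    }
}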

I am glad the performance increased, but the compress issue is still
a mystery.  I guess it could be a different query plan - posting a 10.4
query plan would be nice.  If in your initial load you created the
indexes before the inserts, then on average the index trees are likely
to have half-filled leaf pages.  Compress and create index, on the
other hand, fill the leaf pages completely (the downside is that this
means more splits when inserts come again).  Posting another space
table result before and after the compress would be interesting.
Compressing also has the effect of filling up the operating system I/O
cache with the database if you have enough memory on your machine -
Derby reads through this cache, so it can eliminate a lot of real I/O
that Derby would have to do if it were running on a cold machine, i.e.
after a reboot.
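
For reference, a minimal sketch of doing the compress from JDBC.
SYSCS_UTIL.SYSCS_COMPRESS_TABLE rebuilds the table and all of its
indexes (and refreshes the index cardinality statistics as a side
effect); the APP schema and the table names are assumptions taken from
this thread:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class CompressAll {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:derby:reprodb");

        CallableStatement cs =
            conn.prepareCall("CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE(?, ?, ?)");
        for (String table : new String[] {
                "CATEGORY_MASTER", "OBJECT_MASTER", "OBJECT_CATEGORY_MAPPING" }) {
            cs.setString(1, "APP");       // schema name (assumed default)
            cs.setString(2, table);
            cs.setShort(3, (short) 1);    // non-zero = sequential rebuild,
                                          // slower but needs less working space
            cs.execute();
        }
        cs.close();
        conn.close();
    }
}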

> d) The entire database was created in one shot, as I have stated 
> earlier, and the records are populated in one go. That was the actual 
> problem - thanks to Knut for pointing that out as a possibility. When I 
> compress the tables, the queries start performing a LOT faster! 1 
> million rows in object master, 20 million in object category mapping, 10 K 
> in category master and 1000 values in the IN clause gives me 1100 millis now 
> - down from 30 seconds. That probably explains why the object master 
> index scans were taking nearly 30 seconds in the runtimestats which I 
> posted. That is slow, but still manageable, since the 100 and 500 cases are 
> much faster. However, it is still a lot slower than the other databases we 
> are testing. Since Derby is embedded we expected it to be much faster, 
> since no IPC is required between the client driver and the network server.
> 
I agree in general that the suggested temp join table better fits the
SQL model for thousands of IN terms.  Note that you may get better
performance if you sort the data before you insert it into the temp
table - the join may get better cache clustering when probing indexes
on the same key in the other tables.
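
A minimal sketch of that, assuming the ids arrive as an int[] and the
join table is the DUMMY (ID INTEGER PRIMARY KEY) table mentioned in
point (e) below:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;

public class SortedIdLoader {
    /** Batch-insert the IN-list values into the join table in sorted order. */
    static void loadIds(Connection conn, int[] ids) throws SQLException {
        int[] sorted = ids.clone();
        Arrays.sort(sorted);                 // sort before inserting

        conn.setAutoCommit(false);           // one commit for the whole batch
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO DUMMY (ID) VALUES (?)");
        for (int i = 0; i < sorted.length; i++) {
            ps.setInt(1, sorted[i]);
            ps.addBatch();
        }
        ps.executeBatch();
        ps.close();
        conn.commit();
    }
}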

The temp table performance is a mystery.  Again, you could help out the
Derby community by filing a JIRA issue with the exact code you are
using to get this performance.  There is no reason for inserts into a
temp table to be slower than into a regular table.  Under the covers,
temp tables are the same as regular tables, just with some of the
synchronous write properties turned off and some special handling of
the DDL.
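
For anyone comparing, a rough sketch of the declared temp table
variant.  The join query is adapted from the one quoted further down in
this thread, and the temp table is referenced through the SESSION
schema:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempTableJoin {
    static void run(Connection conn) throws Exception {
        Statement s = conn.createStatement();

        // Declared temp table: same storage as a regular table, but not
        // logged and private to this connection.
        s.executeUpdate(
            "DECLARE GLOBAL TEMPORARY TABLE temp_ids (id INT NOT NULL) "
            + "ON COMMIT PRESERVE ROWS NOT LOGGED");

        // ... batch-insert the (sorted) IN-list values into SESSION.temp_ids ...

        // Join against the temp table instead of a 1000-term IN list.
        ResultSet rs = s.executeQuery(
            "SELECT cm.category_name, COUNT(cm.category_name) AS category_count "
            + "FROM SESSION.temp_ids t "
            + "JOIN object_category_mapping ocm ON ocm.object_id = t.id "
            + "JOIN category_master cm ON cm.category_id = ocm.category_id "
            + "GROUP BY cm.category_name "
            + "ORDER BY category_count DESC");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        rs.close();
        s.close();
    }
}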

> e) IN doesn't work properly beyond 1000 object ids, so we used a 
> regular table: CREATE TABLE DUMMY (ID INTEGER PRIMARY KEY). Insertion of 
> 5000 records into this took 100 millis (after compression of the 
> tables) and the query took 4 seconds. With executeUpdate() or execute() 
> (as suggested) insertion was at least 200% slower, so executeBatch() 
> works better for the inserts into the table for the payloads we 
> need. The temp table query took 10 MINS to execute (this time I waited 
> patiently for it to finish - there was nothing wrong there - the query 
> just takes too much time and I didn't have the patience to wait for it 
> the first time :) ), while the regular table with primary key took just 
> 4 seconds.
> 
It is good to hear 10.4 is better at this.  I know work was done to
reduce the compile-time memory usage for really large queries - I don't
remember exactly which release, but definitely after 10.1.  Again, I
would suggest moving to 10.4 for your application.  Any problems found
are more likely to be fixed and released in 10.4 than in 10.1.

> f) The stack overflow exception for "select ... values ... union" was much 
> less of a problem in 10.4 compared to 10.1, since the memory usage tapers 
> off at the -mx<Size In Megs> limit for the JVM. But since that form cannot 
> be parameterized, it isn't of much use to us. IN-clause-based queries are 
> getting parsed with much lower memory footprints.
> 
> Best regards,
> 
> Arindam.
> 
> On Wed, Apr 8, 2009 at 2:39 AM, <derby@segel.com> wrote:
> 
> 
>      > -----Original Message-----
>      > From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
>      > Sent: Tuesday, April 07, 2009 2:05 PM
>      > To: Derby Discussion
>      > Subject: Re: URGENT!!! JDBC SQL query taking long time for large
>     IN clause
>      >
>      > It is impossible to say what the performance of the query can be
>      > without knowing the exact values in the IN list.  But it is
>      > possible to get some idea assuming some worst case behavior, and from
>      > that I am going to guess you will never come close to 100ms with an
>      > uncached database, on hardware using some sort of standard disk-based
>      > hard drive.
>      >
>      > I do think the query may go faster with index and query tweaking, but
>      > 100ms to an uncached db and non-clustered unique values in that IN list
>      > is never going to go that fast.  Adding up just what is posted, it
>      > looks like this is a 1.2 gig db.
>      >
>     Drop the unnecessary indexes and you'll see the database size shrink
>     fast. Also note that he's running this on a Windows XP laptop. Depending
>     on the model of the laptop, you will have not only CPU issues but also
>     disk I/O issues. (5400 rpm IDE as an example....)
> 
>     However, it is possible for the OP to get better performance, if not
>     realistically 100ms performance. (BTW, where did 100ms come from? I'm
>     sorry, but this really sounds like a class project...)
> 
>      > You posted the space for the tables and indexes.  The interesting
>      > ones are the big ones.  You have 5 tables or indexes over 1000 pages
>      > big.  If in the worst case your 1000-value IN list happens to be on
>      > 1000 different pages, then Derby is going to need to do at least
>      > 1000 I/Os to get to them - I usually use a back-of-envelope maximum
>      > of 100 I/Os per second (even if your disk has specs that say a higher
>      > rate, this I/O is not going to get streamed as fast as possible by
>      > this query; it is going to ask for a page, process it, do some join
>      > work, then later ask for another page, ...):
>      > > CATEGORY_MASTER            0    103    0    0    4096    0
>      > > SQL090406091302600        1    55    0    0    4096    0
>      > > SQL090406091302601        1    160    0    1    4096    0
>      > > SQL090406091302730        1    1    0    1    4096    0
>      > > OBJECT_MASTER            0    10497    0    0    4096    0
>      > > SQL090406091302760        1    5340    0    1    4096    0
>      > > SQL090406091302761        1    16708    0    410    4096    0
>      > > OBJECT_CATEGORY_MAPPING        0    150794    0    0    4096    0
>      > > OBJECT_CATEGORY_MAPPING_INDEX    1    112177    0    57    4096    0
>      >
> 
>     Mike,
> 
>     I think that a lot of this information is a bit skewed. Outside of the
>     primary index, the indexes he created included the varchar field. I'm
>     not sure why he did this, except under the impression that he'd only
>     have to hit the index and not the underlying table. While there is some
>     potential merit to this, I think there are things he can do to improve
>     performance.
>     (Hence my post about reworking the query itself and using a temp table.)
>     Drop those indexes and you'll see a big change in database size.
> 
>      > There was work done in 10.3 on IN-LISTS, making them perform more
>      > like unions; see DERBY-47.  So if you have a choice of releases, I
>      > would suggest you move to 10.4 and post a query plan and results
>      > against that.  The basic idea of that change was to allow the system
>      > to do one probe into an index for each value in the IN-LIST; before
>      > this change Derby could only sort the values in the IN list and then
>      > limit an index scan to the lowest and highest values in the IN list.
>      > So for instance for OBJECT_CATEGORY_MAPPING_INDEX, worst case it
>      > might have to scan 112177 pages to find the 1000 rows, where worst
>      > case for probing would be 1000 pages (plus btree parent index pages,
>      > but those are much more likely cached).  The problem is that there is
>      > definitely overhead for probing one at a time; scans go much faster,
>      > so there is a crossover point - i.e. I would guess it would likely be
>      > better to scan all 112177 pages than do 100,000 probes.
>      >
>     I believe that it was already recommended that he do just that.
>     There are two ways he could use the temp table: as a sub-select
>     statement, or as part of the table join.
> 
>     I think this would bypass the whole use of the IN list. I'm still
>     not 100% sure why there are 100+ values coming from an outside source.
>     Based on his query below, it looks like the object_ids in the IN clause
>     are not unique...
> 
>     It's kind of hard trying to help someone when you don't know the whole
>     problem....
> 
>     -Mike
> 
>      > arindam.bhattacharjee wrote:
>      > > Hello Knut,
>      > >
>      > > Thanks for your quick response. This is a sample database which
>      > > I have created just for testing out the performance, and it has
>      > > been written to only once, in one go. I tried temp tables but that
>      > > is just too slow. The IN clause has values which come from another
>      > > source and I can't modify that.
>      > >
>      > > However, I will try out what you state below. But still, I wanted
>      > > to get your sense of whether Derby can respond in sub-100-millisec
>      > > time with the table sizes you see above?
>      > >
>      > > I find that:
>      > >
>      > > select category_master.category_name,
>      > count(category_master.category_name)
>      > > as category_count
>      > > from
>      > >     (
>      > >             select internal.object_id
>      > >             from
>      > >             (
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1002) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all
>      > >                     values(1001) union all .......
>      > >                     values(9999)
>      > >             ) as internal(object_id)
>      > >
>      > >     ) as external_ids,
>      > >     object_master,
>      > >     category_master,
>      > >     object_category_mapping
>      > > where
>      > >     external_ids.object_id = object_master.object_id and
>      > >     external_ids.object_id = object_category_mapping.object_id and
>      > >     object_master.object_id = object_category_mapping.object_id and
>      > >     category_master.category_id =
>     object_category_mapping.category_id
>      > > group by
>      > >     category_master.category_name
>      > > order by
>      > >     category_count desc
>      > >
>      > > is much faster. Unfortunately, connection.prepareStatement() is
>      > > taking way too much memory (both stack and heap - I have a
>      > > constraint of 256 MB max memory for my JVM), which goes beyond my
>      > > application's resources. Is there a way I can precompile some SQL
>      > > statements which are very expensive to parse during execution?
>      > >
>      > > Best regards,
>      > >
>      > > Arindam.
>      > >
>      > >
>      > > Knut Anders Hatlen wrote:
>      > >> "arindam.bhattacharjee" <mr.arindam.bhattacharjee@gmail.com
>     <mailto:mr.arindam.bhattacharjee@gmail.com>> writes:
>      > >>
>      > >>> Hello,
>      > >>>
>      > >>> I would like my query below to return within 100 millisecs.
>      > >>> Please help me; the values for the IN clause come from outside,
>      > >>> hence I cannot really change the IN clause to a join on an
>      > >>> existing table.
>      > >> Hi Arindam,
>      > >>
>      > >> Does the query run faster if you compress all the tables involved,
>      > >> or if you drop and recreate all the indexes? If so, it is likely
>      > >> that the index cardinality statistics are out of date, which may
>      > >> make the optimizer pick a bad execution plan. Currently, index
>      > >> cardinality statistics are only updated at index creation time,
>      > >> when tables are compressed, and when columns are dropped. A more
>      > >> automatic solution is being worked on. For more details, see:
>      > >>
>      > >> https://issues.apache.org/jira/browse/DERBY-269
>      > >> https://issues.apache.org/jira/browse/DERBY-3788
>      > >> http://db.apache.org/derby/docs/10.4/tuning/ctunstats18908.html
>      > >>
>      > >> You may be experiencing some other problem, but this is a problem
>      > >> that keeps coming up, so I think it's worth checking.
>      > >>
>      > >> Hope this helps,
>      > >>
>      > >> --
>      > >> Knut Anders
>      > >>
>      > >>
>      > >
> 
> 
> 
> 

