It sounded like your standard app tends to load a bunch of data and then
deliver those dbs for read-only access.  Again this is a minor tweak, but
you may see some improvement if you can sort the data by primary key for
insert into the base table.  This will give you a one-time clustering that would improve index scans that require an index-to-base-table lookup.
Thinking about this, I wonder if we should add an option to compress table to do this automatically?  It would be very easy to do (i.e. a one-time cluster - a continual cluster would be a lot more work). All that would have to happen is, at compress time, throw all the rows into the sorter and then insert the result on the other side.
[Arindam] This is a great suggestion. Our DB is dynamic; however, only one modification owner can write to it. The rest will see a copy. We can just compress the master DB before we replicate it to the copies. By the way, I tested on MySQL - it processes 1000 records in the IN clause for the same DB in 16 millis. I have increased the query page size to 16 MB. I will send the execution plans as I get them.
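A minimal sketch of that compress step (assuming the default APP schema, the embedded driver, and the table names used in this thread):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class CompressBeforeReplicate {
        public static void main(String[] args) throws Exception {
            // Open the writable master database with the embedded driver.
            Connection conn = DriverManager.getConnection("jdbc:derby:masterdb");

            // SYSCS_COMPRESS_TABLE rebuilds the table and its indexes and
            // refreshes the index cardinality statistics; the third argument
            // (1) asks for a sequential, lower-memory compress.
            CallableStatement cs = conn.prepareCall(
                    "CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE(?, ?, ?)");
            String[] tables = { "OBJECT_MASTER", "OBJECT_CATEGORY_MAPPING",
                                "CATEGORY_MASTER" };
            for (String table : tables) {
                cs.setString(1, "APP");      // schema
                cs.setString(2, table);      // table name
                cs.setShort(3, (short) 1);   // sequential compress
                cs.execute();
            }
            cs.close();
            conn.close();
            // ...then shut the database down and copy its directory to the
            // read-only replicas.
        }
    }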

On Wed, Apr 8, 2009 at 10:31 PM, Mike Matrigali <mikem_app@sbcglobal.net> wrote:
Arindam Bhattacharjee wrote:
Thanks for the excellent summary Mike. The IN clause will have a unique list of object ids. If we can return 100 object ids in good time, we are okay. The users will be given an option to increase this to a maximum of 1000 - however, they will be warned that performance will be slower. The real power of Derby, for us, is the flexibility to manage the life cycle of the database files without support from DB admins. We want to make DB management completely hidden from our customers. However, SolidDB and TimesTen are good options which we could investigate, but Derby has the advantage of being used within our organization already. In general we liked its memory footprint - except for one case of SQL parsing, it was mostly well behaved.

Question: will changing the page size for the tables improve query speed? We will always run one query through multiple threads on Derby, and we will not use it for any other purpose. We are okay with being a little slow on insertions, but retrieval should be as quick as possible. The database will be created and replicated, and the replicated databases will be read-only.
I am not sure about page size, but it seems reasonable to try upping them all to 32k.  I usually leave page size to the very end of tuning; best is just to run the experiment.  You definitely have enough rows that
you won't be wasting any space going to bigger pages.  The tradeoff is
that i/o will be better but processing bigger pages requires more cpu
for things like binary searches on the btree pages.  But bigger pages may also shrink the height of the btree, so again it is a tradeoff.
Another downside is that the memory footprint of Derby will increase, as Derby currently sizes its cache by number of pages: where you may
have 1000 4k pages cached today (the default cache size is 1000 pages), you might now get 1000 32k pages.
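If you run that experiment, a sketch like the following should be all it takes (the derby.storage.* properties are the stock ones; the table definition is made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PageSizeExperiment {
        public static void main(String[] args) throws Exception {
            // derby.storage.pageSize only affects tables/indexes created
            // after it is set, so set it before creating and loading.
            System.setProperty("derby.storage.pageSize", "32768");
            // The page cache is sized in pages, not bytes: 1000 (the
            // default) 32k pages is ~32 MB vs ~4 MB with 4k pages, so
            // lower the count if memory is tight.
            System.setProperty("derby.storage.pageCacheSize", "1000");

            Connection conn =
                DriverManager.getConnection("jdbc:derby:testdb;create=true");
            Statement s = conn.createStatement();
            s.executeUpdate("CREATE TABLE OBJECT_MASTER ("
                    + "OBJECT_ID INT PRIMARY KEY, "
                    + "OBJECT_NAME VARCHAR(255))");
            s.close();
            conn.close();
        }
    }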

It sounded like your standard app tends to load a bunch of data and then
deliver those dbs for read-only access.  Again this is a minor tweak, but
you may see some improvement if you can sort the data by primary key for
insert into the base table.  This will give you a one-time clustering that would improve index scans that require an index-to-base-table lookup.
Thinking about this, I wonder if we should add an option to compress table to do this automatically?  It would be very easy to do (i.e. a one-time cluster - a continual cluster would be a lot more work). All that would have to happen is, at compress time, throw all the rows into the sorter and then insert the result on the other side.
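A rough sketch of the sorted load (the Row holder and column names are made up for illustration):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Comparator;
    import java.util.List;

    public class SortedLoad {
        // Made-up row holder for illustration.
        static class Row {
            int objectId;
            String objectName;
        }

        // Inserting in primary-key order gives the base table a one-time
        // clustering, so later index-to-base-table lookups touch fewer,
        // better-localized pages.
        static void load(Connection conn, List<Row> rows) throws Exception {
            rows.sort(Comparator.comparingInt(r -> r.objectId));
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO OBJECT_MASTER (OBJECT_ID, OBJECT_NAME) VALUES (?, ?)");
            for (Row r : rows) {
                ps.setInt(1, r.objectId);
                ps.setString(2, r.objectName);
                ps.addBatch();
            }
            ps.executeBatch();
            ps.close();
        }
    }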


Best regards,

Arindam.


On Wed, Apr 8, 2009 at 6:23 PM, <derby@segel.com> wrote:

   
   Arindam,

   
   Ok,

   
   Just a few things…

   
   1)       Derby is a free, open-source product. Warts and all, it's
   not going to be your best choice for a commercial RDBMS product.
   (Sorry Derby fans, you get what you pay for.)

   2)       If you want speed, you're going to have to consider IBM's
   SolidDB, an in-memory database, or Oracle's TimesTen database. These
   will give you the speed that you want.

   3)       You’re running on a laptop where your disk drive could be a
   5400 rpm IDE drive. Laptop drive == low energy consumption and low
   performance. Disk based solutions will be much slower on a laptop
   than on a ‘comparable’ desktop and/or server.

   4)       I'm not sure why you had to load and then compress your
   tables. I could understand having to update the statistics if you
   created your indexes prior to loading the data.

   5)       Since we don't know what you're actually trying to do,
   there could be a better or more efficient design. Having an IN
   clause with thousands of entries is definitely not a good design for
   any database.

   6)       You never did answer the question about the object ids that
   were in the IN clause. Were they unique or were there duplicates?

   
   
   HTH

   
   -Mike

   ------------------------------------------------------------------------

   *From:* Arindam Bhattacharjee [mailto:mr.arindam.bhattacharjee@gmail.com]
   *Sent:* Wednesday, April 08, 2009 2:14 AM

   *To:* Derby Discussion; msegel@segel.com

   *Subject:* Re: URGENT!!! JDBC SQL query taking long time for large IN clause

   
   Few inputs/observations:

   a) This is not a school project. This is a project which makes money
   for us today. We have a functioning product running today, developed
   by us, which doesn't use any RDBMS to achieve what I am trying to
   achieve with the tables. For a 200,000 object master we get 45 - 60
   millis performance. The reason we are trying to use an RDBMS is to
   ensure that we can scale even higher and that updates are easier and
   more maintainable. Hence all these questions.

   b) I tried removing the "preserve rows" from the declare temp table
   clause - and that didn't help me either.

   c) The database which I have created is just a sample database which
   vaguely represents the load we will see if we implement our existing
   feature using the Derby RDBMS. It doesn't contain REAL data.

   d) The entire database was created in one shot, as I have stated
   earlier, and records are populated in one go. That was the actual
   problem - thanks to Knut for indicating that as a possibility. When
   I compress the tables, the queries start performing a LOT faster! 1
   million in object master, 20 million in object category mapping, 10 K
   in category master and 1000 in the IN clause gives me 1100 millis
   now - down from 30 seconds. That probably explains why the object
   master index scans were taking nearly 30 seconds in the runtimestats
   which I posted. This is slow but still manageable, since the 100 and
   500 cases are much faster. However, it is still a lot slower than the
   other databases we are testing. Since Derby is embedded, we expected
   it to be much faster, since no IPC is required between the client
   driver and the network server.

   e) IN doesn't function properly beyond 1000 object ids, so we used
   a regular table: CREATE TABLE DUMMY (ID INTEGER PRIMARY KEY).
   Insertion of 5000 records into this happened in 100 millis (after
   compression of the tables) and the query took 4 seconds. With
   executeUpdate() or execute() (as suggested) insertion was at least
   200% slower, so executeBatch() works better for the inserts into the
   table for the payloads required by us. The temp table query took
   10 MINS to execute (this time I waited patiently for it to finish -
   there was nothing wrong there, the query just takes too much time,
   and I didn't have the patience to wait for it to finish the first
   time :) ), while the regular table with a primary key took just 4
   seconds.
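   For reference, that batch-insert-plus-join shape looks roughly like
   this (a sketch; column names other than DUMMY.ID follow the sample
   schema in this thread):

   import java.sql.Connection;
   import java.sql.PreparedStatement;
   import java.sql.ResultSet;
   import java.sql.Statement;

   public class DummyTableJoin {
       static void categoryCounts(Connection conn, int[] objectIds)
               throws Exception {
           conn.setAutoCommit(false);

           // Refill the id table in a single batch, as described above.
           Statement del = conn.createStatement();
           del.executeUpdate("DELETE FROM DUMMY");
           del.close();
           PreparedStatement ins =
               conn.prepareStatement("INSERT INTO DUMMY (ID) VALUES (?)");
           for (int id : objectIds) {
               ins.setInt(1, id);
               ins.addBatch();
           }
           ins.executeBatch();
           ins.close();

           // Join against the id table instead of using a long IN list.
           PreparedStatement q = conn.prepareStatement(
               "SELECT cm.CATEGORY_NAME,"
               + " COUNT(cm.CATEGORY_NAME) AS CATEGORY_COUNT"
               + " FROM DUMMY d"
               + " JOIN OBJECT_CATEGORY_MAPPING ocm ON ocm.OBJECT_ID = d.ID"
               + " JOIN CATEGORY_MASTER cm ON cm.CATEGORY_ID = ocm.CATEGORY_ID"
               + " GROUP BY cm.CATEGORY_NAME"
               + " ORDER BY CATEGORY_COUNT DESC");
           ResultSet rs = q.executeQuery();
           while (rs.next()) {
               System.out.println(rs.getString(1) + " " + rs.getInt(2));
           }
           rs.close();
           q.close();
           conn.commit();
       }
   }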

   f) The stack overflow exception for "select ... values...union" was
   much less of a problem in 10.4 compared to 10.1, since memory usage
   tapers off at the JVM's -mx<Size In Megs> limit. But since it cannot
   be parameterized, it isn't of much use to us. IN clause based queries
   are parsed with much lower memory footprints.
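   One untested idea for getting a reusable, precompiled plan out of the
   IN form: prepare a fixed-width IN list once and pad short batches by
   repeating the last id (a sketch; the SELECT list is illustrative):

   import java.sql.Connection;
   import java.sql.PreparedStatement;

   public class FixedWidthInList {
       static final int WIDTH = 1000;

       // Build "... IN (?,?,...,?)" once; reusing the PreparedStatement
       // means the SQL is parsed a single time instead of per request.
       static PreparedStatement prepare(Connection conn) throws Exception {
           StringBuilder sql = new StringBuilder(
               "SELECT OBJECT_ID FROM OBJECT_MASTER WHERE OBJECT_ID IN (?");
           for (int i = 1; i < WIDTH; i++) {
               sql.append(",?");
           }
           sql.append(")");
           return conn.prepareStatement(sql.toString());
       }

       // Pad short batches by repeating the last real id; duplicates do
       // not change the meaning of IN.
       static void bind(PreparedStatement ps, int[] ids) throws Exception {
           for (int i = 0; i < WIDTH; i++) {
               ps.setInt(i + 1, ids[Math.min(i, ids.length - 1)]);
           }
       }
   }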

   Best regards,

   Arindam.

   On Wed, Apr 8, 2009 at 2:39 AM, <derby@segel.com> wrote:


   >  -----Original Message-----
   >  From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
   >  Sent: Tuesday, April 07, 2009 2:05 PM
   >  To: Derby Discussion

   >  Subject: Re: URGENT!!! JDBC SQL query taking long time for large IN clause
   >

   >  It is impossible to say what the performance of the query can be
   >  without knowing the exact values in the IN LIST.  But it is
   >  possible to get some idea assuming some worst case behavior, and from
   >  that I am going to guess you will never come close to 100ms with an
   >  uncached database, on hardware using some sort of standard disk based
   >  hard drive.
   >
   >  I do think the query may go faster with index and query tweaking, but
   >  100ms to an uncached db and non-clustered unique values in that IN list
   >  is never going to go that fast.  Adding up just what is posted, it
   >  looks like this is a 1.2 gig db.
   >

   Drop the unnecessary indexes and you'll see the database size shrink
   fast. Also note that he's running this on a Windows XP laptop. Depending
   on the model of the laptop, you will have not only CPU issues but also
   disk i/o issues as well. (5400 rpm IDE as an example....)

   However, it is possible for the OP to get better performance, if not
   realistically 100ms performance. (BTW where did 100ms come from? I'm
   sorry but this really sounds like a class project...)


   >  You posted the space for the tables and indexes.  The interesting ones
   >  are the big ones.  You have 5 tables or indexes over 1000 pages big.
   >  If in the worst case your 1000 value IN list happens to be on 1000
   >  different pages then Derby is going to need to do at least 1000 i/o's
   >  to get to them - I usually use a back-of-envelope max of 100 i/o's per
   >  second (even if your disk has specs that say a higher rate, this I/O is
   >  not going to get streamed as fast as possible by this query; it is
   >  going to ask for a page, process it, do some join work, then later ask
   >  for another page, ...)
   >  : (these look like SYSCS_DIAG.SPACE_TABLE columns: CONGLOMERATENAME,
   >  ISINDEX, NUMALLOCATEDPAGES, NUMFREEPAGES, NUMUNFILLEDPAGES, PAGESIZE,
   >  ESTIMSPACESAVING)
   >  > CATEGORY_MASTER            0    103    0    0    4096    0
   >  > SQL090406091302600        1    55    0    0    4096    0
   >  > SQL090406091302601        1    160    0    1    4096    0
   >  > SQL090406091302730        1    1    0    1    4096    0
   >  > OBJECT_MASTER            0    10497    0    0    4096    0
   >  > SQL090406091302760        1    5340    0    1    4096    0
   >  > SQL090406091302761        1    16708    0    410    4096    0
   >  > OBJECT_CATEGORY_MAPPING        0    150794    0    0    4096    0
   >  > OBJECT_CATEGORY_MAPPING_INDEX    1    112177    0    57    4096    0
   >

   Mike,

   I think that a lot of this information is a bit skewed. Outside of the
   primary index, the indexes he created included the varchar field. Not
   sure why he did this, except under the impression that he'd only have
   to hit the index and not the underlying table. While there is some
   potential merit to this, I think that there are things that he can do
   to improve performance. (Hence my post about reworking the query itself
   and using a temp table.) Drop those indexes and you'll see a big change
   in database size.


   >  There was work done in 10.3 on IN-LISTS, making them perform more like
   >  unions; see DERBY-47.  So if you have a choice of releases I would
   >  suggest you move to 10.4 and post the query plan and results against
   >  that.  The basic idea of that change was to allow the system to do 1
   >  probe into an index for each value in the IN-LIST; before this change
   >  Derby could only sort the values in the IN list and then limit an
   >  index scan to the lowest and biggest values in the in list.  So for
   >  instance for OBJECT_CATEGORY_MAPPING_INDEX, worst case it might have
   >  to scan 112177 pages to find the 1000 rows, where worst case for
   >  probing would be 1000 pages (plus btree parent index pages, but those
   >  are much more likely cached).  The problem is that there is definitely
   >  overhead for probing one at a time; scans go much faster - so there is
   >  a crossover point - i.e. it would likely be better to scan all 112177
   >  pages than do 100,000 probes.
   >

   I believe that it was already recommended that he do just that.
   There are two ways he could use the temp table: as a sub-select
   statement, or as part of the table join.
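   For instance (a sketch against the poster's sample schema; the SESSION
   table name is made up):

   public class TempTableShapes {
       // Session-private temp table, as tried earlier in the thread.
       static final String DECLARE =
           "DECLARE GLOBAL TEMPORARY TABLE SESSION.IDS (ID INT) "
           + "ON COMMIT PRESERVE ROWS NOT LOGGED";

       // 1) Temp table as a sub-select:
       static final String AS_SUBSELECT =
           "SELECT cm.CATEGORY_NAME, COUNT(cm.CATEGORY_NAME) AS CATEGORY_COUNT "
           + "FROM OBJECT_CATEGORY_MAPPING ocm, CATEGORY_MASTER cm "
           + "WHERE ocm.OBJECT_ID IN (SELECT ID FROM SESSION.IDS) "
           + "AND cm.CATEGORY_ID = ocm.CATEGORY_ID "
           + "GROUP BY cm.CATEGORY_NAME ORDER BY CATEGORY_COUNT DESC";

       // 2) Temp table as part of the join:
       static final String AS_JOIN =
           "SELECT cm.CATEGORY_NAME, COUNT(cm.CATEGORY_NAME) AS CATEGORY_COUNT "
           + "FROM SESSION.IDS ids "
           + "JOIN OBJECT_CATEGORY_MAPPING ocm ON ocm.OBJECT_ID = ids.ID "
           + "JOIN CATEGORY_MASTER cm ON cm.CATEGORY_ID = ocm.CATEGORY_ID "
           + "GROUP BY cm.CATEGORY_NAME ORDER BY CATEGORY_COUNT DESC";
   }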

   I think this would bypass the whole use of the IN list. I'm still not
   100% sure why there are 100+ values coming from an outside source.
   Based on his query below it looks like the object_ids in the IN clause
   are not unique...

   It's kind of hard trying to help someone when you don't know the whole
   problem....

   -Mike


   >  arindam.bhattacharjee wrote:
   >  > Hello Knut,
   >  >
   >  > Thanks for your quick response. This is a sample database which I
   >  > have created just for testing out the performance, and it has been
   >  > written to only once, in one go. I tried temp tables but that is
   >  > just too slow. The IN clause has values which come from another
   >  > source and I can't modify that.
   >  >
   >  > However, I will try out what you state below. But still, I wanted
   >  > to get your pulse on whether Derby can respond in sub 100 millisec
   >  > time with the table sizes you see above?
   >  >
   >  > I find that:
   >  >
   >  > select category_master.category_name,
   >  >     count(category_master.category_name) as category_count
   >  > from
   >  >     (
   >  >             select internal.object_id
   >  >             from
   >  >             (
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1002) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all
   >  >                     values(1001) union all .......
   >  >                     values(9999)
   >  >             ) as internal(object_id)
   >  >
   >  >     ) as external_ids,
   >  >     object_master,
   >  >     category_master,
   >  >     object_category_mapping
   >  > where
   >  >     external_ids.object_id = object_master.object_id and
   >  >     external_ids.object_id = object_category_mapping.object_id and
   >  >     object_master.object_id = object_category_mapping.object_id and
   >  >     category_master.category_id = object_category_mapping.category_id
   >  > group by
   >  >     category_master.category_name
   >  > order by
   >  >     category_count desc
   >  >
   >  > is much faster; unfortunately connection.prepareStatement() is
   >  > taking way too much memory (both stack and heap - I have a
   >  > constraint of 256 MB MAX memory for my JVM), which goes beyond my
   >  > application's resources. Is there a way I can precompile some SQLs
   >  > which are very expensive to parse during execution?
   >  >
   >  > Best regards,
   >  >
   >  > Arindam.
   >  >
   >  >
   >  > Knut Anders Hatlen wrote:
   >  >> "arindam.bhattacharjee" <mr.arindam.bhattacharjee@gmail.com
   <mailto:mr.arindam.bhattacharjee@gmail.com>> writes:
   >  >>
   >  >>> Hello,
   >  >>>
   >  >>> I would like my query below to return within 100 millisecs.
   >  >>> Please help me; the values for the IN clause come from outside,
   >  >>> hence I cannot really change the IN clause to a join on an
   >  >>> existing table.
   >  >> Hi Arindam,
   >  >>
   >  >> Does the query run faster if you compress all the tables involved,
   >  >> or if you drop and recreate all the indexes? If so, it is likely
   >  >> that the index cardinality statistics are out of date, which may
   >  >> make the optimizer pick a bad execution plan. Currently, index
   >  >> cardinality statistics are only updated at index creation time,
   >  >> when tables are compressed, and when columns are dropped. A more
   >  >> automatic solution is being worked on. For more details, see:
   >  >>
   >  >> https://issues.apache.org/jira/browse/DERBY-269
   >  >> https://issues.apache.org/jira/browse/DERBY-3788
   >  >> http://db.apache.org/derby/docs/10.4/tuning/ctunstats18908.html
   >  >>
   >  >> You may be experiencing some other problem, but this is a problem
   >  >> that keeps coming up, so I think it's worth checking.
   >  >>
   >  >> Hope this helps,
   >  >>
   >  >> --
   >  >> Knut Anders
   >  >>
   >  >>
   >  >