t sounded like your standard app tends to load a bunch of data and then
deliver those dbs for read only access. Again this is a minor tweak but
you may see some improvement if you can sort the data by primary key for
insert into the base table. This will give you a one time clustering that
would improve index scans that require index to base table look up.
Thinking about this, I wonder if we should add an option to compress table
to do this automatically? It would be very easy to do (ie. a one time
cluster - continual cluster would be a lot more work). All that would have
to happen is at compress time throw all the rows into the sorter and then
insert the result on the other side.
[Arindam] This is a great suggestion. Our DB is dynamic however only one
modification owner can write to it. The rest will see a copy. We can just
compress the master DB before we replicate it to the copies. By the way, I
tested on mysql - it is processing 1000 records in IN clause for the same DB
in 16 millis. I have increased the query page size to 16 MB. I will send the
execution plan as I get them.
On Wed, Apr 8, 2009 at 10:31 PM, Mike Matrigali <mikem_app@sbcglobal.net>wrote:
> Arindam Bhattacharjee wrote:
>
>> Thanks for the excellent summary Mike. The IN clause will have an unique
>> list of object ids. If we can return 100 object ids in good time, we are
>> okay. The users will be given an option to increase this to MAX 1000 -
>> however, they will be warned that the performance would be slower. The real
>> power of Derby according to us is the flexibility to manage the life cycle
>> of the database files without support from DB admins. We want to make the DB
>> management completely hidden from our customers. However, SolidDB and
>> TimesTen are good options which we can investigate but Derby has the
>> advantage of being used within our organization already. In general we liked
>> its memory footprint - except for one case of SQL parsing, it was mostly
>> well behaved.
>>
>> Question: will changing page size for the tables improve query speed? We
>> will always run one query through multiple threads on Derby and we will not
>> use it for any other purpose. We are okay to be a little slow on the
>> insertions but retrieval should be as quick and fast as possible. The
>> database will be created and replicated. The replicated databases will be
>> read-only.
>>
> I am not sure about page size but seems reasonable to try upping them all
> to 32k. I usually leave page size to the very end of tuning. Best is just
> to run the experiment. You definitely have enough rows so
> you won't be wasting any space going to bigger pages. The trade off is
> that i/o will be better but processing bigger pages requires more cpu
> for things like binary searches on the btree pages. But bigger pages may
> also shrink the height of the btree so again it is a tradeoff.
> Another downside is that memory footprint of derby will increase as Derby
> currently sizes it's cache by number of pages so where you may
> have 1000 (default cache size is 1000 pages) 4k pages cached you now might
> get 1000 32k pages.
>
> It sounded like your standard app tends to load a bunch of data and then
> deliver those dbs for read only access. Again this is a minor tweak but
> you may see some improvement if you can sort the data by primary key for
> insert into the base table. This will give you a one time clustering that
> would improve index scans that require index to base table look up.
> Thinking about this, I wonder if we should add an option to compress table
> to do this automatically? It would be very easy to do (ie. a one time
> cluster - continual cluster would be a lot more work). All that would have
> to happen is at compress time throw all the rows into the sorter and then
> insert the result on the other side.
>
>
>> Best regards,
>>
>> Arindam.
>>
>>
>> On Wed, Apr 8, 2009 at 6:23 PM, <derby@segel.com <mailto:derby@segel.com>>
>> wrote:
>>
>>
>> Arindam,
>>
>>
>> Ok,
>>
>>
>> Just a few things…
>>
>>
>> 1) Derby is a free, open source, product. Warts and all, its
>> not going to be your best choice for a commercial rdbms product.
>> (Sorry Derby fans, you get what you pay for.)
>>
>> 2) You want speed, you’re going to have to consider IBM’s
>> SolidDB which is an in memory database along with Oracle’s TimesTen
>> database. This will give you the speed that you want.
>>
>> 3) You’re running on a laptop where your disk drive could be a
>> 5400 rpm IDE drive. Laptop drive == low energy consumption and low
>> performance. Disk based solutions will be much slower on a laptop
>> than on a ‘comparable’ desktop and/or server.
>>
>> 4) I’m not sure why you had to load and then compress your
>> tables. I could understand that if you created your index prior to
>> loading the data that you could have to update the statistics.
>>
>> 5) Since we don’t know what you’re actually trying to do,
>> there could be a better or more efficient design. Having an IN
>> clause with 1000’s of entries is definitely not a good design for
>> any database.
>>
>> 6) You never did answer the question about the object ids that
>> were in the IN clause. Were they unique or were there duplicates?
>>
>>
>>
>> HTH
>>
>>
>> -Mike
>>
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Arindam Bhattacharjee
>> [mailto:mr.arindam.bhattacharjee@gmail.com
>> <mailto:mr.arindam.bhattacharjee@gmail.com>]
>> *Sent:* Wednesday, April 08, 2009 2:14 AM
>>
>> *To:* Derby Discussion; msegel@segel.com <mailto:msegel@segel.com>
>>
>> *Subject:* Re: URGENT!!! JDBC SQL query taking long time for large
>> IN clause
>>
>>
>> Few inputs/observations:
>>
>> a) This is not a school project. This is a project which makes money
>> for us today. We have functioning product running today, developed
>> by us which doesn't use any RDBMS's to achieve what I am trying to
>> achieve with the tables. For 200,000 object master we get 45 - 60
>> millis performance. The reason we are trying to use an RDBMS is to
>> ensure that we can scale even higher and updates are easier and more
>> maintenable. Hence all these questions.
>>
>> b) I tried removing the "preserve rows" from the declare temp table
>> clause - and that didn't help me either.
>>
>> c) The data base which I have created is just a sample database
>> which vaguely represents the load which we will see if we implement
>> the existing feature of ours, using Derby RDBMS. It doesn't contain
>> REAL data.
>>
>> d) The entire database was created in one shot, as I have stated
>> earlier, and records are populated in one go. That was the actual
>> problem - thanks to Knuth for indicating that as a possibility. When
>> I compress the tables - the quries start performing a LOT faster! 1
>> million in object master, 20 million in object category mapping and
>> 10 K in category master and 1000 in IN clause gives me 1100 millis
>> speed now - down from 30 seconds. That probably explains why the
>> object master index scans were taking nearly 30 seconds in the
>> runtimestats which I posted. Which is slow, but still manageable
>> since the 100 and 500s are much faster. However, that is still a lot
>> slower than other databases we are testing on. Since Derby is
>> embedded we expected it to be much faster since no IPC is required
>> between the client driver and the network server.
>>
>> e) IN doesn't function properly beyond 1000 object ids, and we used
>> a regular table CREATE TABLE DUMMY (ID INTEGER PRIMARY KEY).
>> Insertion of 5000 records into this happened in 100 millis (after
>> compression of the tables) and the query took 4 seconds. With
>> executeUpdate() or execute() (as suggested) insertion was at least
>> 200% slower. So the execute batch works better for the inserts into
>> the table for the payloads required by us. The temp table query took
>> 10 MINS (this time I waited patiently for it to finish - there was
>> nothing wrong there - the query just takes too much time and I
>> didn't have patience to wait for it to finish the first time :) ) to
>> execute while the regular table with primary key took just 4 seconds.
>>
>> f) The stack overflow exception for "select ... values...union" was
>> much less of a problem in 10.4 compared to 10.1 since the memory
>> usage is tapering off to the -mx<Size In Megs> for the JVM. But,
>> since it cannot be parameterized it isn't of much use to us. IN
>> clause based queries are getting parsed with much lower memory
>> footprints.
>>
>> Best regards,
>>
>> Arindam.
>>
>> On Wed, Apr 8, 2009 at 2:39 AM, <derby@segel.com
>> <mailto:derby@segel.com>> wrote:
>>
>>
>> > -----Original Message-----
>> > From: Mike Matrigali [mailto:mikem_app@sbcglobal.net
>> <mailto:mikem_app@sbcglobal.net>]
>> > Sent: Tuesday, April 07, 2009 2:05 PM
>> > To: Derby Discussion
>>
>> > Subject: Re: URGENT!!! JDBC SQL query taking long time for large
>> IN clause
>> >
>>
>> > It is impossible to say what the performance of the query can be
>> without
>> > knowing exact values of all the values of the IN LIST. But it is
>> > possible to get some idea assuming some worst case behavior, and
>> from
>> > that I am going to guess you will never come close to 100ms with an
>> > uncached database, on hardware using some sort of standard disk
>> based
>> > hard drive.
>> >
>> > I do think the query may go faster with index and query tweeking,
>> but
>> > 100ms to an uncached db and non-clustered unique values in that IN
>> list
>> > is never going to go that fast. Adding up just what is posted it
>> looks
>> > like this is a 1.2 gig db.
>> >
>>
>> Drop the unnecessary indexes and you'll see the database size shrink
>> fast.
>> Also note that he's running this on a Windows XP laptop. Depending
>> on the
>> model of the lap top, you will have not only CPU issues but also
>> disk i/o
>> issues as well. (5400 rpm IDE as an example....)
>>
>> However, it is possible for the OP to get better performance, if not
>> realistically 100ms performance. (BTW where did 100ms come from? I'm
>> sorry
>> but this really sounds like a class project...)
>>
>>
>> > You posted the space for the tables and indexes. The interesting
>> ones
>> > are the big ones. You have 5 tables or indexes over 1000 pages
>> big. If
>> > in the worst case your 1000 value IN list happens to be on 1000
>> > different pages then Derby is going to need to do at least 1000
>> i/o's to
>> > get to them - I usually use back of envelope of max 100 i/o's per
>> second
>> > (even if your disk has specs that say higher rate this I/O is not
>> > going to
>> > get streamed as fast as possible by this query, it is going to ask
>> for
>> > page, process it, do some join work then later ask for another
>> page, ...)
>> > :
>> > > CATEGORY_MASTER 0 103 0 0 4096 0
>> > > SQL090406091302600 1 55 0 0 4096 0
>> > > SQL090406091302601 1 160 0 1 4096 0
>> > > SQL090406091302730 1 1 0 1 4096 0
>> > > OBJECT_MASTER 0 10497 0 0 4096 0
>> > > SQL090406091302760 1 5340 0 1 4096 0
>> > > SQL090406091302761 1 16708 0 410 4096 0
>> > > OBJECT_CATEGORY_MAPPING 0 150794 0 0 4096 0
>> > > OBJECT_CATEGORY_MAPPING_INDEX 1 112177 0 57 4096
>> 0
>> >
>>
>> Mike,
>>
>> I think that a lot of this information is a bit skewed. Outside of the
>> primary index, the indexes he created included the varchar field.
>> Not sure
>> why he did this except under the impression that he'd only have to
>> hit the
>> index and not the underlying table. While there is some potential
>> merit to
>> this, I think that there are things that he can do to improve
>> performance.
>> (Hence my post about reworking the query itself and using a temp
>> table.)
>> Drop those indexes and you'll see a big change in database size.
>>
>>
>> > There was work done in 10.3 on IN-LISTS, making them perform more
>> like
>> > unions, See DERBY-47. So if you have a choice of releases I would
>> > suggest you move to 10.4 and post query plan and results against
>> that.
>> > The basic idea of that change was to allow the
>> > system to do 1 probe into an index for each value in the IN-LIST,
>> before
>> > this change DERBY could only sort the values in the IN list and then
>> > limit a index scan to the lowest and biggest values in the in list.
>> > So for instance for OBJECT_CATEGORY_MAPPING_INDEX, worst case it
>> might
>> > have to scan 112177 pages to find the 1000 rows, where worst case
>> for
>> > probing would be 1000 page (plus btree parent index pages, but those
>> > are much more likely cached). The problem is that there is
>> definitely
>> > overhead for probing one at a time, scans go much faster - so there
>> is
>> > a crossover point - ie. I would guess it would likely better to
>> scan all
>> > 112177 pages then do 100,000 probes.
>> >
>>
>> I believe that it was already recommended that he do just that.
>> There are two ways he could use the temp table. As a sub-select
>> statement,
>> or as part of the table join.
>>
>> I think this would bypass the whole use of the IN list. I'm still
>> not 100%
>> sure why there's 100+ values coming from an outside source. Based on
>> his
>> query below it looks like the object_ids in the IN clause are not
>> unique...
>>
>> Its kind of hard trying to help someone when you don't know the whole
>> problem....
>>
>> -Mike
>>
>>
>> > arindam.bhattacharjee wrote:
>> > > Hello Knut,
>> > >
>> > > Thanks for your quick response. This is a sample database which
>> I have
>> > > created just for testing out the performance and has been written
>> to
>> > only
>> > > once in one go. I tried temp tables but that is just too slow.
>> The IN
>> > clause
>> > > has values which comes from another source and I can't modify
>> that.
>> > >
>> > > However, I will try out what you state below. But still, I
>> wanted to get
>> > > your pulse about whether Derby can respond in sub 100 millisec
>> time with
>> > the
>> > > table sizes you see above?
>> > >
>> > > I find that:
>> > >
>> > > select category_master.category_name,
>> > count(category_master.category_name)
>> > > as category_count
>> > > from
>> > > (
>> > > select internal.object_id
>> > > from
>> > > (
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1002) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all
>> > > values(1001) union all .......
>> > > values(9999)
>> > > ) as internal(object_id)
>> > >
>> > > ) as external_ids,
>> > > object_master,
>> > > category_master,
>> > > object_category_mapping
>> > > where
>> > > external_ids.object_id = object_master.object_id and
>> > > external_ids.object_id = object_category_mapping.object_id and
>> > > object_master.object_id = object_category_mapping.object_id
>> and
>> > > category_master.category_id =
>> object_category_mapping.category_id
>> > > group by
>> > > category_master.category_name
>> > > order by
>> > > category_count desc
>> > >
>> > > is much faster unfortunately connection.prepareStatement() is
>> taking way
>> > too
>> > > much memory (both stack and heap - I have a constraint of 256 MB
>> MAX
>> > memory
>> > > for my JVM) which goes beyond my applications resources. Is
>> there a way
>> > I
>> > > can precompile some SQLs which are very expensive to parse during
>> > execution.
>> > >
>> > > Best regards,
>> > >
>> > > Arindam.
>> > >
>> > >
>> > > Knut Anders Hatlen wrote:
>> > >> "arindam.bhattacharjee" <mr.arindam.bhattacharjee@gmail.com
>> <mailto:mr.arindam.bhattacharjee@gmail.com>> writes:
>> > >>
>> > >>> Hello,
>> > >>>
>> > >>> I would like my query below to return within 100 millisecs.
>> Please
>> > help
>> > >>> me,
>> > >>> and the values for the IN clause comes from outside hence cannot
>> > really
>> > >>> change the IN clause to a join on an existing table.
>> > >> Hi Arindam,
>> > >>
>> > >> Does the query run faster if you compress all the tables
>> involved, or
>> > if
>> > >> you drop and recreate all the indexes? If so, it is likely that
>> the
>> > >> index cardinality statistics are out of date, which may make the
>> > >> optimizer pick a bad execution plan. Currently, index cardinality
>> > >> statistics are only updated at index creation time, when tables
>> are
>> > >> compressed, and when columns are dropped. A more automatic
>> solution is
>> > >> being worked on. For more details, see:
>> > >>
>> > >> https://issues.apache.org/jira/browse/DERBY-269
>> > >> https://issues.apache.org/jira/browse/DERBY-3788
>> > >> http://db.apache.org/derby/docs/10.4/tuning/ctunstats18908.html
>> > >>
>> > >> You may be experiencing some other problem, but this is a
>> problem that
>> > >> keeps coming up, so I think it's worth checking.
>> > >>
>> > >> Hope this helps,
>> > >>
>> > >> --
>> > >> Knut Anders
>> > >>
>> > >>
>> > >
>>
>>
>>
>>
>>
>
|