db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arindam Bhattacharjee <mr.arindam.bhattachar...@gmail.com>
Subject Re: URGENT!!! JDBC SQL query taking long time for large IN clause
Date Wed, 08 Apr 2009 17:30:27 GMT
t sounded like your standard app tends to load a bunch of data and then
deliver those dbs for read only access.  Again this is a minor tweak but
you may see some improvement if you can sort the data by primary key for
insert into the base table.  This will give you a one time clustering that
would improve index scans that require index to base table look up.
Thinking about this, I wonder if we should add an option to compress table
to do this automatically?  It would be very easy to do (ie. a one time
cluster - continual cluster would be a lot more work). All that would have
to happen is at compress time throw all the rows into the sorter and then
insert the result on the other side.
[Arindam] This is a great suggestion. Our DB is dynamic however only one
modification owner can write to it. The rest will see a copy. We can just
compress the master DB before we replicate it to the copies. By the way, I
tested on mysql - it is processing 1000 records in IN clause for the same DB
in 16 millis. I have increased the query page size to 16 MB. I will send the
execution plan as I get them.

On Wed, Apr 8, 2009 at 10:31 PM, Mike Matrigali <mikem_app@sbcglobal.net>wrote:

> Arindam Bhattacharjee wrote:
>
>> Thanks for the excellent summary Mike. The IN clause will have an unique
>> list of object ids. If we can return 100 object ids in good time, we are
>> okay. The users will be given an option to increase this to MAX 1000 -
>> however, they will be warned that the performance would be slower. The real
>> power of Derby according to us is the flexibility to manage the life cycle
>> of the database files without support from DB admins. We want to make the DB
>> management completely hidden from our customers. However, SolidDB and
>> TimesTen are good options which we can investigate but Derby has the
>> advantage of being used within our organization already. In general we liked
>> its memory footprint - except for one case of SQL parsing, it was mostly
>> well behaved.
>>
>> Question: will changing page size for the tables improve query speed? We
>> will always run one query through multiple threads on Derby and we will not
>> use it for any other purpose. We are okay to be a little slow on the
>> insertions but retrieval should be as quick and fast as possible. The
>> database will be created and replicated. The replicated databases will be
>> read-only.
>>
> I am not sure about page size but seems reasonable to try upping them all
> to 32k.  I usually leave page size to the very end of tuning.  Best is just
> to run the experiment.  You definitely have enough rows so
> you won't be wasting any space going to bigger pages.  The trade off is
> that i/o will be better but processing bigger pages requires more cpu
> for things like binary searches on the btree pages.  But bigger pages may
> also shrink the height of the btree so again it is a tradeoff.
> Another downside is that memory footprint of derby will increase as Derby
> currently sizes it's cache by number of pages so where you may
> have 1000 (default cache size is 1000 pages) 4k pages cached you now might
> get 1000 32k pages.
>
> It sounded like your standard app tends to load a bunch of data and then
> deliver those dbs for read only access.  Again this is a minor tweak but
> you may see some improvement if you can sort the data by primary key for
> insert into the base table.  This will give you a one time clustering that
> would improve index scans that require index to base table look up.
> Thinking about this, I wonder if we should add an option to compress table
> to do this automatically?  It would be very easy to do (ie. a one time
> cluster - continual cluster would be a lot more work). All that would have
> to happen is at compress time throw all the rows into the sorter and then
> insert the result on the other side.
>
>
>> Best regards,
>>
>> Arindam.
>>
>>
>> On Wed, Apr 8, 2009 at 6:23 PM, <derby@segel.com <mailto:derby@segel.com>>
>> wrote:
>>
>>
>>    Arindam,
>>
>>
>>    Ok,
>>
>>
>>    Just a few things…
>>
>>
>>    1)       Derby is a free, open source, product. Warts and all, its
>>    not going to be your best choice for a commercial rdbms product.
>>    (Sorry Derby fans, you get what you pay for.)
>>
>>    2)       You want speed, you’re going to have to consider IBM’s
>>    SolidDB which is an in memory database along with Oracle’s TimesTen
>>    database. This will give you the speed that you want.
>>
>>    3)       You’re running on a laptop where your disk drive could be a
>>    5400 rpm IDE drive. Laptop drive == low energy consumption and low
>>    performance. Disk based solutions will be much slower on a laptop
>>    than on a ‘comparable’ desktop and/or server.
>>
>>    4)       I’m not sure why you had to load and then compress your
>>    tables. I could understand that if you created your index prior to
>>    loading the data that you could have to update the statistics.
>>
>>    5)       Since we don’t know what you’re actually trying to do,
>>    there could be a better or more efficient design. Having an IN
>>    clause with 1000’s of entries is definitely not a good design for
>>    any database.
>>
>>    6)       You never did answer the question about the object ids that
>>    were in the IN clause. Were they unique or were there duplicates?
>>
>>
>>
>>    HTH
>>
>>
>>    -Mike
>>
>>
>>  ------------------------------------------------------------------------
>>
>>    *From:* Arindam Bhattacharjee
>>    [mailto:mr.arindam.bhattacharjee@gmail.com
>>    <mailto:mr.arindam.bhattacharjee@gmail.com>]
>>    *Sent:* Wednesday, April 08, 2009 2:14 AM
>>
>>    *To:* Derby Discussion; msegel@segel.com <mailto:msegel@segel.com>
>>
>>    *Subject:* Re: URGENT!!! JDBC SQL query taking long time for large
>>    IN clause
>>
>>
>>    Few inputs/observations:
>>
>>    a) This is not a school project. This is a project which makes money
>>    for us today. We have functioning product running today, developed
>>    by us which doesn't use any RDBMS's to achieve what I am trying to
>>    achieve with the tables. For 200,000 object master we get 45 - 60
>>    millis performance. The reason we are trying to use an RDBMS is to
>>    ensure that we can scale even higher and updates are easier and more
>>    maintenable. Hence all these questions.
>>
>>    b) I tried removing the "preserve rows" from the declare temp table
>>    clause - and that didn't help me either.
>>
>>    c) The data base which I have created is just a sample database
>>    which vaguely represents the load which we will see if we implement
>>    the existing feature of ours, using Derby RDBMS. It doesn't contain
>>    REAL data.
>>
>>    d) The entire database was created in one shot, as I have stated
>>    earlier, and records are populated in one go. That was the actual
>>    problem - thanks to Knuth for indicating that as a possibility. When
>>    I compress the tables - the quries start performing a LOT faster! 1
>>    million in object master, 20 million in object category mapping and
>>    10 K in category master and 1000 in IN clause gives me 1100 millis
>>    speed now - down from 30 seconds. That probably explains why the
>>    object master index scans were taking nearly 30 seconds in the
>>    runtimestats which I posted. Which is slow, but still manageable
>>    since the 100 and 500s are much faster. However, that is still a lot
>>    slower than other databases we are testing on. Since Derby is
>>    embedded we expected it to be much faster since no IPC is required
>>    between the client driver and the network server.
>>
>>    e) IN doesn't function properly beyond 1000 object ids, and we used
>>    a regular table CREATE TABLE DUMMY (ID INTEGER PRIMARY KEY).
>>    Insertion of 5000 records into this happened in 100 millis (after
>>    compression of the tables) and the query took 4 seconds. With
>>    executeUpdate() or execute() (as suggested) insertion was at least
>>    200% slower. So the execute batch works better for the inserts into
>>    the table for the payloads required by us. The temp table query took
>>    10 MINS (this time I waited patiently for it to finish - there was
>>    nothing wrong there - the query just takes too much time and I
>>    didn't have patience to wait for it to finish the first time :) ) to
>>    execute while the regular table with primary key took just 4 seconds.
>>
>>    f) The stack overflow exception for "select ... values...union" was
>>    much less of a problem in 10.4 compared to 10.1 since the memory
>>    usage is tapering off to the -mx<Size In Megs> for the JVM. But,
>>    since it cannot be parameterized it isn't of much use to us. IN
>>    clause based queries are getting parsed with much lower memory
>>    footprints.
>>
>>    Best regards,
>>
>>    Arindam.
>>
>>    On Wed, Apr 8, 2009 at 2:39 AM, <derby@segel.com
>>    <mailto:derby@segel.com>> wrote:
>>
>>
>>    >  -----Original Message-----
>>    >  From: Mike Matrigali [mailto:mikem_app@sbcglobal.net
>>    <mailto:mikem_app@sbcglobal.net>]
>>    >  Sent: Tuesday, April 07, 2009 2:05 PM
>>    >  To: Derby Discussion
>>
>>    >  Subject: Re: URGENT!!! JDBC SQL query taking long time for large
>>    IN clause
>>    >
>>
>>    >  It is impossible to say what the performance of the query can be
>>    without
>>    >  knowing exact values of all the values of the IN LIST.  But it is
>>    >  possible to get some idea assuming some worst case behavior, and
>> from
>>    >  that I am going to guess you will never come close to 100ms with an
>>    >  uncached database, on hardware using some sort of standard disk
>> based
>>    >  hard drive.
>>    >
>>    >  I do think the query may go faster with index and query tweeking,
>> but
>>    >  100ms to an uncached db and non-clustered unique values in that IN
>>    list
>>    >  is never going to go that fast.  Adding up just what is posted it
>>    looks
>>    >  like this is a 1.2 gig db.
>>    >
>>
>>    Drop the unnecessary indexes and you'll see the database size shrink
>>    fast.
>>    Also note that he's running this on a Windows XP laptop. Depending
>>    on the
>>    model of the lap top, you will have not only CPU issues but also
>>    disk i/o
>>    issues as well. (5400 rpm IDE as an example....)
>>
>>    However, it is possible for the OP to get better performance, if not
>>    realistically 100ms performance. (BTW where did 100ms come from? I'm
>>    sorry
>>    but this really sounds like a class project...)
>>
>>
>>    >  You posted the space for the tables and indexes.  The interesting
>> ones
>>    >  are the big ones.  You have 5 tables or indexes over 1000 pages
>>    big.  If
>>    >  in the worst case your 1000 value IN list happens to be on 1000
>>    >  different pages then Derby is going to need to do at least 1000
>>    i/o's to
>>    >  get to them - I usually use back of envelope of max 100 i/o's per
>>    second
>>    >  (even if your disk has specs that say higher rate this I/O is not
>>    >  going to
>>    >  get streamed as fast as possible by this query, it is going to ask
>> for
>>    >  page, process it, do some join work then later ask for another
>>    page, ...)
>>    >  :
>>    >  > CATEGORY_MASTER            0    103    0    0    4096    0
>>    >  > SQL090406091302600        1    55    0    0    4096    0
>>    >  > SQL090406091302601        1    160    0    1    4096    0
>>    >  > SQL090406091302730        1    1    0    1    4096    0
>>    >  > OBJECT_MASTER            0    10497    0    0    4096    0
>>    >  > SQL090406091302760        1    5340    0    1    4096    0
>>    >  > SQL090406091302761        1    16708    0    410    4096    0
>>    >  > OBJECT_CATEGORY_MAPPING        0    150794    0    0    4096    0
>>    >  > OBJECT_CATEGORY_MAPPING_INDEX    1    112177    0    57    4096
>>  0
>>    >
>>
>>    Mike,
>>
>>    I think that a lot of this information is a bit skewed. Outside of the
>>    primary index, the indexes he created included the varchar field.
>>    Not sure
>>    why he did this except under the impression that he'd only have to
>>    hit the
>>    index and not the underlying table. While there is some potential
>>    merit to
>>    this, I think that there are things that he can do to improve
>>    performance.
>>    (Hence my post about reworking the query itself and using a temp
>> table.)
>>    Drop those indexes and you'll see a big change in database size.
>>
>>
>>    >  There was work done in 10.3 on IN-LISTS, making them perform more
>> like
>>    >  unions,  See DERBY-47.  So if you have a choice of releases I would
>>    >  suggest you move to 10.4 and post query plan and results against
>> that.
>>    >  The basic idea of that change was to allow the
>>    >  system to do 1 probe into an index for each value in the IN-LIST,
>>    before
>>    >  this change DERBY could only sort the values in the IN list and then
>>    >  limit a index scan to the lowest and biggest values in the in list.
>>    >  So for instance for OBJECT_CATEGORY_MAPPING_INDEX, worst case it
>> might
>>    >  have to scan 112177 pages to find the 1000 rows, where worst case
>> for
>>    >  probing would be 1000 page (plus btree parent index pages, but those
>>    >  are much more likely cached).  The problem is that there is
>> definitely
>>    >  overhead for probing one at a time, scans go much faster - so there
>> is
>>    >  a crossover point - ie. I would guess it would likely better to
>>    scan all
>>    >  112177 pages then do 100,000 probes.
>>    >
>>
>>    I believe that it was already recommended that he do just that.
>>    There are two ways he could use the temp table. As a sub-select
>>    statement,
>>    or as part of the table join.
>>
>>    I think this would bypass the whole use of the IN list. I'm still
>>    not 100%
>>    sure why there's 100+ values coming from an outside source. Based on
>> his
>>    query below it looks like the object_ids in the IN clause are not
>>    unique...
>>
>>    Its kind of hard trying to help someone when you don't know the whole
>>    problem....
>>
>>    -Mike
>>
>>
>>    >  arindam.bhattacharjee wrote:
>>    >  > Hello Knut,
>>    >  >
>>    >  > Thanks for your quick response. This is a sample database which
>>    I have
>>    >  > created just for testing out the performance and has been written
>> to
>>    >  only
>>    >  > once in one go. I tried temp tables but that is just too slow.
>>    The IN
>>    >  clause
>>    >  > has values which comes from another source and I can't modify
>> that.
>>    >  >
>>    >  > However, I will try out what you state below. But still, I
>>    wanted to get
>>    >  > your pulse about whether Derby can respond in sub 100 millisec
>>    time with
>>    >  the
>>    >  > table sizes you see above?
>>    >  >
>>    >  > I find that:
>>    >  >
>>    >  > select category_master.category_name,
>>    >  count(category_master.category_name)
>>    >  > as category_count
>>    >  > from
>>    >  >     (
>>    >  >             select internal.object_id
>>    >  >             from
>>    >  >             (
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1002) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all
>>    >  >                     values(1001) union all .......
>>    >  >                     values(9999)
>>    >  >             ) as internal(object_id)
>>    >  >
>>    >  >     ) as external_ids,
>>    >  >     object_master,
>>    >  >     category_master,
>>    >  >     object_category_mapping
>>    >  > where
>>    >  >     external_ids.object_id = object_master.object_id and
>>    >  >     external_ids.object_id = object_category_mapping.object_id and
>>    >  >     object_master.object_id = object_category_mapping.object_id
>> and
>>    >  >     category_master.category_id =
>>    object_category_mapping.category_id
>>    >  > group by
>>    >  >     category_master.category_name
>>    >  > order by
>>    >  >     category_count desc
>>    >  >
>>    >  > is much faster unfortunately connection.prepareStatement() is
>>    taking way
>>    >  too
>>    >  > much memory (both stack and heap - I have a constraint of 256 MB
>> MAX
>>    >  memory
>>    >  > for my JVM) which goes beyond my applications resources. Is
>>    there a way
>>    >  I
>>    >  > can precompile some SQLs which are very expensive to parse during
>>    >  execution.
>>    >  >
>>    >  > Best regards,
>>    >  >
>>    >  > Arindam.
>>    >  >
>>    >  >
>>    >  > Knut Anders Hatlen wrote:
>>    >  >> "arindam.bhattacharjee" <mr.arindam.bhattacharjee@gmail.com
>>    <mailto:mr.arindam.bhattacharjee@gmail.com>> writes:
>>    >  >>
>>    >  >>> Hello,
>>    >  >>>
>>    >  >>> I would like my query below to return within 100 millisecs.
>> Please
>>    >  help
>>    >  >>> me,
>>    >  >>> and the values for the IN clause comes from outside hence cannot
>>    >  really
>>    >  >>> change the IN clause to a join on an existing table.
>>    >  >> Hi Arindam,
>>    >  >>
>>    >  >> Does the query run faster if you compress all the tables
>>    involved, or
>>    >  if
>>    >  >> you drop and recreate all the indexes? If so, it is likely that
>> the
>>    >  >> index cardinality statistics are out of date, which may make the
>>    >  >> optimizer pick a bad execution plan. Currently, index cardinality
>>    >  >> statistics are only updated at index creation time, when tables
>> are
>>    >  >> compressed, and when columns are dropped. A more automatic
>>    solution is
>>    >  >> being worked on. For more details, see:
>>    >  >>
>>    >  >> https://issues.apache.org/jira/browse/DERBY-269
>>    >  >> https://issues.apache.org/jira/browse/DERBY-3788
>>    >  >> http://db.apache.org/derby/docs/10.4/tuning/ctunstats18908.html
>>    >  >>
>>    >  >> You may be experiencing some other problem, but this is a
>>    problem that
>>    >  >> keeps coming up, so I think it's worth checking.
>>    >  >>
>>    >  >> Hope this helps,
>>    >  >>
>>    >  >> --
>>    >  >> Knut Anders
>>    >  >>
>>    >  >>
>>    >  >
>>
>>
>>
>>
>>
>

Mime
View raw message