manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: Too many slow queries caused by MCF running MySQL 5.5
Date Mon, 10 Dec 2012 10:27:05 GMT
Hi Karl,

Thanks for the reply.

I did EXPLAIN as following:

mysql> explain SELECT
    -> t0.id
,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
    -> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status IN ('P','G')
    -> AND t0.checkaction='R' AND
    -> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
    -> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
    -> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
t2.status
    -> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
EXISTS(SELECT
    -> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
    -> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
    -> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
+----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
| id | select_type        | table | type   | possible_keys
               | key            | key_len | ref                     | rows
  | Extra                       |
+----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+
|  1 | PRIMARY            | t0    | range  |
I1354241297073,I1354241297072,I1354241297071 | I1354241297071 | 25      |
NULL                    | 151494 | Using where; Using filesort |
|  4 | DEPENDENT SUBQUERY | t3    | ref    | I1354241297077
              | I1354241297077 | 8       | manifoldcf.t0.id        |      1
|                             |
|  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY
               | PRIMARY        | 767     | manifoldcf.t3.eventname |
 1 | Using index                 |
|  3 | DEPENDENT SUBQUERY | t2    | ref    |
I1354241297070,I1354241297073,I1354241297072 | I1354241297070 | 122     |
manifoldcf.t0.dochash   |      1 | Using where                 |
|  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1354241297080
              | PRIMARY        | 8       | manifoldcf.t0.jobid     |      1
| Using where                 |
+----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+--------+-----------------------------+


As you see "Using filesort", I do not think it uses the index.

By the way, which database do you recommend for the case of crawling  a
humongous number of files for now? PostgreSQL?


Regards,

Shigeki

2012/12/10 Karl Wright <daddywri@gmail.com>

> Since you have a large table, can you try an EXPLAIN for the following
> query, which should match the explanation given here:
> http://dev.mysql.com/doc/refman/5.5/en/order-by-optimization.html ?
> Does it use the index?
>
> SELECT
> t0.id
> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 WHERE t0.docpriority >= 0 AND t0.status IN ('P','G')
> AND t0.checkaction='R' AND
> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
> t2.status
> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT
> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200
>
> Thanks!
> Karl
>
> On Mon, Dec 10, 2012 at 2:49 AM, Karl Wright <daddywri@gmail.com> wrote:
> > Hi Shigeki,
> >
> > The rules for when a database will use an index for an ORDER BY clause
> > differ significantly from database to database.  The current logic
> > seems to satisfy PostgreSQL, HSQLDB, and Derby, but clearly not MySQL.
> >  I will see if I can find a solution.  The ticket for this
> > CONNECTORS-584.
> >
> > Karl
> >
> > On Mon, Dec 10, 2012 at 2:13 AM, Shigeki Kobayashi
> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >>
> >> Hi.
> >>
> >>
> >> I downloaded MCF1.1dev on Nov, 29th, and ran it using MySQL
> >> I tried to crawl 10 million files using Windows share connection and
> index
> >> them into Solr.
> >>
> >> As MCF reached over 1 million files, the crawling speed started getting
> >> slower.
> >> So I checked slow queries and found out that too many slow queries
> occurred,
> >> especially the following kinds:
> >>
> >> --------------------------------------------------------
> >> # Time: 121204 16:25:40
> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
> >> # Query_time: 7.240532  Lock_time: 0.000204 Rows_sent: 1200
>  Rows_examined:
> >> 611091
> >> SET timestamp=1354605940;
> >> SELECT
> >> t0.id
> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> >> FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND
> >> t0.checktime<=1354605932817 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
> >> t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT
> >> EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND
> t2.status
> >> IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT
> EXISTS(SELECT
> >> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
> >> t3.eventname=t4.name) ORDER BY t0.docpriority ASC,t0.status
> >> ASC,t0.checkaction ASC,t0.checktime ASC LIMIT 1200;
> >> # Time: 121204 16:25:44
> >> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
> >> # Query_time: 3.064339  Lock_time: 0.000084 Rows_sent: 1  Rows_examined:
> >> 406359
> >> SET timestamp=1354605944;
> >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE status IN
> >> ('P','G') AND checkaction='R' AND checktime<=1354605932817 AND
> EXISTS(SELECT
> >> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid)
>  ORDER BY
> >> docpriority ASC,status ASC,checkaction ASC,checktime ASC LIMIT 1;
> >> -------------------------------------------------------
> >>
> >> I wonder if the queries appropriately use index of the table.
> >> As a result of EXPLAIN against the slow query, there was filesort.
> >> There seems to be some conditions that MySQL does not use index
> depending on
> >> ORDER BY:
> >>  - Executing ORDER BY against multiple keys
> >>  - When keys selected from records are different from keys used by
> ORDER BY
> >>
> >> Since filesort was happening, fully scanning records should be having
> MCF
> >> slower.
> >>
> >> Do you think this could happen even in PostgreSQL or HSQLDB?
> >> Do you think queries could be modified to use index appropriately?
> >>
> >>
> >> Regards,
> >>
> >> Shigeki
>

Mime
View raw message