manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Performance Problems Continue
Date Wed, 17 Nov 2010 09:40:33 GMT
You are also probably seeing periods of time where ANALYZE or REINDEX
is being done.  Those won't go away unless/until PostgreSQL no longer
requires them to function efficiently.  There's one particular ANALYZE
operation that takes place right before every document stuffer query
is called, which I had to add for PostgreSQL 8.3.7, or else the plan
would go bad very quickly.  I'd love to get rid of it now, and it
would help overall performance if it worked too.  If you want to
experiment, it's in
modules/framework/pull-agent/src/main/java/org/apache/manifoldcf/crawler/jobs/JobManager.java,
on line 1655 in trunk.  If you comment out that line you may get
better performance, but if it was needed after all the crawl will
"stall" and you will see long-running queries start to appear in the
log at about the stall times.

Karl

On Wed, Nov 17, 2010 at 4:09 AM, Karl Wright <daddywri@gmail.com> wrote:
> How to get even more performance involves increasing worker thread
> count (and correspondingly, handle counts) until you achieve your
> goal.  :-)  I added some items to the FAQ which may interest you in
> this regard.  Beware that eventually you will start to run into other
> system limits, e.g. file handle limits on Unix (which you can increase
> with ulimit).  I am not too certain how well Windows systems do under
> persistent high load, either - you may simply not be able to hit high
> numbers of utilization there because of OS issues.  Linux I'm much
> more certain will not block you in this regard.
>
> Karl
>
> On Wed, Nov 17, 2010 at 12:19 AM, Farzad Valad <hokie@farzad.net> wrote:
>> Well... It worked!  Version 8.4.5 is it.  I ripped through 306944 items at
>> 57 docs/sec (1:29:26).  There are two other systems in my test environment,
>> one with solid state drives.  I'll upgrade those and let you know the final
>> results.  Perhaps we can publish these results including yours on the
>> manifoldcf site.  A question I have what should I increase to improve more,
>> there was room in both the CPU and disk activities.
>>
>> Also noticed from another email chain, talks about the user comments in the
>> old wiki.  I believe I contributed a lot to the user comments mentioned.
>> There are some good faq items, I can compile a list for review and
>> inclusion, then you can discard the rest of the comments.  Would that help
>> solve that problem?
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddywri@gmail.com]
>> Sent: Tuesday, November 16, 2010 8:43 PM
>> To: connectors-user@incubator.apache.org
>> Subject: Re: Performance Problems Continue
>>
>> I've done a number of runs, added documents willy-nilly, and brought
>> the document count up to about 50,000.  I am still seeing reasonable
>> plans for this query:
>>
>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>> EXISTS(
>> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>> ('1C33AEE63094C42C9537A9D94E8
>> CB2B903764190') AND t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>> t1.
>> jobid=1289847478234 AND t1.isnew='B');
>>                                                      
         QUERY PLAN
>>
>> ----------------------------------------------------------------------------
>> ---------------------------------------------------------------
>>  Nested Loop Semi Join  (cost=0.00..16.63 rows=1 width=127)
>>   Join Filter: ((t0.dochash)::text = (t1.childidhash)::text)
>>   ->  Index Scan using i1289954442565 on jobqueue t0
>> (cost=0.00..8.35 rows=1 width=127)
>>         Index Cond: (jobid = 1289847478234::bigint)
>>   ->  Index Scan using i1289954442546 on carrydown t1
>> (cost=0.00..8.27 rows=1 width=98)
>>         Index Cond: ((t1.jobid = 1289847478234::bigint) AND
>> ((t1.parentidhash)::text =
>> '1C33AEE63094C42C9537A9D94E8CB2B903764190'::text))
>>         Filter: (t1.isnew = 'B'::bpchar)
>> (7 rows)
>>
>>
>> So I suggest you upgrade to 8.4.4, and let me know what happens then.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Nov 16, 2010 at 8:15 PM, Farzad Valad <hokie@farzad.net> wrote:
>>> Sounds perfect!
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>> Sent: Tuesday, November 16, 2010 5:12 PM
>>> To: connectors-user@incubator.apache.org
>>> Subject: Re: Performance Problems Continue
>>>
>>> Turns out my memory is incorrect - the postgresql version that I've
>>> been using is 8.4.4, not 8.3.9.
>>> So there are a lot of possibilities here.
>>>
>>> What I'm going to try is to install the latest 8.4.5 on a different
>>> system, and do a large crawl.  Then I'll see if the plan is OK.  If it
>>> is OK, then I'll ask you to upgrade to that version as well, and see
>>> how you do.
>>>
>>> Does this sound reasonable?
>>> Karl
>>>
>>>
>>> On Tue, Nov 16, 2010 at 4:23 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>> I'm concerned that if I install postgresql 8.3.12, and do not
>>>> reproduce your problem, we'll have learned nothing.
>>>> I'll look around to see if I have the 8.3.9 installer still around...
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Nov 16, 2010 at 2:37 PM,  <hokie@farzad.net> wrote:
>>>>> Can't find the windows binaries on the web.  I found this link, but
it
>>>>> keeps saying file not found regardless of the mirror server I use.
>>>>>
>>>>>
>>>
>> http://wwwmaster.postgresql.org/download/mirrors-ftp/binary/v8.3.9/win32/pos
>>> tgresql-8.3.9-1.zip
>>>>>
>>>>> Seems like the fastest path would be for you to go up to .12, especially
>>>>> that is the version anyone can get their hands on.  Do you have the
>>>>> windows binaries or installer?
>>>>>
>>>>> Thanks,Farzad.
>>>>>
>>>>>
>>>>> Hi Farzad,
>>>>> The index it should be using but isn't is:
>>>>>
>>>>> public     | jobqueue  | i1289847375558 |            | CREATE
UNIQUE
>>>>> INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid)
>>>>>
>>>>> Both the dochash value and the jobid value are specified for the
>>>>> jobqueue table in that query, so there's no reason it should not use
>>>>> this index.  It's either failed to consider it at all, or it has
>>>>> concluded that it's more efficient to scan the entire jobqueue table
>>>>> rather than using the index for the lookup.
>>>>>
>>>>> FWIW, I did run some explains on similar queries on my test system
>>>>> this morning before I left for work, and the plans for those involved
>>>>> something called an "index semi-join" which apparently means it scans
>>>>> both the jobqueue index and the carrydown index at the same time.  If
>>>>> I recall correctly, my version of PostgreSQL was 8.3.9.  Since yours
>>>>> is 8.3.12 it could well be that something was broken in PostgreSQL
>>>>> between releases.  If you have time, you might try to see if an
>>>>> earlier version of postgresql exhibits the same behavior for you.  If
>>>>> that works properly, the next step would be to create a ticket against
>>>>> PostgreSQL itself - or maybe there already is one.
>>>>>
>>>>> I expect that any of the many PostgreSQL books out there would have
>>>>> enough information in it to bring you up to speed.  The skill set for
>>>>> optimizing queries is also portable to other databases such as Oracle,
>>>>> if you care.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Nov 16, 2010 at 12:03 PM,  <hokie@farzad.net> wrote:
>>>>>> Found a query statement to pull the index data.  Is this what you
are
>>>>>> looking for?  On a separate note, I'm a bit over my head in terms
of
>>>>>> understanding, any recommendation of a site/sites or book to read
to
>>>>>> catch
>>>>>> up with needed db concepts?
>>>>>>
>>>>>> dbname=> select * from pg_indexes where tablename='jobqueue';
>>>>>>  schemaname | tablename |   indexname    | tablespace |
>>>>>>              indexdef
>>>>>>
>>>
>> ------------+-----------+----------------+------------+---------------------
>>> ----
>>>>>> -------------------------------------------------------------
>>>>>>  public     | jobqueue  | jobqueue_pkey  |            |
CREATE UNIQUE
>>>>>> INDEX jobq
>>>>>> ueue_pkey ON jobqueue USING btree (id)
>>>>>>  public     | jobqueue  | i1289847375565 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 565 ON jobqueue USING btree (jobid, status)
>>>>>>  public     | jobqueue  | i1289847375564 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 564 ON jobqueue USING btree (jobid, isseed)
>>>>>>  public     | jobqueue  | i1289847375563 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 563 ON jobqueue USING btree (jobid, dochash, status)
>>>>>>  public     | jobqueue  | i1289847375562 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 562 ON jobqueue USING btree (status)
>>>>>>  public     | jobqueue  | i1289847375561 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 561 ON jobqueue USING btree (checkaction, checktime, status)
>>>>>>  public     | jobqueue  | i1289847375560 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 560 ON jobqueue USING btree (priorityset, status)
>>>>>>  public     | jobqueue  | i1289847375559 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 559 ON jobqueue USING btree (docpriority)
>>>>>>  public     | jobqueue  | i1289847375558 |            |
CREATE UNIQUE
>>>>>> INDEX i128
>>>>>> 9847375558 ON jobqueue USING btree (dochash, jobid)
>>>>>> (9 rows)
>>>>>>
>>>>>> dbname=> select * from pg_indexes where tablename='carrydown';
>>>>>>  schemaname | tablename |   indexname    | tablespace |
>>>>>>                               indexdef
>>>>>>
>>>>>>
>>>
>> ------------+-----------+----------------+------------+---------------------
>>> ----
>>>>>>
>>>
>> ----------------------------------------------------------------------------
>>> ----
>>>>>> ----------------
>>>>>>  public     | carrydown | i1289847375547 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 547 ON carrydown USING btree (jobid, parentidhash)
>>>>>>  public     | carrydown | i1289847375546 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 546 ON carrydown USING btree (jobid, childidhash, dataname)
>>>>>>  public     | carrydown | i1289847375545 |            |
CREATE INDEX
>>>>>> i1289847375
>>>>>> 545 ON carrydown USING btree (jobid, childidhash, isnew)
>>>>>>  public     | carrydown | i1289847375544 |            |
CREATE UNIQUE
>>>>>> INDEX i128
>>>>>> 9847375544 ON carrydown USING btree (jobid, parentidhash, childidhash,
>>>>>> dataname,
>>>>>>  datavaluehash)
>>>>>> (4 rows)
>>>>>>
>>>>>> dbname=>
>>>>>>
>>>>>> Farzad.
>>>>>>
>>>>>>
>>>>>> psql usually displays the indexes for a table when you dt the table.
>>>>>> These are not appearing in your postgresql version.  But it's clear
>>>>>> they are there from the plan.
>>>>>>
>>>>>> The jobqueue table is being sequentially scanned *despite* the
>>>>>> probable fact that there's a reasonable, usable index on it that
it
>>>>>> should be using.  In short, this looks like a PostgreSQL bug to
me.
>>>>>>
>>>>>> I'm away from my test machine right now, but I'd love to see the
>>>>>> indexes on the jobqueue table that you have.  There may be a new
>>>>>> command in psql that my version of PostgreSQL doesn't have that prints
>>>>>> the indexes for a table, if they took it out of dt.
>>>>>>
>>>>>> If the proper index is there, then all we can do is try various forms
>>>>>> of the query to see if we can dodge the PostgreSQL bug.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 16, 2010 at 9:31 AM, Farzad Valad <hokie@farzad.net>
wrote:
>>>>>>> Here is the output:
>>>>>>>
>>>>>>>
>>>>>>> C:\Program Files (x86)\PostgreSQL\8.3\bin>psql -U manifoldcf
dbname
>>>>>>> Password for user manifoldcf:
>>>>>>> Welcome to psql 8.3.12, the PostgreSQL interactive terminal.
>>>>>>>
>>>>>>> Type:  \copyright for distribution terms
>>>>>>>       \h for help with SQL commands
>>>>>>>       \? for help with psql commands
>>>>>>>       \g or terminate with semicolon to execute query
>>>>>>>       \q to quit
>>>>>>>
>>>>>>> Warning: Console code page (437) differs from Windows code page
(1252)
>>>>>>>         8-bit characters might not work correctly. See psql
reference
>>>>>>>         page "Notes for Windows users" for details.
>>>>>>>
>>>>>>> dbname=> \dt jobqueue;
>>>>>>>           List of relations
>>>>>>>  Schema |   Name   | Type  |   Owner
>>>>>>> --------+----------+-------+------------
>>>>>>>  public | jobqueue | table | manifoldcf
>>>>>>> (1 row)
>>>>>>>
>>>>>>> dbname=> \dt carrydown;
>>>>>>>            List of relations
>>>>>>>  Schema |   Name    | Type  |   Owner
>>>>>>> --------+-----------+-------+------------
>>>>>>>  public | carrydown | table | manifoldcf
>>>>>>> (1 row)
>>>>>>>
>>>>>>> dbname=> select count(*) from carrydown;
>>>>>>>  count
>>>>>>> -------
>>>>>>>     0
>>>>>>> (1 row)
>>>>>>>
>>>>>>> dbname=> select distinct jobid from carrydown;
>>>>>>>  jobid
>>>>>>> -------
>>>>>>> (0 rows)
>>>>>>>
>>>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue
t0
>>>>>>> WHERE
>>>>>>> EXISTS(
>>>>>>> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> t1.childidhash=t0.doch
>>>>>>> ash AND t0.jobid=1289847478234 AND
>>>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>>     QUERY PLAN
>>>>>>>
>>>>>>>
>>>
>> ----------------------------------------------------------------------------
>>>>>>> ----
>>>>>>>
>>>
>> ----------------------------------------------------------------------------
>>>>>>> ----
>>>>>>> --------------------
>>>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2000133.07 rows=118994
>> width=158)
>>>>>>>   Filter: (subplan)
>>>>>>>   SubPlan
>>>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>>>           ->  Index Scan using i1289847375544 on carrydown
t1
>>>>>>> (cost=0.00..8.28
>>>>>>>  rows=1 width=0)
>>>>>>>                 Index Cond: ((jobid = 1289847478234::bigint)
AND
>>>>>>> ((parentidhash
>>>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>>>> ((childidhash)::
>>>>>>> text = ($0)::text))
>>>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>>>> (8 rows)
>>>>>>>
>>>>>>> dbname=> analyze carrydown;
>>>>>>> ANALYZE
>>>>>>> dbname=> analyze jobqueue;
>>>>>>> ANALYZE
>>>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue
t0
>>>>>>> WHERE
>>>>>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash
IN
>>>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234
AND
>>>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>>     QUERY PLAN
>>>>>>>
>>>>>>>
>>>
>> ----------------------------------------------------------------------------
>>>>>>> ----
>>>>>>>
>>>
>> ----------------------------------------------------------------------------
>>>>>>> ----
>>>>>>> --------------------
>>>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2091946.86 rows=124532
>> width=158)
>>>>>>>   Filter: (subplan)
>>>>>>>   SubPlan
>>>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>>>           ->  Index Scan using i1289847375544 on carrydown
t1
>>>>>>> (cost=0.00..8.28
>>>>>>>  rows=1 width=0)
>>>>>>>                 Index Cond: ((jobid = 1289847478234::bigint)
AND
>>>>>>> ((parentidhash
>>>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>>>> ((childidhash)::
>>>>>>> text = ($0)::text))
>>>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>>>> (8 rows)
>>>>>>>
>>>>>>> dbname=>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>>>>>> Sent: Tuesday, November 16, 2010 4:21 AM
>>>>>>> To: connectors-user@incubator.apache.org
>>>>>>> Subject: Re: Performance Problems Continue
>>>>>>>
>>>>>>> As I suspected, the plan it is generating is crap:
>>>>>>>
>>>>>>> Found a query that took more than a minute (62494 ms): [SELECT
>>>>>>> t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT
'x'
>>>>>>> FROM carrydown t1 WHERE parentidhash IN (?) AND
>>>>>>> t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND
>>>>>>> t1.isnew=?)]
>>>>>>>  Parameter 0: 'B0C2E57717919821A7C32D9F92F9F297155B727F'
>>>>>>>  Parameter 1: '1289847478234'
>>>>>>>  Parameter 2: '1289847478234'
>>>>>>>  Parameter 3: 'B'
>>>>>>>  Plan: Seq Scan on jobqueue t0  (cost=0.00..1332448.02 rows=79465
>>>>>>> width=152)
>>>>>>>  Plan:   Filter: (subplan)
>>>>>>>  Plan:   SubPlan
>>>>>>>  Plan:     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>>>  Plan:           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>>>  Plan:           ->  Index Scan using i1289847375544
on carrydown t1
>>>>>>> (cost=0.00..8.28 rows=1 width=0)
>>>>>>>  Plan:                 Index Cond: ((jobid = 1289847478234::bigint)
>>>>>>> AND ((parentidhash)::text =
>>>>>>> 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>>>> ((childidhash)::text = ($0)::text))
>>>>>>>  Plan:                 Filter: (isnew = 'B'::bpchar)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ... which means that it is doing a sequential scan on the jobqueue
>>>>>>> table (large) and an index scan for each row on the carrydown
table
>>>>>>> (which should be empty), rather than the other way round.  There
is a
>>>>>>> carrydown index that is should be using instead, which it is
>>>>>>> completely ignoring, which is all the more weird.
>>>>>>>
>>>>>>> Can you do the following:
>>>>>>>
>>>>>>> (1) Under the postgresql bin directory, run psql -U manifoldcf
dbname
>>>>>>> (2) If the defaults are in place, password should be "local_pg_passwd"
>>>>>>> (3) Run the following commands, and cut/paste the answers for
me:
>>>>>>>
>>>>>>> \dt jobqueue;
>>>>>>> \dt carrydown;
>>>>>>> select count(*) from carrydown;
>>>>>>> select distinct jobid from carrydown;
>>>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>> Once you have done that, please do the following:
>>>>>>>
>>>>>>> analyze carrydown;
>>>>>>> analyze jobqueue;
>>>>>>>
>>>>>>> Then, try the explain again. and send me that output too:
>>>>>>>
>>>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>>>
>>>>>>>
>>>>>>> I've tried the same commands on my system, but they show reasonable
>>>>>>> plans.  PostgreSQL, like all databases, develops its plans based
on
>>>>>>> statistics it keeps for the tables in question.  The analyze
command
>>>>>>> tells it to update those statistics, which in theory should make
>>>>>>> PostgreSQL immediately thereafter use a correct plan.  However,
there
>>>>>>> may be great sensitivity to incorrect statistics for some sorts
of
>>>>>>> query, depending on the conditions, so a small number of accumulated
>>>>>>> changes can cause it to start to make huge errors.  I am trying
to
>>>>>>> determine if that's what is happening, and what those conditions
are
>>>>>>> in your case.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 15, 2010 at 9:44 PM, Farzad Valad <hokie@farzad.net>
>> wrote:
>>>>>>>> 1. Using 8.3.12, gave up on 9.x
>>>>>>>> 2. Yes, I do.  Attached the log file for this crawl
>>>>>>>> 3. Using Null Output and File System input, not using RSS
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

Mime
View raw message