manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Farzad Valad" <ho...@farzad.net>
Subject RE: Performance Problems Continue
Date Wed, 17 Nov 2010 05:19:58 GMT
Well... It worked!  Version 8.4.5 is it.  I ripped through 306944 items at
57 docs/sec (1:29:26).  There are two other systems in my test environment,
one with solid state drives.  I'll upgrade those and let you know the final
results.  Perhaps we can publish these results including yours on the
manifoldcf site.  A question I have what should I increase to improve more,
there was room in both the CPU and disk activities.

Also noticed from another email chain, talks about the user comments in the
old wiki.  I believe I contributed a lot to the user comments mentioned.
There are some good faq items, I can compile a list for review and
inclusion, then you can discard the rest of the comments.  Would that help
solve that problem?

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com] 
Sent: Tuesday, November 16, 2010 8:43 PM
To: connectors-user@incubator.apache.org
Subject: Re: Performance Problems Continue

I've done a number of runs, added documents willy-nilly, and brought
the document count up to about 50,000.  I am still seeing reasonable
plans for this query:

dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
EXISTS(
SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
('1C33AEE63094C42C9537A9D94E8
CB2B903764190') AND t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
t1.
jobid=1289847478234 AND t1.isnew='B');
                                                                QUERY PLAN

----------------------------------------------------------------------------
---------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..16.63 rows=1 width=127)
   Join Filter: ((t0.dochash)::text = (t1.childidhash)::text)
   ->  Index Scan using i1289954442565 on jobqueue t0
(cost=0.00..8.35 rows=1 width=127)
         Index Cond: (jobid = 1289847478234::bigint)
   ->  Index Scan using i1289954442546 on carrydown t1
(cost=0.00..8.27 rows=1 width=98)
         Index Cond: ((t1.jobid = 1289847478234::bigint) AND
((t1.parentidhash)::text =
'1C33AEE63094C42C9537A9D94E8CB2B903764190'::text))
         Filter: (t1.isnew = 'B'::bpchar)
(7 rows)


So I suggest you upgrade to 8.4.4, and let me know what happens then.

Thanks,
Karl


On Tue, Nov 16, 2010 at 8:15 PM, Farzad Valad <hokie@farzad.net> wrote:
> Sounds perfect!
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Tuesday, November 16, 2010 5:12 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: Performance Problems Continue
>
> Turns out my memory is incorrect - the postgresql version that I've
> been using is 8.4.4, not 8.3.9.
> So there are a lot of possibilities here.
>
> What I'm going to try is to install the latest 8.4.5 on a different
> system, and do a large crawl.  Then I'll see if the plan is OK.  If it
> is OK, then I'll ask you to upgrade to that version as well, and see
> how you do.
>
> Does this sound reasonable?
> Karl
>
>
> On Tue, Nov 16, 2010 at 4:23 PM, Karl Wright <daddywri@gmail.com> wrote:
>> I'm concerned that if I install postgresql 8.3.12, and do not
>> reproduce your problem, we'll have learned nothing.
>> I'll look around to see if I have the 8.3.9 installer still around...
>>
>> Karl
>>
>>
>> On Tue, Nov 16, 2010 at 2:37 PM,  <hokie@farzad.net> wrote:
>>> Can't find the windows binaries on the web.  I found this link, but it
>>> keeps saying file not found regardless of the mirror server I use.
>>>
>>>
>
http://wwwmaster.postgresql.org/download/mirrors-ftp/binary/v8.3.9/win32/pos
> tgresql-8.3.9-1.zip
>>>
>>> Seems like the fastest path would be for you to go up to .12, especially
>>> that is the version anyone can get their hands on.  Do you have the
>>> windows binaries or installer?
>>>
>>> Thanks,Farzad.
>>>
>>>
>>> Hi Farzad,
>>> The index it should be using but isn't is:
>>>
>>> public     | jobqueue  | i1289847375558 |            | CREATE UNIQUE
>>> INDEX i1289847375558 ON jobqueue USING btree (dochash, jobid)
>>>
>>> Both the dochash value and the jobid value are specified for the
>>> jobqueue table in that query, so there's no reason it should not use
>>> this index.  It's either failed to consider it at all, or it has
>>> concluded that it's more efficient to scan the entire jobqueue table
>>> rather than using the index for the lookup.
>>>
>>> FWIW, I did run some explains on similar queries on my test system
>>> this morning before I left for work, and the plans for those involved
>>> something called an "index semi-join" which apparently means it scans
>>> both the jobqueue index and the carrydown index at the same time.  If
>>> I recall correctly, my version of PostgreSQL was 8.3.9.  Since yours
>>> is 8.3.12 it could well be that something was broken in PostgreSQL
>>> between releases.  If you have time, you might try to see if an
>>> earlier version of postgresql exhibits the same behavior for you.  If
>>> that works properly, the next step would be to create a ticket against
>>> PostgreSQL itself - or maybe there already is one.
>>>
>>> I expect that any of the many PostgreSQL books out there would have
>>> enough information in it to bring you up to speed.  The skill set for
>>> optimizing queries is also portable to other databases such as Oracle,
>>> if you care.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Nov 16, 2010 at 12:03 PM,  <hokie@farzad.net> wrote:
>>>> Found a query statement to pull the index data.  Is this what you are
>>>> looking for?  On a separate note, I'm a bit over my head in terms of
>>>> understanding, any recommendation of a site/sites or book to read to
>>>> catch
>>>> up with needed db concepts?
>>>>
>>>> dbname=> select * from pg_indexes where tablename='jobqueue';
>>>>  schemaname | tablename |   indexname    | tablespace |
>>>>              indexdef
>>>>
>
------------+-----------+----------------+------------+---------------------
> ----
>>>> -------------------------------------------------------------
>>>>  public     | jobqueue  | jobqueue_pkey  |            | CREATE
UNIQUE
>>>> INDEX jobq
>>>> ueue_pkey ON jobqueue USING btree (id)
>>>>  public     | jobqueue  | i1289847375565 |            | CREATE
INDEX
>>>> i1289847375
>>>> 565 ON jobqueue USING btree (jobid, status)
>>>>  public     | jobqueue  | i1289847375564 |            | CREATE
INDEX
>>>> i1289847375
>>>> 564 ON jobqueue USING btree (jobid, isseed)
>>>>  public     | jobqueue  | i1289847375563 |            | CREATE
INDEX
>>>> i1289847375
>>>> 563 ON jobqueue USING btree (jobid, dochash, status)
>>>>  public     | jobqueue  | i1289847375562 |            | CREATE
INDEX
>>>> i1289847375
>>>> 562 ON jobqueue USING btree (status)
>>>>  public     | jobqueue  | i1289847375561 |            | CREATE
INDEX
>>>> i1289847375
>>>> 561 ON jobqueue USING btree (checkaction, checktime, status)
>>>>  public     | jobqueue  | i1289847375560 |            | CREATE
INDEX
>>>> i1289847375
>>>> 560 ON jobqueue USING btree (priorityset, status)
>>>>  public     | jobqueue  | i1289847375559 |            | CREATE
INDEX
>>>> i1289847375
>>>> 559 ON jobqueue USING btree (docpriority)
>>>>  public     | jobqueue  | i1289847375558 |            | CREATE
UNIQUE
>>>> INDEX i128
>>>> 9847375558 ON jobqueue USING btree (dochash, jobid)
>>>> (9 rows)
>>>>
>>>> dbname=> select * from pg_indexes where tablename='carrydown';
>>>>  schemaname | tablename |   indexname    | tablespace |
>>>>                               indexdef
>>>>
>>>>
>
------------+-----------+----------------+------------+---------------------
> ----
>>>>
>
----------------------------------------------------------------------------
> ----
>>>> ----------------
>>>>  public     | carrydown | i1289847375547 |            | CREATE INDEX
>>>> i1289847375
>>>> 547 ON carrydown USING btree (jobid, parentidhash)
>>>>  public     | carrydown | i1289847375546 |            | CREATE INDEX
>>>> i1289847375
>>>> 546 ON carrydown USING btree (jobid, childidhash, dataname)
>>>>  public     | carrydown | i1289847375545 |            | CREATE INDEX
>>>> i1289847375
>>>> 545 ON carrydown USING btree (jobid, childidhash, isnew)
>>>>  public     | carrydown | i1289847375544 |            | CREATE UNIQUE
>>>> INDEX i128
>>>> 9847375544 ON carrydown USING btree (jobid, parentidhash, childidhash,
>>>> dataname,
>>>>  datavaluehash)
>>>> (4 rows)
>>>>
>>>> dbname=>
>>>>
>>>> Farzad.
>>>>
>>>>
>>>> psql usually displays the indexes for a table when you dt the table.
>>>> These are not appearing in your postgresql version.  But it's clear
>>>> they are there from the plan.
>>>>
>>>> The jobqueue table is being sequentially scanned *despite* the
>>>> probable fact that there's a reasonable, usable index on it that it
>>>> should be using.  In short, this looks like a PostgreSQL bug to me.
>>>>
>>>> I'm away from my test machine right now, but I'd love to see the
>>>> indexes on the jobqueue table that you have.  There may be a new
>>>> command in psql that my version of PostgreSQL doesn't have that prints
>>>> the indexes for a table, if they took it out of dt.
>>>>
>>>> If the proper index is there, then all we can do is try various forms
>>>> of the query to see if we can dodge the PostgreSQL bug.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Nov 16, 2010 at 9:31 AM, Farzad Valad <hokie@farzad.net> wrote:
>>>>> Here is the output:
>>>>>
>>>>>
>>>>> C:\Program Files (x86)\PostgreSQL\8.3\bin>psql -U manifoldcf dbname
>>>>> Password for user manifoldcf:
>>>>> Welcome to psql 8.3.12, the PostgreSQL interactive terminal.
>>>>>
>>>>> Type:  \copyright for distribution terms
>>>>>       \h for help with SQL commands
>>>>>       \? for help with psql commands
>>>>>       \g or terminate with semicolon to execute query
>>>>>       \q to quit
>>>>>
>>>>> Warning: Console code page (437) differs from Windows code page (1252)
>>>>>         8-bit characters might not work correctly. See psql reference
>>>>>         page "Notes for Windows users" for details.
>>>>>
>>>>> dbname=> \dt jobqueue;
>>>>>           List of relations
>>>>>  Schema |   Name   | Type  |   Owner
>>>>> --------+----------+-------+------------
>>>>>  public | jobqueue | table | manifoldcf
>>>>> (1 row)
>>>>>
>>>>> dbname=> \dt carrydown;
>>>>>            List of relations
>>>>>  Schema |   Name    | Type  |   Owner
>>>>> --------+-----------+-------+------------
>>>>>  public | carrydown | table | manifoldcf
>>>>> (1 row)
>>>>>
>>>>> dbname=> select count(*) from carrydown;
>>>>>  count
>>>>> -------
>>>>>     0
>>>>> (1 row)
>>>>>
>>>>> dbname=> select distinct jobid from carrydown;
>>>>>  jobid
>>>>> -------
>>>>> (0 rows)
>>>>>
>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>>>> WHERE
>>>>> EXISTS(
>>>>> SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> t1.childidhash=t0.doch
>>>>> ash AND t0.jobid=1289847478234 AND
>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>>     QUERY PLAN
>>>>>
>>>>>
>
----------------------------------------------------------------------------
>>>>> ----
>>>>>
>
----------------------------------------------------------------------------
>>>>> ----
>>>>> --------------------
>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2000133.07 rows=118994
width=158)
>>>>>   Filter: (subplan)
>>>>>   SubPlan
>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>           ->  Index Scan using i1289847375544 on carrydown t1
>>>>> (cost=0.00..8.28
>>>>>  rows=1 width=0)
>>>>>                 Index Cond: ((jobid = 1289847478234::bigint)
AND
>>>>> ((parentidhash
>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>> ((childidhash)::
>>>>> text = ($0)::text))
>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>> (8 rows)
>>>>>
>>>>> dbname=> analyze carrydown;
>>>>> ANALYZE
>>>>> dbname=> analyze jobqueue;
>>>>> ANALYZE
>>>>> dbname=> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0
>>>>> WHERE
>>>>> dbname-> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> dbname(> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> dbname(> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>> dbname(> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>>     QUERY PLAN
>>>>>
>>>>>
>
----------------------------------------------------------------------------
>>>>> ----
>>>>>
>
----------------------------------------------------------------------------
>>>>> ----
>>>>> --------------------
>>>>>  Seq Scan on jobqueue t0  (cost=0.00..2091946.86 rows=124532
width=158)
>>>>>   Filter: (subplan)
>>>>>   SubPlan
>>>>>     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>           ->  Index Scan using i1289847375544 on carrydown t1
>>>>> (cost=0.00..8.28
>>>>>  rows=1 width=0)
>>>>>                 Index Cond: ((jobid = 1289847478234::bigint)
AND
>>>>> ((parentidhash
>>>>> )::text = 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>> ((childidhash)::
>>>>> text = ($0)::text))
>>>>>                 Filter: (isnew = 'B'::bpchar)
>>>>> (8 rows)
>>>>>
>>>>> dbname=>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>>>> Sent: Tuesday, November 16, 2010 4:21 AM
>>>>> To: connectors-user@incubator.apache.org
>>>>> Subject: Re: Performance Problems Continue
>>>>>
>>>>> As I suspected, the plan it is generating is crap:
>>>>>
>>>>> Found a query that took more than a minute (62494 ms): [SELECT
>>>>> t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE EXISTS(SELECT 'x'
>>>>> FROM carrydown t1 WHERE parentidhash IN (?) AND
>>>>> t1.childidhash=t0.dochash AND t0.jobid=? AND t1.jobid=? AND
>>>>> t1.isnew=?)]
>>>>>  Parameter 0: 'B0C2E57717919821A7C32D9F92F9F297155B727F'
>>>>>  Parameter 1: '1289847478234'
>>>>>  Parameter 2: '1289847478234'
>>>>>  Parameter 3: 'B'
>>>>>  Plan: Seq Scan on jobqueue t0  (cost=0.00..1332448.02 rows=79465
>>>>> width=152)
>>>>>  Plan:   Filter: (subplan)
>>>>>  Plan:   SubPlan
>>>>>  Plan:     ->  Result  (cost=0.00..8.28 rows=1 width=0)
>>>>>  Plan:           One-Time Filter: ($1 = 1289847478234::bigint)
>>>>>  Plan:           ->  Index Scan using i1289847375544 on carrydown
t1
>>>>> (cost=0.00..8.28 rows=1 width=0)
>>>>>  Plan:                 Index Cond: ((jobid = 1289847478234::bigint)
>>>>> AND ((parentidhash)::text =
>>>>> 'B0C2E57717919821A7C32D9F92F9F297155B727F'::text) AND
>>>>> ((childidhash)::text = ($0)::text))
>>>>>  Plan:                 Filter: (isnew = 'B'::bpchar)
>>>>>
>>>>>
>>>>>
>>>>> ... which means that it is doing a sequential scan on the jobqueue
>>>>> table (large) and an index scan for each row on the carrydown table
>>>>> (which should be empty), rather than the other way round.  There is
a
>>>>> carrydown index that is should be using instead, which it is
>>>>> completely ignoring, which is all the more weird.
>>>>>
>>>>> Can you do the following:
>>>>>
>>>>> (1) Under the postgresql bin directory, run psql -U manifoldcf dbname
>>>>> (2) If the defaults are in place, password should be "local_pg_passwd"
>>>>> (3) Run the following commands, and cut/paste the answers for me:
>>>>>
>>>>> \dt jobqueue;
>>>>> \dt carrydown;
>>>>> select count(*) from carrydown;
>>>>> select distinct jobid from carrydown;
>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>> Once you have done that, please do the following:
>>>>>
>>>>> analyze carrydown;
>>>>> analyze jobqueue;
>>>>>
>>>>> Then, try the explain again. and send me that output too:
>>>>>
>>>>> explain SELECT t0.id,t0.dochash,t0.docid FROM jobqueue t0 WHERE
>>>>> EXISTS(SELECT 'x' FROM carrydown t1 WHERE parentidhash IN
>>>>> ('B0C2E57717919821A7C32D9F92F9F297155B727F') AND
>>>>> t1.childidhash=t0.dochash AND t0.jobid=1289847478234 AND
>>>>> t1.jobid=1289847478234 AND t1.isnew='B');
>>>>>
>>>>>
>>>>> I've tried the same commands on my system, but they show reasonable
>>>>> plans.  PostgreSQL, like all databases, develops its plans based on
>>>>> statistics it keeps for the tables in question.  The analyze command
>>>>> tells it to update those statistics, which in theory should make
>>>>> PostgreSQL immediately thereafter use a correct plan.  However, there
>>>>> may be great sensitivity to incorrect statistics for some sorts of
>>>>> query, depending on the conditions, so a small number of accumulated
>>>>> changes can cause it to start to make huge errors.  I am trying to
>>>>> determine if that's what is happening, and what those conditions are
>>>>> in your case.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Nov 15, 2010 at 9:44 PM, Farzad Valad <hokie@farzad.net>
wrote:
>>>>>> 1. Using 8.3.12, gave up on 9.x
>>>>>> 2. Yes, I do.  Attached the log file for this crawl
>>>>>> 3. Using Null Output and File System input, not using RSS
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>


Mime
View raw message