manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Apache ManifoldCF Performance
Date Thu, 11 Sep 2014 15:47:15 GMT
Hi Paul,

Upgrade is included automatically in MCF.

If you are using the multiprocess example, just run the initialize script
and it will do the upgrade.  The single process example does any required
upgrades every time you start it.

Thanks,
Karl



On Thu, Sep 11, 2014 at 11:34 AM, Paul Boichat <paul.boichat@exonar.com>
wrote:

> Thanks - we've pulled down the branch and will test the changes. It looks
> like a branch of 1.7 so it's going to take us a little while to test. We
> need to migrate our connectors (there's some deprecated stuff that's now
> been cleared in 1.7 .eg. getShareACL) and we'll need to patch the database
> to include the pipeline and any other schema changes. We'll have some
> environment contention over the next week as our performance test
> environment needs to remain on 1.6.1 while we test a release. Once that's
> clear I'll move to 1.7
>
> On the database schema patch moving from 1.6.1 to 1.7 - is there a simple
> way to migrate and existing database?
>
> Thanks,
>
> Paul
>
>
>
>
>
> VP Engineering,
> Exonar Ltd
>
> T: +44 7940 567724
>
> twitter:@exonarco @pboichat
> W: http://www.exonar.com
> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>
> Exonar Limited, registered in the UK, registration number 06439969 at 14
> West Mills, Newbury, Berkshire, RG14 5HG
> DISCLAIMER: This email and any attachments to it may be confidential and
> are intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do not
> necessarily represent those of Exonar Ltd. If you are not the intended
> recipient of this email, you must neither take any action based upon its
> contents, nor copy or show it to anyone. Please contact the sender if you
> believe you have received this email in error.
>
> On Thu, Sep 11, 2014 at 1:27 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Thanks -- I'll include that change as well then, in ticket
>> CONNECTORS-1027.
>>
>>
>> Karl
>>
>>
>> On Thu, Sep 11, 2014 at 7:45 AM, Paul Boichat <paul.boichat@exonar.com>
>> wrote:
>>
>>> Hi,
>>>
>>> That comes back immediately with 10001 rows:
>>>
>>> explain analyze SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT
>>> 10001) t;
>>>
>>> QUERY PLAN
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>> ----------------------------------
>>>  Aggregate  (cost=544.08..544.09 rows=1 width=0) (actual
>>> time=9.125..9.125 rows=1 loops=1)
>>>    ->  Limit  (cost=0.00..419.07 rows=10001 width=0) (actual
>>> time=0.033..6.945 rows=10001 loops=1)
>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>> (cost=0.00..431189.31 rows=10290271 width=0) (actual time
>>> =0.031..3.257 rows=10001 loops=1)
>>>                Heap Fetches: 725
>>>  Total runtime: 9.157 ms
>>> (5 rows)
>>>
>>>
>>> Whereas:
>>>
>>> explain analyze SELECT count(*) FROM jobqueue limit 10001;
>>>
>>> QUERY PLAN
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------------------
>>> ----------------------------------------
>>>  Limit  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>> time=5225.107..5225.109 rows=1 loops=1)
>>>    ->  Aggregate  (cost=456922.99..456923.00 rows=1 width=0) (actual
>>> time=5225.105..5225.106 rows=1 loops=1)
>>>          ->  Index Only Scan using jobqueue_pkey on jobqueue
>>> (cost=0.00..431197.31 rows=10290271 width=0) (actual time
>>> =0.108..3090.848 rows=10370209 loops=1)
>>>                Heap Fetches: 684297
>>>  Total runtime: 5225.151 ms
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>>
>>> VP Engineering,
>>> Exonar Ltd
>>>
>>> T: +44 7940 567724
>>>
>>> twitter:@exonarco @pboichat
>>> W: http://www.exonar.com
>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>
>>> Exonar Limited, registered in the UK, registration number 06439969 at 14
>>> West Mills, Newbury, Berkshire, RG14 5HG
>>> DISCLAIMER: This email and any attachments to it may be confidential
>>> and are intended solely for the use of the individual to whom it is
>>> addressed. Any views or opinions expressed are solely those of the author
>>> and do not necessarily represent those of Exonar Ltd. If you are not
>>> the intended recipient of this email, you must neither take any action
>>> based upon its contents, nor copy or show it to anyone. Please contact
>>> the sender if you believe you have received this email in error.
>>>
>>> On Thu, Sep 11, 2014 at 12:25 PM, Karl Wright <daddywri@gmail.com>
>>> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Could you try this query on your database please and tell me if it
>>>> executes promptly:
>>>>
>>>> SELECT count(*) FROM (SELECT 'x' FROM jobqueue LIMIT 10001) t
>>>>
>>>>
>>>> I vaguely remember that I had to change the form of this query in order
>>>> to support MySQL -- but first let's see if this helps.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 6:01 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> I've created a ticket (CONNECTORS-1027) and a trunk-based branch
>>>>> (branches/CONNECTORS-1027) for looking at any changes we do for large-scale
>>>>> Postgresql optimization work.
>>>>>
>>>>> Please note that trunk code already has schema changes relative to MCF
>>>>> 1.7, so you will not be able to work directly with this branch code.
 I'll
>>>>> have to create patches for whatever changes you would need to try.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 5:56 AM, Paul Boichat <paul.boichat@exonar.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We're on Postgres 9.2. I'll get the query plans and add them to the
>>>>>> doc.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>>
>>>>>>
>>>>>> VP Engineering,
>>>>>> Exonar Ltd
>>>>>>
>>>>>> T: +44 7940 567724
>>>>>>
>>>>>> twitter:@exonarco @pboichat
>>>>>> W: http://www.exonar.com
>>>>>> Nothing is secure. Now what? Exonar Raven <http://video.exonar.com/>
>>>>>>
>>>>>> Exonar Limited, registered in the UK, registration number 06439969
at 14
>>>>>> West Mills, Newbury, Berkshire, RG14 5HG
>>>>>> DISCLAIMER: This email and any attachments to it may be confidential
>>>>>> and are intended solely for the use of the individual to whom it
is
>>>>>> addressed. Any views or opinions expressed are solely those of the
author
>>>>>> and do not necessarily represent those of Exonar Ltd. If you are
not
>>>>>> the intended recipient of this email, you must neither take any action
>>>>>> based upon its contents, nor copy or show it to anyone. Please
>>>>>> contact the sender if you believe you have received this email in
error.
>>>>>>
>>>>>> On Thu, Sep 11, 2014 at 10:51 AM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Can you include the logged plan for this query; this is an actual
>>>>>>> query encountered during crawling:
>>>>>>>
>>>>>>> WARN 2014-09-05 12:43:39,897 (Worker thread '61') - Found a
>>>>>>> long-running query (596499 ms): [SELECT t0.id,t0.dochash,t0.docid
>>>>>>> FROM carrydown t1, jobqueue t0 WHERE t1.jobid=? AND t1.parentidhash=?
AND
>>>>>>> t0.dochash=t1.childidhash AND t0.jobid=t1.jobid AND t1.isnew=?]
>>>>>>>
>>>>>>>
>>>>>>> These queries are all from the UI; it is what gets generated
when no
>>>>>>> limits are in place:
>>>>>>>
>>>>>>>  WARN 2014-09-05 12:33:47,445 (http-apr-8081-exec-2) - Found
a
>>>>>>> long-running query (166845 ms): [SELECT jobid,COUNT(dochash)
AS doccount
>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>  WARN 2014-09-05 12:33:47,908 (http-apr-8081-exec-3) - Found
a
>>>>>>> long-running query (107222 ms): [SELECT jobid,COUNT(dochash)
AS doccount
>>>>>>> FROM jobqueue t1 GROUP BY jobid]
>>>>>>>
>>>>>>> This query is from the UI with a limit of 1000000:
>>>>>>>
>>>>>>> WARN 2014-09-05 12:33:45,390 (http-apr-8081-exec-10) - Found
a
>>>>>>> long-running query (254851 ms): [SELECT COUNT(dochash) AS doccount
FROM
>>>>>>> jobqueue t1 LIMIT 1000001]
>>>>>>>
>>>>>>> I honestly don't understand why PostgreSQL would execute a
>>>>>>> sequential scan of the entire table when given a limit clause.
 It
>>>>>>> certainly didn't used to do that.  If you have any other suggestions
please
>>>>>>> let me know.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Some queries show up in this list because MCF periodically reindexes
>>>>>>> tables.  For example, this query goes only against the (small)
jobs table.
>>>>>>> Its poor performance on occasion is likely due to something else
happening
>>>>>>> to the database, probably a reindex:
>>>>>>>
>>>>>>>  WARN 2014-09-05 12:43:40,404 (Finisher thread) - Found a
>>>>>>> long-running query (592474 ms): [SELECT id FROM jobs WHERE status
IN
>>>>>>> (?,?,?,?,?) FOR UPDATE]
>>>>>>>
>>>>>>>
>>>>>>> The final query is the document stuffing query, which is perhaps
the
>>>>>>> most critical query in the whole system:
>>>>>>>
>>>>>>>  SELECT
>>>>>>>  t0.id
>>>>>>> ,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,
>>>>>>>  t0.priorityset FROM jobqueue t0
>>>>>>>  WHERE t0.status IN ('P','G')  AND t0.checkaction='R' AND
>>>>>>> t0.checktime
>>>>>>>  <= 1407246846166
>>>>>>>  AND EXISTS (
>>>>>>>    SELECT 'x' FROM jobs t1
>>>>>>>    WHERE t1.status  IN ('A','a')  AND t1.id=t0.jobid  AND
>>>>>>> t1.priority=5
>>>>>>>  )
>>>>>>>  AND NOT EXISTS (
>>>>>>>    SELECT 'x' FROM jobqueue t2
>>>>>>>    WHERE t2.dochash=t0.dochash AND t2.status IN
>>>>>>>  ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid
>>>>>>>  )
>>>>>>>  AND NOT EXISTS (
>>>>>>>    SELECT 'x' FROM prereqevents t3,events t4
>>>>>>>    WHERE t0.id=t3.owner AND t3.eventname=t4.name
>>>>>>>  )
>>>>>>>  ORDER BY t0.docpriority ASC
>>>>>>>  LIMIT 480;
>>>>>>>
>>>>>>> Your analysis of whether IN beats OR does not agree with experiments
>>>>>>> I did on postgresql 8.7 which showed no difference.  What Postgresql
>>>>>>> version are you using?  Also, I trust you have query plans that
demonstrate
>>>>>>> your claim?  In any case, whether IN vs. OR is generated is a
function of
>>>>>>> the MCF database driver, so this is trivial to experiment with.
 I'll
>>>>>>> create a ticket and a branch for experimentation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 11, 2014 at 5:32 AM, Paul Boichat <
>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> Changing maxcountstatus to something much smaller (10,000)
doesn't
>>>>>>>> seem to buy us that much on the table scan - in the attached
you'll see
>>>>>>>> that it's still taking a long time to return the job status
page. Also in
>>>>>>>> the attached are some sample other long running queries that
we're
>>>>>>>> beginning to see more frequently. There's also an example
of a query that's
>>>>>>>> frequently executed and regularly takes > 4 secs (plus
a suggested change
>>>>>>>> to improve performance). This one in particular would certainly
benefit
>>>>>>>> from a change to SSDs which should relieve the I/O bound
bottleneck on
>>>>>>>> postgres.
>>>>>>>>
>>>>>>>> We're loading the system from 10mil towards 100mil so would
be keen
>>>>>>>> to work with you to optimise where possible.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Paul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> VP Engineering,
>>>>>>>> Exonar Ltd
>>>>>>>>
>>>>>>>> T: +44 7940 567724
>>>>>>>>
>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>> W: http://www.exonar.com
>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>> <http://video.exonar.com/>
>>>>>>>>
>>>>>>>> Exonar Limited, registered in the UK, registration number
06439969
>>>>>>>> at 14 West Mills, Newbury, Berkshire, RG14 5HG
>>>>>>>> DISCLAIMER: This email and any attachments to it may be
>>>>>>>> confidential and are intended solely for the use of the individual
to whom
>>>>>>>> it is addressed. Any views or opinions expressed are solely
those of the
>>>>>>>> author and do not necessarily represent those of Exonar Ltd.
If
>>>>>>>> you are not the intended recipient of this email, you must
neither take any
>>>>>>>> action based upon its contents, nor copy or show it to anyone.
Please
>>>>>>>> contact the sender if you believe you have received this
email in error.
>>>>>>>>
>>>>>>>> On Wed, Sep 10, 2014 at 6:34 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Paul,
>>>>>>>>>
>>>>>>>>> The jobstatus query that uses count(*) should be doing
something
>>>>>>>>> like this when the maxdocumentstatuscount value is set:
>>>>>>>>>
>>>>>>>>> select count(*) from jobqueue where xxx limit 500001
>>>>>>>>>
>>>>>>>>> This will still do a sequential scan, but it will be
an aborted
>>>>>>>>> one, so you can control the maximum amount of time spent
doing the query.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 10, 2014 at 1:23 PM, Paul Boichat <
>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We've had a play with maxstatuscount and couldn't
stop it from
>>>>>>>>>> count(*)-ing but I'll certainly have another look
to see if we've missed
>>>>>>>>>> something.
>>>>>>>>>>
>>>>>>>>>> We're increasingly seeing long running threads and
I'll put
>>>>>>>>>> together some samples. As an example, on a job that's
currently aborting:
>>>>>>>>>>
>>>>>>>>>> WARN 2014-09-10 18:37:29,900 (Job reset thread) -
Found a
>>>>>>>>>> long-running query (72902 ms): [UPDATE jobqueue SET
>>>>>>>>>> docpriority=?,priorityset=NULL WHERE jobid=?]
>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread)
-   Parameter 0:
>>>>>>>>>> '1.000000001E9'
>>>>>>>>>>  WARN 2014-09-10 18:37:29,900 (Job reset thread)
-   Parameter 1:
>>>>>>>>>> '1407144048075'
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread)
-  Plan: Update
>>>>>>>>>> on jobqueue  (cost=18806.08..445770.39 rows=764916
width=287)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread)
-  Plan:   ->
>>>>>>>>>> Bitmap Heap Scan on jobqueue  (cost=18806.08..445770.39
rows=764916
>>>>>>>>>> width=287)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread)
-  Plan:
>>>>>>>>>> Recheck Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread)
-  Plan:
>>>>>>>>>> ->  Bitmap Index Scan on i1392985450177  (cost=0.00..18614.85
rows=764916
>>>>>>>>>> width=0)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread)
-
>>>>>>>>>> Plan:               Index Cond: (jobid = 1407144048075::bigint)
>>>>>>>>>>  WARN 2014-09-10 18:37:29,960 (Job reset thread)
-
>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread)
-  Stats:
>>>>>>>>>> n_distinct=4.0 most_common_vals={G,C,Z,P}
>>>>>>>>>> most_common_freqs={0.40676665,0.36629999,0.16606666,0.060866665}
>>>>>>>>>>  WARN 2014-09-10 18:37:30,140 (Job reset thread)
-
>>>>>>>>>>
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> VP Engineering,
>>>>>>>>>> Exonar Ltd
>>>>>>>>>>
>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>
>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>
>>>>>>>>>> Exonar Limited, registered in the UK, registration
number
>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire, RG14
5HG
>>>>>>>>>> DISCLAIMER: This email and any attachments to it
may be
>>>>>>>>>> confidential and are intended solely for the use
of the individual to whom
>>>>>>>>>> it is addressed. Any views or opinions expressed
are solely those of the
>>>>>>>>>> author and do not necessarily represent those of
Exonar Ltd. If
>>>>>>>>>> you are not the intended recipient of this email,
you must neither take any
>>>>>>>>>> action based upon its contents, nor copy or show
it to anyone. Please
>>>>>>>>>> contact the sender if you believe you have received
this email in error.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 10, 2014 at 6:14 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> For the jobqueue scans from the UI, there is
a parameter you can
>>>>>>>>>>> set which limits the number of documents counted
to at most a specified
>>>>>>>>>>> amount.  This uses a limit clause, which should
prevent unbounded time
>>>>>>>>>>> doing these kinds of queries:
>>>>>>>>>>>
>>>>>>>>>>> org.apache.manifoldcf.ui.maxstatuscount
>>>>>>>>>>>
>>>>>>>>>>> The documentation says that the default value
for this parameter
>>>>>>>>>>> is 10000, which however is incorrect.  The actual
true default is 500000.
>>>>>>>>>>> You could set that lower for better UI performance
(losing some
>>>>>>>>>>> information, of course.)
>>>>>>>>>>>
>>>>>>>>>>> As for long-running queries, a lot of time and
effort has been
>>>>>>>>>>> spent in MCF to insure that this doesn't happen.
 Specifically, the main
>>>>>>>>>>> document queuing query is structured to read
directly out of a specific
>>>>>>>>>>> jobqueue index.  This is the crucial query that
must work properly for
>>>>>>>>>>> scalability, since doing a query that is effectively
just a sort on the
>>>>>>>>>>> entire jobqueue would be a major problem.  There
are some times where
>>>>>>>>>>> Postgresql's optimizer fails to do the right
thing here, mostly because it
>>>>>>>>>>> makes a huge distinction between whether there's
zero of something or one
>>>>>>>>>>> of something, but you can work around that particular
issue by setting the
>>>>>>>>>>> analyze count to 1 if you start to see this problem
-- which basically
>>>>>>>>>>> means that reanalysis of the table has to occur
on every stuffing query.
>>>>>>>>>>>
>>>>>>>>>>> I'd appreciate seeing the queries that are long-running
in your
>>>>>>>>>>> case so that I can see if that is what you are
encountering or not.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 10, 2014 at 1:01 PM, Paul Boichat
<
>>>>>>>>>>> paul.boichat@exonar.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>
>>>>>>>>>>>> We're beginning to see issues with a document
count > 10
>>>>>>>>>>>> million. At that point, even with good postgres
vacuuming the
>>>>>>>>>>>> jobqueue table is starting to become a bottleneck.
>>>>>>>>>>>>
>>>>>>>>>>>> For example select count(*) from jobqueue,
which is executed
>>>>>>>>>>>> when querying job status will do a full table
scan of jobqueue
>>>>>>>>>>>> which has more than 10 million rows. That's
going to take some time in
>>>>>>>>>>>> postgres.
>>>>>>>>>>>>
>>>>>>>>>>>> SSDs will certainly make a big difference
to document
>>>>>>>>>>>> processing through-put (which we see is largely
I/O bound in
>>>>>>>>>>>> postgres) but we are increasingly seeing
long running queries
>>>>>>>>>>>> in the logs. Our current thinking is that
we'll need to refactor
>>>>>>>>>>>> JobQueue somewhat to optimise queries and,
potentially
>>>>>>>>>>>> partition jobqueue into a subset of tables
(table per queue
>>>>>>>>>>>> for example).
>>>>>>>>>>>>
>>>>>>>>>>>> Paul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> VP Engineering,
>>>>>>>>>>>> Exonar Ltd
>>>>>>>>>>>>
>>>>>>>>>>>> T: +44 7940 567724
>>>>>>>>>>>>
>>>>>>>>>>>> twitter:@exonarco @pboichat
>>>>>>>>>>>> W: http://www.exonar.com
>>>>>>>>>>>> Nothing is secure. Now what? Exonar Raven
>>>>>>>>>>>> <http://video.exonar.com/>
>>>>>>>>>>>>
>>>>>>>>>>>> Exonar Limited, registered in the UK, registration
number
>>>>>>>>>>>> 06439969 at 14 West Mills, Newbury, Berkshire,
RG14 5HG
>>>>>>>>>>>> DISCLAIMER: This email and any attachments
to it may be
>>>>>>>>>>>> confidential and are intended solely for
the use of the individual to whom
>>>>>>>>>>>> it is addressed. Any views or opinions expressed
are solely those of the
>>>>>>>>>>>> author and do not necessarily represent those
of Exonar Ltd. If
>>>>>>>>>>>> you are not the intended recipient of this
email, you must neither take any
>>>>>>>>>>>> action based upon its contents, nor copy
or show it to anyone. Please
>>>>>>>>>>>> contact the sender if you believe you have
received this email in error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 10, 2014 at 3:15 PM, Karl Wright
<
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Baptiste,
>>>>>>>>>>>>>
>>>>>>>>>>>>> ManifoldCF is not limited by the number
of agents processes or
>>>>>>>>>>>>> parallel connectors.  Overall database
performance is the limiting factor.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would read this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, there's a section in ManifoldCF
(I believe Chapter 2)
>>>>>>>>>>>>> that discusses this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some five years ago, I successfully crawled
5 million web
>>>>>>>>>>>>> documents, using Postgresql 8.3.  Postgresql
9.x is faster, and with modern
>>>>>>>>>>>>> SSD's, I expect that you will do even
better.  In general, I'd say it was
>>>>>>>>>>>>> fine to shoot for 10M - 100M documents
on ManifoldCF, provided that you use
>>>>>>>>>>>>> a good database, and provided that you
maintain it properly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 10, 2014 at 10:07 AM, Baptiste
Berthier <
>>>>>>>>>>>>> ba.berthier@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to know what is the
maximum number of documents
>>>>>>>>>>>>>> that you managed to crawl with ManifoldCF
and with how many connectors in
>>>>>>>>>>>>>> parallel it could works ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your answer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Baptiste
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message