incubator-connectors-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Connectors Framework > FAQ
Date Sat, 09 Oct 2010 07:18:00 GMT
Space: Apache Connectors Framework (https://cwiki.apache.org/confluence/display/CONNECTORS)
Page: FAQ (https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ)
Comment: https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ?focusedCommentId=23340303#comment-23340303

Comment added by Karl Wright:
---------------------------------------------------------------------

Another difference I just discovered: the test set I just used has a significant percentage
of large files (>1 MB), perhaps 25% or so, while the test set in May was all machine-generated
with smaller files (25K). That too would account for some significant disk-related performance
differences.
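
To put rough numbers on that, here is a back-of-the-envelope sketch. The sizes are assumptions: the comment only says ">1 MB" and "25K", so 1 MiB is used as a lower bound for the large files and 25 KiB for everything else.

```python
# Assumed sizes; the comment gives only ">1 MB" and "25K".
LARGE_BYTES = 1024 * 1024   # lower bound for a "large" file
SMALL_BYTES = 25 * 1024     # typical machine-generated file
LARGE_FRACTION = 0.25       # "perhaps 25% or so"


def mean_bytes_per_doc(large_fraction,
                       large_bytes=LARGE_BYTES,
                       small_bytes=SMALL_BYTES):
    """Average document size for a given share of large files."""
    return large_fraction * large_bytes + (1.0 - large_fraction) * small_bytes


may_mean = mean_bytes_per_doc(0.0)             # May test set: small files only
now_mean = mean_bytes_per_doc(LARGE_FRACTION)  # current test set
ratio = now_mean / may_mean                    # bytes read per document, now vs. May

print(f"May: {may_mean:.0f} bytes/doc, now: {now_mean:.0f} bytes/doc, "
      f"ratio: {ratio:.1f}x")
```

Under these assumptions each document in the current set costs roughly eleven times as many bytes of disk reads as in May, and more if the large files are well above 1 MB.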

So I think it's fair to say that, if you are seeing 5 docs/second, another thing you should
check is whether you are crawling off the same disk your database is on, and how fast those
files can be retrieved by any means.
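
One way to check how fast the files can be retrieved, independent of the crawler, is to read them once and time it. The following is a minimal sketch, not part of the framework; the function name `measure_read_rate` and its parameters are made up for illustration.

```python
import os
import time


def measure_read_rate(root, chunk_size=1024 * 1024):
    """Read every file under root once; return (docs, bytes, seconds)."""
    docs = 0
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large files do not have to fit in memory.
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        total_bytes += len(chunk)
            except OSError:
                continue  # skip files that cannot be opened or read
            docs += 1
    elapsed = time.perf_counter() - start
    return docs, total_bytes, elapsed
```

Dividing `docs` by `elapsed` gives a docs/second ceiling for the disk alone. Note that a second run over the same files will usually be served from the operating system's cache, so only the first pass says much about the disk.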

I'm about to try the same crawl on a Dell tower that has a reasonably fast disk. Stay tuned.




In reply to a comment by Karl Wright:
With PostgreSQL, a somewhat different test set than I used in May, and a no doubt much
more fragmented disk, I am now getting some 17 documents/second here, doing a file-system
crawl to a null output. That is half of what I saw in May.

This had the following special PostgreSQL settings:
(1) 100 max connection handles
(2) 256MB shared buffers (which may well have been overkill, but that's what my PostgreSQL
setup had)
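
In postgresql.conf terms, those two settings would look roughly like the fragment below. This assumes that "max connection handles" refers to the standard `max_connections` parameter and "shared buffers" to `shared_buffers`; all other settings are left at whatever the installation already had.

```conf
# postgresql.conf (fragment; assumed mapping of the settings listed above)
max_connections = 100      # "100 max connection handles"
shared_buffers = 256MB     # may well be more than this workload needs
```

Both parameters take effect only after a server restart.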

Connection/job settings:
(1) 100 max connections for both repository and output connections.
(2) Hop filters set to "never delete unreachable documents".

The system was almost entirely I/O bound during execution, which leads me to believe that,
since the system was brand-new in May, disk fragmentation was a major factor.  I will try
to run a benchmark where the database is on a different disk than the files being crawled,
maybe today.
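
Before running such a benchmark, it can help to confirm that the database directory and the crawled files really are on different devices. A minimal sketch, assuming both locations are visible as local paths; `same_device` is a hypothetical helper, and the paths in the usage note are placeholders.

```python
import os


def same_device(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For example, `same_device("/path/to/postgresql/data", "/path/to/crawled/files")` returning True means the crawl and the database compete for the same file system. A result of False is weaker evidence: two partitions on one physical disk also report different device IDs, so the physical layout still has to be checked by hand.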

