manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
Date Mon, 30 Jul 2012 01:51:01 GMT
Isn't there some difference in crawling logic between MySQL and
PostgreSQL?

I ran some web crawling tests using both MySQL and PostgreSQL.

MCF0.5 running on MySQL indexed around 6000 documents, while MCF0.5 running
on PostgreSQL indexed over 12000 documents.

MCF0.6 running on MySQL indexed around 6000 documents, and MCF0.4 running on
PostgreSQL indexed over 12000 documents.

Each document count above is the result of the first crawl after deleting
the indexing history from the database.

It seems that changing the database affects crawling and indexing.
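
One way to narrow this down would be to export the list of indexed document
URLs from each run and compare them, to see which documents the MySQL run is
missing. Below is a minimal sketch in Python, assuming each run's URLs are
available as a list of strings, one URL per entry. The function names and the
sample URLs are hypothetical and are not part of ManifoldCF.

```python
def load_urls(lines):
    """Normalize an iterable of URL strings: strip whitespace, drop blanks."""
    return {line.strip() for line in lines if line.strip()}


def diff_crawls(urls_a, urls_b):
    """Return (only_in_a, only_in_b, in_both) as sorted lists of URLs."""
    a, b = load_urls(urls_a), load_urls(urls_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two exported URL lists.
    postgresql_run = [
        "http://portal.example/a",
        "http://portal.example/b",
        "http://portal.example/c",
    ]
    mysql_run = [
        "http://portal.example/a",
        "http://portal.example/c ",  # trailing space is normalized away
    ]
    only_pg, only_my, both = diff_crawls(postgresql_run, mysql_run)
    print("only in PostgreSQL run:", only_pg)
    print("only in MySQL run:", only_my)
    print("in both runs:", len(both))
```

If the URLs missing from the MySQL run share a pattern (for example, the same
link depth or the same host), that would help show where the two crawls
diverge.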


Regards,

Shigeki

2012/7/27 Karl Wright <daddywri@gmail.com>

> There was a bug fixed in the way hopcount was being computed.  See
> CONNECTORS-464.
>
> This means that fewer documents are left in the queue, but the number
> of indexed documents should be the same.
>
> Karl
>
> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >
> > Hi guys.
> >
> >
> > I wonder if anyone has seen the number of crawled documents on web
> > crawling differ between MCF0.4 and MCF0.5.
> >
> >
> > I crawled some portal sites on our intranet using MCF0.4 and MCF0.5.
> > MCF0.4 crawled over 12000 documents, while MCF0.5 crawled only
> > around half of them.
> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
> > I hope changing the database does not affect the crawling results:
> >
> >
> > MCF0.4:
> >   - Crawled Counts: 12000 and over
> >   - Solr3.5
> >   - PostgreSQL 9.1.3
> >   - Tomcat6
> >   - Max Hop on Links: 15
> >   - Max Hop on Redirects: 10
> >   - Include only hosts matching seeds: Checked
> >   - org.apache.manifoldcf.crawler.threads: 50
> >   - org.apache.manifoldcf.database.maxhandles: 100
> >
> >
> > MCF0.5:
> >   - Crawled Counts: around 6000
> >   - Solr3.5
> >   - MySQL5.5
> >   - Tomcat6
> >   - Max Hop on Links: 15
> >   - Max Hop on Redirects: 10
> >   - Include only hosts matching seeds: Checked
> >   - org.apache.manifoldcf.crawler.threads: 50
> >   - org.apache.manifoldcf.database.maxhandles: 100
> >
> >
> > Does anyone have any ideas?
> >
>



-- 
*~~~~~~~~~~~~~~~~~~~~**~~~~*
 SoftBank Mobile Corp.
 Information Systems Division
 System Services Business Unit
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
*~~~~~~~~~~~~~~~~~~~~**~~~~*
