manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
Date Tue, 31 Jul 2012 01:05:47 GMT
Karl,


I do not see any exceptions in the log.

Thanks.

Regards,

Shigeki

2012/7/31 Karl Wright <daddywri@gmail.com>

> One more question: do you see any exceptions in the manifoldcf log file?
>
> Karl
>
> On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright <daddywri@gmail.com> wrote:
> > This means that we are seeing some kind of transactional integrity problem
> > with MySQL.  I have seen hints of this behavior before.  It is not a
> > difference in logic.  It could be due to either MySQL bugs or subtle
> > differences in how transactions work in MySQL.
> >
> > I will try to write a load test that uses hopcount filters in order to see
> > if the problem can be reliably reproduced here.  If it turns out to be a
> > MySQL problem there would not be much we could do to fix the issue.
> >
> > Karl
> >
> > Sent from my Windows Phone
> > ________________________________
> > From: Shigeki Kobayashi
> > Sent: 7/30/2012 6:36 AM
> > To: user@manifoldcf.apache.org
> > Subject: Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
> >
> >
> >> (1) Make sure that the repository connections and job definitions are
> >> indeed identical between MySQL and PostgreSQL.
> >
> > Yes, they are all the same.
> >
> >> (2) See if you can locate an example document that was crawled with
> >> PostgreSQL but not crawled with MySQL.
> >
> > I confirmed that the documents crawled with PostgreSQL but not with MySQL
> > actually exist.
> >
> >> (3) If you create a second web connection and job under MySQL, and run
> >> the job to completion, does the document that was not included get
> >> skipped again?  Or does it seem random which documents are skipped on
> >> each run?
> >
> > OK. I created two connections and jobs with exactly the same definitions,
> > and then ran the jobs to completion.
> > The two runs resulted in different numbers of crawled documents (as shown
> > in the attached picture).
> >
> > It seems the first run skipped some documents and the second run skipped
> > different documents, but all of the skipped documents do exist.  I have no
> > idea why those documents were skipped.
> >
> >
> > Regards,
> >
> > Shigeki
> >
> > 2012/7/30 Karl Wright <daddywri@gmail.com>
> >>
> >> There should be no differences between crawling using MySQL as the
> >> database and PostgreSQL, on the same version of ManifoldCF.
> >>
> >> We include an RSS crawling test which finds exactly the expected
> >> number of documents on MySQL.  This is a 100,000 document crawl.
> >> There are no back-end-specific logic differences in the web connector
> >> that would be expected to yield different results based on the
> >> back-end database.
> >>
> >> If you believe you have found a difference between MySQL and
> >> PostgreSQL, I suggest the following:
> >>
> >> (1) Make sure that the repository connections and job definitions are
> >> indeed identical between MySQL and PostgreSQL.
> >> (2) See if you can locate an example document that was crawled with
> >> PostgreSQL but not crawled with MySQL.
> >> (3) If you create a second web connection and job under MySQL, and run
> >> the job to completion, does the document that was not included get
> >> skipped again?  Or does it seem random which documents are skipped on
> >> each run?
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >>
> >> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi
> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >> > Aren't there some differences in crawling logic between MySQL and
> >> > PostgreSQL?
> >> >
> >> >
> >> >
> >> > I did some tests on web crawling using both MySQL and PostgreSQL.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > MCF0.5 running on MySQL indexed around 6000, and meanwhile MCF0.5
> >> > running on PostgreSQL indexed over 12000 documents.
> >> >
> >> > MCF0.6 running on MySQL indexed around 6000. MCF0.4 running on
> >> > PostgreSQL indexed over 12000 documents.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > Each number of indexed documents above is the result of the first
> >> > crawl after deleting the indexing history from the DB.
> >> >
> >> > It seems that changing DB affects crawling and indexing.
> >> >
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Shigeki
> >> >
> >> > 2012/7/27 Karl Wright <daddywri@gmail.com>
> >> >>
> >> >> There was a bug fixed in the way hopcount was being computed.  See
> >> >> CONNECTORS-464.
> >> >>
> >> >> This means that fewer documents are left in the queue, but the number
> >> >> of indexed documents should be the same.
> >> >>
> >> >> Karl
> >> >>
> >> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
> >> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >> >> >
> >> >> > Hi guys.
> >> >> >
> >> >> >
> >> >> > I wonder if anyone has ever seen, on web crawling, that the number
> >> >> > of crawled documents differs between MCF0.4 and MCF0.5.
> >> >> >
> >> >> >
> >> >> > I crawled some portal sites on our intranet using MCF0.4 and MCF0.5.
> >> >> > MCF0.4 crawled over 12000 documents, while MCF0.5 crawled only
> >> >> > around half of that.
> >> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
> >> >> > I hope changing the DB does not affect the crawl results. The two
> >> >> > setups were:
> >> >> >
> >> >> >
> >> >> > MCF0.4:
> >> >> >   - Crawled Counts: 12000 and over
> >> >> >   - Solr3.5
> >> >> >   - PostgreSQL 9.1.3
> >> >> >   - Tomcat6
> >> >> >   - Max Hop on Links: 15
> >> >> >   - Max Hop on Redirects: 10
> >> >> >   - Include only hosts matching seeds: Checked
> >> >> >   - org.apache.manifoldcf.crawler.threads: 50
> >> >> >   - org.apache.manifoldcf.database.maxhandles: 100
> >> >> >
> >> >> >
> >> >> > MCF0.5:
> >> >> >   - Crawled Counts: around 6000
> >> >> >   - Solr3.5
> >> >> >   - MySQL5.5
> >> >> >   - Tomcat6
> >> >> >   - Max Hop on Links: 15
> >> >> >   - Max Hop on Redirects: 10
> >> >> >   - Include only hosts matching seeds: Checked
> >> >> >   - org.apache.manifoldcf.crawler.threads: 50
> >> >> >   - org.apache.manifoldcf.database.maxhandles: 100
> >> >> >
> >> >> >
> >> >> > Does anyone have any ideas?
> >> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > ~~~~~~~~~~~~~~~~~~~~~~~~
> >> >  SoftBank Mobile Corp.
> >> >  Information Systems Division
> >> >  System Services Business Management Department
> >> >  Service Planning Department
> >> >
> >> >  Shigeki Kobayashi
> >> >  shigeki.kobayashi3@g.softbank.co.jp
> >> > ~~~~~~~~~~~~~~~~~~~~~~~~
> >> >
> >> >
> >> >



-- 
*~~~~~~~~~~~~~~~~~~~~**~~~~*
 SoftBank Mobile Corp.
 Information Systems Division
 System Services Business Management Department
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
*~~~~~~~~~~~~~~~~~~~~**~~~~*
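
For anyone repeating step (3) of Karl's checklist, the comparison between runs can be done mechanically rather than by eye. The following is only a minimal sketch, assuming the URLs indexed by each run have already been exported to plain lists (how they are exported is not covered in this thread). The function names `load_urls` and `compare_runs`, the run labels, and the sample URLs are all hypothetical and are not part of ManifoldCF.

```python
def load_urls(lines):
    """Turn an iterable of URL lines into a set, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def compare_runs(reference, runs):
    """Report which reference URLs each run failed to index.

    reference: set of URLs from the complete crawl (e.g. the PostgreSQL run)
    runs: dict mapping a run label to the set of URLs that run indexed
    """
    missed = {label: reference - urls for label, urls in runs.items()}
    if not missed:
        return missed, set(), set()
    # URLs missed by every run suggest a deterministic cause;
    # URLs missed by only some runs suggest a nondeterministic one.
    always = set.intersection(*missed.values())
    sometimes = set.union(*missed.values()) - always
    return missed, always, sometimes


if __name__ == "__main__":
    # Hypothetical sample data standing in for exported document lists.
    postgres = load_urls(["http://a/1", "http://a/2", "http://a/3", "http://a/4", ""])
    mysql_runs = {
        "mysql-run-1": load_urls(["http://a/1", "http://a/2"]),
        "mysql-run-2": load_urls(["http://a/1", "http://a/3"]),
    }
    missed, always, sometimes = compare_runs(postgres, mysql_runs)
    for label in sorted(missed):
        print(label, "missed", sorted(missed[label]))
    print("missed by every run:", sorted(always))
    print("missed by some runs only:", sorted(sometimes))
```

If the "missed by some runs only" set is large and changes from run to run, that matches the behavior described above, where two identical MySQL jobs skipped different documents.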
