Karl, 


I do not see any exceptions in the log.
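
A check like this can be scripted roughly as follows. This is only a sketch under assumptions: the function names, the regular expression, and the example path `logs/manifoldcf.log` are illustrative and are not provided by ManifoldCF. The match is case-sensitive, so a bare log level such as `ERROR` is not counted unless an exception class name appears on the same line.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: exceptions show up as Java-style class names ending in
# "Exception" or "Error" (e.g. "java.sql.SQLException"), or as indented
# stack-trace frames of the form "at package.Class.method(".
EXCEPTION_PATTERN = re.compile(
    r"\b[\w.$]*(?:Exception|Error)\b|^\s+at\s+[\w.$]+\("
)


def find_exception_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that looks like an exception."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if EXCEPTION_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path: str) -> List[Tuple[int, str]]:
    """Scan one log file; the path is whatever your installation uses."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_exception_lines(handle)
```

For example, `scan_log_file("logs/manifoldcf.log")` (hypothetical path) returns an empty list when no line matches.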

Thanks.

Regards,

Shigeki

2012/7/31 Karl Wright <daddywri@gmail.com>
One more question: do you see any exceptions in the manifoldcf log file?

Karl

On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright <daddywri@gmail.com> wrote:
> This means that we are seeing some kind of transactional integrity problem
> with MySQL.  I have seen hints of this behavior before.  It is not a
> difference in logic.  It could be due to either MySQL bugs or subtle
> differences in how transactions work in MySQL.
>
> I will try to write a load test that uses hopcount filters in order to see
> if the problem can be reliably reproduced here.  If it turns out to be a
> MySQL problem there would not be much we could do to fix the issue.
>
> Karl
>
> ________________________________
> From: Shigeki Kobayashi
> Sent: 7/30/2012 6:36 AM
> To: user@manifoldcf.apache.org
> Subject: Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
>
>
>> (1) Make sure that the repository connections and job definitions are
>> indeed identical between MySQL and PostgreSQL.
>
> Yes, they are all the same.
>
>> (2) See if you can locate an example document that was crawled with
>> PostgreSQL but not crawled with MySQL.
>
> I confirmed that the documents crawled with PostgreSQL but not crawled
> with MySQL actually exist.
>
>> (3) If you create a second web connection and job under MySQL, and run
>> the job to completion, does the document that was not included get
>> skipped again?  Or does it seem random which documents are skipped on
>> each run?
>
> OK. I created two connections and jobs with exactly the same definitions,
> and then ran both jobs to completion.
> The two runs resulted in different numbers of crawled documents (as shown
> in the attached picture).
>
> It seems the first run skipped some documents and the second run skipped
> different ones, but all of the skipped documents can be located.  I have
> no clue why those documents are skipped.
>
>
> Regards,
>
> Shigeki
>
> 2012/7/30 Karl Wright <daddywri@gmail.com>
>>
>> There should be no differences between crawling using MySQL as the
>> database and PostgreSQL, on the same version of ManifoldCF.
>>
>> We include an RSS crawling test which finds exactly the expected
>> number of documents on MySQL.  This is a 100,000 document crawl.
>> There are no back-end-specific logic differences in the web connector
>> that would be expected to yield different results based on the
>> back-end database.
>>
>> If you believe you have found a difference between MySQL and
>> PostgreSQL, I suggest the following:
>>
>> (1) Make sure that the repository connections and job definitions are
>> indeed identical between MySQL and PostgreSQL.
>> (2) See if you can locate an example document that was crawled with
>> PostgreSQL but not crawled with MySQL.
>> (3) If you create a second web connection and job under MySQL, and run
>> the job to completion, does the document that was not included get
>> skipped again?  Or does it seem random which documents are skipped on
>> each run?
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi
>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> > Isn't there some difference in crawling logic between MySQL and
>> > PostgreSQL?
>> >
>> > I ran some web crawling tests using both MySQL and PostgreSQL.
>> >
>> > MCF0.5 running on MySQL indexed around 6000 documents, while MCF0.5
>> > running on PostgreSQL indexed over 12000 documents.
>> >
>> > MCF0.6 running on MySQL indexed around 6000 documents. MCF0.4 running
>> > on PostgreSQL indexed over 12000 documents.
>> >
>> > Each count above is the result of the first crawl after deleting the
>> > indexing history from the DB.
>> >
>> > It seems that changing the DB affects crawling and indexing.
>> >
>> >
>> >
>> > Regards,
>> >
>> > Shigeki
>> >
>> > 2012/7/27 Karl Wright <daddywri@gmail.com>
>> >>
>> >> There was a bug fixed in the way hopcount was being computed.  See
>> >> CONNECTORS-464.
>> >>
>> >> This means that fewer documents are left in the queue, but the number
>> >> of indexed documents should be the same.
>> >>
>> >> Karl
>> >>
>> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
>> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >> >
>> >> > Hi guys.
>> >> >
>> >> >
>> >> > I wonder if anyone has seen the number of crawled documents differ
>> >> > between MCF0.4 and MCF0.5 when web crawling.
>> >> >
>> >> >
>> >> > I crawled some portal sites on an intranet using MCF0.4 and MCF0.5.
>> >> > MCF0.4 crawled over 12000 documents, while MCF0.5 crawled only
>> >> > around half of that.
>> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
>> >> > I hope changing the DB does not affect the crawling results:
>> >> >
>> >> >
>> >> > MCF0.4:
>> >> >   - Crawled Counts: 12000 and over
>> >> >   - Solr3.5
>> >> >   - PostgreSQL 9.1.3
>> >> >   - Tomcat6
>> >> >   - Max Hop on Links: 15
>> >> >   - Max Hop on Redirects: 10
>> >> >   - Include only hosts matching seeds: Checked
>> >> >   - org.apache.manifoldcf.crawler.threads: 50
>> >> >   - org.apache.manifoldcf.database.maxhandles: 100
>> >> >
>> >> >
>> >> > MCF0.5:
>> >> >   - Crawled Counts: around 6000
>> >> >   - Solr3.5
>> >> >   - MySQL5.5
>> >> >   - Tomcat6
>> >> >   - Max Hop on Links: 15
>> >> >   - Max Hop on Redirects: 10
>> >> >   - Include only hosts matching seeds: Checked
>> >> >   - org.apache.manifoldcf.crawler.threads: 50
>> >> >   - org.apache.manifoldcf.database.maxhandles: 100
>> >> >
>> >> >
>> >> > Does anyone have any ideas?
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > ~~~~~~~~~~~~~~~~~~~~~~~~
>> >  SoftBank Mobile Corp.
>> >  Information Systems Division
>> >  System Services Business Management Department
>> >  Service Planning Department
>> >
>> >  Shigeki Kobayashi
>> >  shigeki.kobayashi3@g.softbank.co.jp
>> > ~~~~~~~~~~~~~~~~~~~~~~~~
>> >
>> >
>> >


