I do not see any exceptions in the log.

Thanks.

Regards,

Shigeki

--

2012/7/31 Karl Wright <daddywri@gmail.com>

One more question: do you see any exceptions in the manifoldcf log file?

Karl
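A minimal sketch of one way to check a log for exceptions, in Python. It assumes Java-style exception names (for example "java.sql.SQLException") and "Caused by:" lines; the function names find_exceptions and scan_log are hypothetical, and the log path must be supplied by the user, since the thread does not state where the log lives.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: exceptions appear as Java class names ending in
# "Exception" or "Error", or as "Caused by:" lines in a stack trace.
EXCEPTION_PATTERN = re.compile(
    r"(\b[\w.$]*(?:Exception|Error)\b|^\s*Caused by:)"
)


def find_exceptions(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) pairs for lines that look like exceptions."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if EXCEPTION_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path: str) -> List[Tuple[int, str]]:
    """Scan a log file at a user-supplied path (no default is assumed)."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return find_exceptions(log)
```

An empty result only means that no line matched the pattern above; it does not prove that nothing went wrong during the crawl.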

On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright <daddywri@gmail.com> wrote:

> This means that we are seeing some kind of transactional integrity problem

> with MySQL. I have seen hints of this behavior before. It is not a

> difference in logic. It could be due to either MySQL bugs or subtle

> differences in how transactions work in MySQL.

>

> I will try to write a load test that uses hopcount filters in order to see

> if the problem can be reliably reproduced here. If it turns out to be a

> MySQL problem there would not be much we could do to fix the issue.

>

> Karl

>

> Sent from my Windows Phone

> ________________________________

> From: Shigeki Kobayashi

> Sent: 7/30/2012 6:36 AM

> To: user@manifoldcf.apache.org

> Subject: Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5

>

>

>>(1) Make sure that the repository connections and job definitions are

> indeed identical between MySQL and PostgreSQL.

>

> Yes, they are all the same.

>

>>(2) See if you can locate an example document that was crawled with

> PostgreSQL but not crawled with MySQL.

>

> I confirmed the documents crawled with PostgreSQL but not crawled with MySQL

> actually exist.

>

>>(3) If you create a second web connection and job under MySQL, and run

> the job to completion, does the document that was not included get

> skipped again? Or does it seem random which documents are skipped on

> each run?

>

> Ok. I created two connections and jobs with exactly the same description, and

> then

> ran the jobs to completion.

> Those runs resulted in different numbers of crawled documents (as shown in

> the attached picture).

>

> It seems the first run skipped some documents and the second run skipped

> different documents, but all the skipped docs can be located. I have no

> clue how those docs are skipped.
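To make the comparison between runs concrete, here is a minimal Python sketch. It assumes that each crawl's document URLs have been exported to a plain list (how to export them is not covered in this thread), and that one complete list, for example from the PostgreSQL crawl, serves as the reference. The function name compare_runs and the result keys are hypothetical.

```python
from typing import Dict, Iterable, Set


def compare_runs(
    reference: Iterable[str],
    run_a: Iterable[str],
    run_b: Iterable[str],
) -> Dict[str, Set[str]]:
    """Report which reference documents each run skipped.

    reference: URLs from a crawl believed to be complete (assumption).
    run_a, run_b: URLs indexed by two separate runs of the same job.
    """
    ref = set(reference)
    skipped_a = ref - set(run_a)
    skipped_b = ref - set(run_b)
    return {
        "skipped_in_a": skipped_a,
        "skipped_in_b": skipped_b,
        # Documents missed by both runs suggest a systematic cause.
        "skipped_in_both": skipped_a & skipped_b,
        # Documents missed by only one run suggest a random cause.
        "skipped_in_one_only": skipped_a ^ skipped_b,
    }
```

A large "skipped_in_one_only" set with a small "skipped_in_both" set would be consistent with documents being skipped at random rather than by any fixed rule.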

>

>

> Regards,

>

> Shigeki

>

> 2012/7/30 Karl Wright <daddywri@gmail.com>

>>

>> There should be no differences between crawling using MySQL as the

>> database and PostgreSQL, on the same version of ManifoldCF.

>>

>> We include an RSS crawling test which finds exactly the expected

>> number of documents on MySQL. This is a 100,000 document crawl.

>> There are no back-end-specific logic differences in the web connector

>> that would be expected to yield different results based on the

>> back-end database.

>>

>> If you believe you have found a difference between MySQL and

>> PostgreSQL, I suggest the following:

>>

>> (1) Make sure that the repository connections and job definitions are

>> indeed identical between MySQL and PostgreSQL.

>> (2) See if you can locate an example document that was crawled with

>> PostgreSQL but not crawled with MySQL.

>> (3) If you create a second web connection and job under MySQL, and run

>> the job to completion, does the document that was not included get

>> skipped again? Or does it seem random which documents are skipped on

>> each run?

>>

>> Thanks,

>> Karl

>>

>>

>>

>> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi

>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:

>> > Aren't there some differences in crawling logic between MySQL and

>> > PostgreSQL?

>> >

>> >

>> >

>> > I did some tests on web crawling using both MySQL and PostgreSQL.

>> >

>> >

>> >

>> >

>> >

>> > MCF0.5 running on MySQL indexed around 6000, and meanwhile MCF0.5

>> > running on

>> > PostgreSQL indexed over 12000 documents.

>> >

>> > MCF0.6 running on MySQL indexed around 6000. MCF0.4 running on

>> > PostgreSQL

>> > indexed over 12000 documents.

>> >

>> >

>> >

>> >

>> >

>> > Each number of indexed documents above is a result of first crawling

>> > after

>> > deleting indexing history from DB.

>> >

>> > It seems that changing DB affects crawling and indexing.

>> >

>> >

>> >

>> > Regards,

>> >

>> > Shigeki

>> >

>> > 2012/7/27 Karl Wright <daddywri@gmail.com>

>> >>

>> >> There was a bug fixed in the way hopcount was being computed. See

>> >> CONNECTORS-464.

>> >>

>> >> This means that fewer documents are left in the queue, but the number

>> >> of indexed documents should be the same.

>> >>

>> >> Karl

>> >>

>> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi

>> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:

>> >> >

>> >> > Hi guys.

>> >> >

>> >> >

>> >> > I wonder if anyone has ever faced the experience on web crawling that

>> >> > the

>> >> > number of crawled counts differs between MCF0.4 and MCF0.5.

>> >> >

>> >> >

>> >> > I crawled some portal sites on intranet using MCF0.4 and MCF0.5.

>> >> > MCF0.4 crawled over 12000 contents, and meanwhile, MCF0.5 crawled

>> >> > only

>> >> > around half of the contents.

>> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.

>> >> > I hope changing DB does not affect the crawling results:

>> >> >

>> >> >

>> >> > MCF0.4:

>> >> >   - Crawled Counts: 12000 and over

>> >> >   - Solr3.5

>> >> >   - PostgreSQL 9.1.3

>> >> >   - Tomcat6

>> >> >   - Max Hop on Links: 15

>> >> >   - Max Hop on Redirects: 10

>> >> >   - Include only hosts matching seeds: Checked

>> >> >   - org.apache.manifoldcf.crawler.threads: 50

>> >> >   - org.apache.manifoldcf.database.maxhandles: 100

>> >> >

>> >> >

>> >> > MCF0.5:

>> >> >   - Crawled Counts: around 6000

>> >> >   - Solr3.5

>> >> >   - MySQL5.5

>> >> >   - Tomcat6

>> >> >   - Max Hop on Links: 15

>> >> >   - Max Hop on Redirects: 10

>> >> >   - Include only hosts matching seeds: Checked

>> >> >   - org.apache.manifoldcf.crawler.threads: 50

>> >> >   - org.apache.manifoldcf.database.maxhandles: 100

>> >> >

>> >> >

>> >> > Does anyone have any ideas?

>> >> >

>> >

>> >

>> >

>> >

>> > --
