manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
Date Wed, 08 Aug 2012 10:37:47 GMT
I've shortened the test so that it runs in 2 hours on PostgreSQL, but
I've run into the problem that even the PostgreSQL test produces
different counts on every run.  I've created the ticket CONNECTORS-501
to track this issue.  I'll let you know when this is resolved.  It may
be a week or two, since it looks like it will be somewhat difficult to
diagnose.
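
To see whether the same documents go missing on every run or a different set each time, two runs can be compared by set difference. The following is only a rough sketch: CrawlRunDiff is a hypothetical helper, not part of ManifoldCF or the load test, and it assumes the document identifiers (for example, URLs) from each run have already been exported to lists by some means. The identifiers in main are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of ManifoldCF: compares the document
// identifiers (for example, URLs) indexed by two crawl runs.
class CrawlRunDiff {

    // Returns the identifiers present in 'reference' but absent from 'other',
    // sorted so that successive comparisons are easy to read side by side.
    static Set<String> missingFrom(List<String> reference, List<String> other) {
        Set<String> missing = new TreeSet<>(reference);
        missing.removeAll(other);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up identifiers standing in for the documents of two runs.
        List<String> runA = Arrays.asList("http://portal/a", "http://portal/b", "http://portal/c");
        List<String> runB = Arrays.asList("http://portal/a", "http://portal/c", "http://portal/d");

        System.out.println("Only in run A: " + missingFrom(runA, runB));
        System.out.println("Only in run B: " + missingFrom(runB, runA));
    }
}
```

If both differences are non-empty and change between pairs of runs, the skipping is not tied to particular documents.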

Karl

On Sun, Aug 5, 2012 at 8:59 PM, Shigeki Kobayashi
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
> Karl,
>
> I was also testing the latest commit and it was too slow.
> I will wait until you are back.
>
> Regards,
>
> Shigeki
>
> 2012/8/3 Karl Wright <daddywri@gmail.com>
>>
>> I'm running it here and it is pretty much too slow to ever finish.
>> Mysqld is chugging for minutes at a time with little apparent
>> progress.
>>
>> I'll have to look into this further when I get back on Tuesday.
>>
>> Karl
>>
>> On Fri, Aug 3, 2012 at 5:10 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > Hi Shigeki,
>> >
>> > It turns out that the test has not been passing for me, but merely
>> > timing out.  I've increased the timeout now and committed that change.
>> >  Can you stop your test, drop your "testdb" database, and start the
>> > test over again?  A successful test will print the following before
>> > printing any shutdown or cleanup messages:
>> >
>> >     System.err.println("Crawl required "+new Long(System.currentTimeMillis()-startTime).toString()+" milliseconds");
>> >
>> > Karl
>> >
>> >
>> >
>> > On Fri, Aug 3, 2012 at 5:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>> >> If the test starts to clean up and then hangs, I believe that means
>> >> that it passed.  There is also a problem with the test cleanup code
>> >> which is unrelated that I need to look at.
>> >>
>> >> Thanks,
>> >> Karl
>> >>
>> >> On Fri, Aug 3, 2012 at 3:23 AM, Shigeki Kobayashi
>> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>> Hi Karl,
>> >>>
>> >>> I figured out where to put the MySQL driver. I put
>> >>> mysql-connector-5.x.x.jar in MCF_HOME/lib_proprietary/, and then the
>> >>> error was resolved.
>> >>>
>> >>> I also had to modify
>> >>>
>> >>> MCF_HOME/framework/core/src/test/java/org/apache/manifoldcf/core/tests/BaseMySQL.java
>> >>>  to change the root password for MySQL.
>> >>>
>> >>> I still get a warning saying "Preclean failed: Error getting connection:
>> >>> Access denied for user 'testuser'@'localhost' (using password: YES)".
>> >>> I don't know how to fix this.  Am I supposed to set a password for
>> >>> 'testuser'?
>> >>>
>> >>> The program seems to be running at the mcf-test-build.run-load-mysql
>> >>> phase.
>> >>>
>> >>> I will let you know when it's done.
>> >>>
>> >>>
>> >>> Thanks
>> >>>
>> >>>
>> >>> Regards,
>> >>>
>> >>> Shigeki
>> >>>
>> >>> 2012/8/3 Shigeki Kobayashi <shigeki.kobayashi3@g.softbank.co.jp>
>> >>>>
>> >>>> Hi Karl,
>> >>>>
>> >>>> I executed the following:
>> >>>>
>> >>>> ant run-webcrawler-loadtests-mysql
>> >>>>
>> >>>> I received an error saying "Unable to load database driver:
>> >>>> com.mysql.jdbc.Driver"
>> >>>>
>> >>>> I suppose I have to put mysql-connector-5.x.x.jar somewhere in order
>> >>>> to build the test. If so, which directory am I supposed to put it in?
>> >>>>
>> >>>> Please let me know.
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> Shigeki
>> >>>> 2012/8/3 Karl Wright <daddywri@gmail.com>
>> >>>>>
>> >>>>> A test has been created for both PostgreSQL and MySQL.  If you
>> >>>>> check out trunk, you can run the tests like this:
>> >>>>>
>> >>>>> ant run-webcrawler-loadtests-postgresql
>> >>>>>
>> >>>>> and
>> >>>>>
>> >>>>> ant run-webcrawler-loadtests-mysql
>> >>>>>
>> >>>>> I've run the PostgreSQL test here on Windows and it succeeds.  Can
>> >>>>> you confirm that the MySQL test fails for you?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Karl
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Jul 31, 2012 at 8:03 AM, Karl Wright <daddywri@gmail.com>
>> >>>>> wrote:
>> >>>>> > I've created CONNECTORS-496 to track this issue.
>> >>>>> >
>> >>>>> > Karl
>> >>>>> >
>> >>>>> >
>> >>>>> > On Tue, Jul 31, 2012 at 3:11 AM, Shigeki Kobayashi
>> >>>>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>>>> >> Hi Karl
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> I use MySQL 5.5 and CentOS 5.8.
>> >>>>> >> I did not change any MySQL settings. I just set ManifoldCF's
>> >>>>> >> database maxhandles to 100.
>> >>>>> >>
>> >>>>> >> Regards,
>> >>>>> >>
>> >>>>> >> Shigeki
>> >>>>> >>
>> >>>>> >>
>> >>>>> >> 2012/7/31 Karl Wright <daddywri@gmail.com>
>> >>>>> >>>
>> >>>>> >>> Hi Shigeki,
>> >>>>> >>>
>> >>>>> >>> With the standard MySQL load test, with throttling wide open,
>> >>>>> >>> running on Windows Vista, I get very poor overall performance
>> >>>>> >>> and parallelism - indeed, it's so poor that I doubt there is
>> >>>>> >>> much parallelism at all going on, which may be why I've seen
>> >>>>> >>> problems only once in a great while.  See
>> >>>>> >>>
>> >>>>> >>> https://cwiki.apache.org/confluence/display/CONNECTORS/Database+Performance
>> >>>>> >>> .  Are you seeing better parallelism than this?  Are there MySQL
>> >>>>> >>> switch settings you have changed to enable decent performance?
>> >>>>> >>> What version of MySQL are you using, and what OS?
>> >>>>> >>>
>> >>>>> >>> Karl
>> >>>>> >>>
>> >>>>> >>>
>> >>>>> >>> On Mon, Jul 30, 2012 at 9:05 PM, Shigeki Kobayashi
>> >>>>> >>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>>>> >>> > Karl,
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> > I do not see any exceptions in the log.
>> >>>>> >>> >
>> >>>>> >>> > Thanks.
>> >>>>> >>> >
>> >>>>> >>> > Regards,
>> >>>>> >>> >
>> >>>>> >>> > Shigeki
>> >>>>> >>> >
>> >>>>> >>> >
>> >>>>> >>> > 2012/7/31 Karl Wright <daddywri@gmail.com>
>> >>>>> >>> >>
>> >>>>> >>> >> One more question: do you see any exceptions in the
>> >>>>> >>> >> manifoldcf log file?
>> >>>>> >>> >>
>> >>>>> >>> >> Karl
>> >>>>> >>> >>
>> >>>>> >>> >> On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright
>> >>>>> >>> >> <daddywri@gmail.com> wrote:
>> >>>>> >>> >> > This means that we are seeing some kind of transactional
>> >>>>> >>> >> > integrity problem with MySQL.  I have seen hints of this
>> >>>>> >>> >> > behavior before.  It is not a difference in logic.  It could
>> >>>>> >>> >> > be due to either MySQL bugs or subtle differences in how
>> >>>>> >>> >> > transactions work in MySQL.
>> >>>>> >>> >> >
>> >>>>> >>> >> > I will try to write a load test that uses hopcount filters in
>> >>>>> >>> >> > order to see if the problem can be reliably reproduced here.
>> >>>>> >>> >> > If it turns out to be a MySQL problem there would not be much
>> >>>>> >>> >> > we could do to fix the issue.
>> >>>>> >>> >> >
>> >>>>> >>> >> > Karl
>> >>>>> >>> >> >
>> >>>>> >>> >> > Sent from my Windows Phone
>> >>>>> >>> >> > ________________________________
>> >>>>> >>> >> > From: Shigeki Kobayashi
>> >>>>> >>> >> > Sent: 7/30/2012 6:36 AM
>> >>>>> >>> >> > To: user@manifoldcf.apache.org
>> >>>>> >>> >> > Subject: Re: crawled counts on WEB crawling differ between
>> >>>>> >>> >> > MCF0.4 and MCF0.5
>> >>>>> >>> >> >
>> >>>>> >>> >> >
>> >>>>> >>> >> >>(1) Make sure that the repository connections and job
>> >>>>> >>> >> >> definitions are indeed identical between MySQL and
>> >>>>> >>> >> >> PostgreSQL.
>> >>>>> >>> >> >
>> >>>>> >>> >> > Yes, they are all the same.
>> >>>>> >>> >> >
>> >>>>> >>> >> >>(2) See if you can locate an example document that was
>> >>>>> >>> >> >> crawled with PostgreSQL but not crawled with MySQL.
>> >>>>> >>> >> >
>> >>>>> >>> >> > I confirmed that the documents crawled with PostgreSQL but
>> >>>>> >>> >> > not crawled with MySQL actually exist.
>> >>>>> >>> >> >
>> >>>>> >>> >> >>(3) If you create a second web connection and job under
>> >>>>> >>> >> >> MySQL, and run the job to completion, does the document
>> >>>>> >>> >> >> that was not included get skipped again?  Or does it seem
>> >>>>> >>> >> >> random which documents are skipped on each run?
>> >>>>> >>> >> >
>> >>>>> >>> >> > OK. I created two connections and jobs with exactly the same
>> >>>>> >>> >> > description, and then ran the jobs to completion.
>> >>>>> >>> >> > Those runs resulted in different numbers of crawled
>> >>>>> >>> >> > documents (as shown in the attached picture).
>> >>>>> >>> >> >
>> >>>>> >>> >> > It seems the first run skipped some documents and the second
>> >>>>> >>> >> > run skipped different documents, but all the skipped docs can
>> >>>>> >>> >> > be located.  I have no clue how those docs are skipped.
>> >>>>> >>> >> >
>> >>>>> >>> >> >
>> >>>>> >>> >> > Regards,
>> >>>>> >>> >> >
>> >>>>> >>> >> > Shigeki
>> >>>>> >>> >> >
>> >>>>> >>> >> > 2012/7/30 Karl Wright <daddywri@gmail.com>
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> There should be no differences between crawling using MySQL
>> >>>>> >>> >> >> as the database and PostgreSQL, on the same version of
>> >>>>> >>> >> >> ManifoldCF.
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> We include an RSS crawling test which finds exactly the
>> >>>>> >>> >> >> expected number of documents on MySQL.  This is a 100,000
>> >>>>> >>> >> >> document crawl.  There are no back-end-specific logic
>> >>>>> >>> >> >> differences in the web connector that would be expected to
>> >>>>> >>> >> >> yield different results based on the back-end database.
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> If you believe you have found a difference between MySQL and
>> >>>>> >>> >> >> PostgreSQL, I suggest the following:
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> (1) Make sure that the repository connections and job
>> >>>>> >>> >> >> definitions are indeed identical between MySQL and
>> >>>>> >>> >> >> PostgreSQL.
>> >>>>> >>> >> >> (2) See if you can locate an example document that was
>> >>>>> >>> >> >> crawled with PostgreSQL but not crawled with MySQL.
>> >>>>> >>> >> >> (3) If you create a second web connection and job under
>> >>>>> >>> >> >> MySQL, and run the job to completion, does the document that
>> >>>>> >>> >> >> was not included get skipped again?  Or does it seem random
>> >>>>> >>> >> >> which documents are skipped on each run?
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> Thanks,
>> >>>>> >>> >> >> Karl
>> >>>>> >>> >> >>
>> >>>>> >>> >> >>
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi
>> >>>>> >>> >> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>>>> >>> >> >> > Aren't there some differences in crawling logic between
>> >>>>> >>> >> >> > MySQL and PostgreSQL?
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > I did some tests on web crawling using both MySQL and
>> >>>>> >>> >> >> > PostgreSQL.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > MCF0.5 running on MySQL indexed around 6000, and meanwhile
>> >>>>> >>> >> >> > MCF0.5 running on PostgreSQL indexed over 12000 documents.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > MCF0.6 running on MySQL indexed around 6000. MCF0.4 running
>> >>>>> >>> >> >> > on PostgreSQL indexed over 12000 documents.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > Each number of indexed documents above is the result of the
>> >>>>> >>> >> >> > first crawl after deleting the indexing history from the DB.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > It seems that changing the DB affects crawling and indexing.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > Regards,
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > Shigeki
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > 2012/7/27 Karl Wright <daddywri@gmail.com>
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> There was a bug fixed in the way hopcount was being
>> >>>>> >>> >> >> >> computed.  See CONNECTORS-464.
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> This means that fewer documents are left in the queue, but
>> >>>>> >>> >> >> >> the number of indexed documents should be the same.
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> Karl
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
>> >>>>> >>> >> >> >> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > Hi guys.
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > I wonder if anyone has run into web crawls where the
>> >>>>> >>> >> >> >> > number of crawled documents differs between MCF0.4 and
>> >>>>> >>> >> >> >> > MCF0.5.
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > I crawled some portal sites on an intranet using MCF0.4
>> >>>>> >>> >> >> >> > and MCF0.5.
>> >>>>> >>> >> >> >> > MCF0.4 crawled over 12000 documents, while MCF0.5 crawled
>> >>>>> >>> >> >> >> > only around half of them.
>> >>>>> >>> >> >> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
>> >>>>> >>> >> >> >> > I hope changing the DB does not affect the crawling
>> >>>>> >>> >> >> >> > results:
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > MCF0.4:
>> >>>>> >>> >> >> >> >   - Crawled Counts: 12000 and over
>> >>>>> >>> >> >> >> >   - Solr3.5
>> >>>>> >>> >> >> >> >   - PostgreSQL 9.1.3
>> >>>>> >>> >> >> >> >   - Tomcat6
>> >>>>> >>> >> >> >> >   - Max Hop on Links: 15
>> >>>>> >>> >> >> >> >   - Max Hop on Redirects: 10
>> >>>>> >>> >> >> >> >   - Include only hosts matching seeds: Checked
>> >>>>> >>> >> >> >> >   - org.apache.manifoldcf.crawler.threads: 50
>> >>>>> >>> >> >> >> >   - org.apache.manifoldcf.database.maxhandles: 100
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > MCF0.5:
>> >>>>> >>> >> >> >> >   - Crawled Counts: around 6000
>> >>>>> >>> >> >> >> >   - Solr3.5
>> >>>>> >>> >> >> >> >   - MySQL5.5
>> >>>>> >>> >> >> >> >   - Tomcat6
>> >>>>> >>> >> >> >> >   - Max Hop on Links: 15
>> >>>>> >>> >> >> >> >   - Max Hop on Redirects: 10
>> >>>>> >>> >> >> >> >   - Include only hosts matching seeds: Checked
>> >>>>> >>> >> >> >> >   - org.apache.manifoldcf.crawler.threads: 50
>> >>>>> >>> >> >> >> >   - org.apache.manifoldcf.database.maxhandles: 100
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > Does anyone have any ideas?
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
>> >>>  SoftBank Mobile Corp.
>> >>>  Information Systems Division
>> >>>  System Service Business Management Department
>> >>>  Service Planning Department
>> >>>
>> >>>  Shigeki Kobayashi
>> >>>  shigeki.kobayashi3@g.softbank.co.jp
>> >>> 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
>> >>>
>> >>>
>> >>>
