manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Thu, 03 Feb 2011 11:36:29 GMT
I was using seed: http://ridder.uio.no
Perhaps that accounts for the difference.  Nevertheless, since
http://ridder.uio.no is fetchable, a fix for that problem is still
needed.  (I did, BTW, try appending a "/" to the URI if the path part
was determined to be null, but that too did not work.)

Karl
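
For reference, here is a minimal sketch of the "append a slash" idea discussed above. It is not the fix that went into ManifoldCF; the class name ResolveSketch and the helper resolveAgainst are hypothetical, made up for illustration. Two things it relies on: java.net.URI.getPath() returns an empty string (not null) for a hierarchical URI such as http://ridder.uio.no, so a null check alone would never fire, and RFC 3986 section 5.2.3 says that a base URI with an authority and an empty path should be merged as "/" plus the relative reference.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ResolveSketch {
    // Hypothetical helper, not part of the ManifoldCF web connector.
    // Resolves rawURL against parentIdentifier, first giving the parent
    // a "/" path when it has an authority but an empty path.
    static URI resolveAgainst(String parentIdentifier, String rawURL)
            throws URISyntaxException {
        URI parent = new URI(parentIdentifier);
        String path = parent.getPath();
        // For hierarchical URIs getPath() is "" rather than null when no
        // path is present; null only shows up for opaque URIs.
        if (!parent.isOpaque()
                && parent.getAuthority() != null
                && (path == null || path.isEmpty())) {
            parent = new URI(parent.getScheme(), parent.getAuthority(),
                    "/", parent.getQuery(), parent.getFragment());
        }
        return parent.resolve(new URI(rawURL));
    }

    public static void main(String[] args) throws Exception {
        // Seed without a trailing slash
        System.out.println(resolveAgainst("http://ridder.uio.no", "dokument.pdf"));
        // Seed with a trailing slash
        System.out.println(resolveAgainst("http://ridder.uio.no/", "dokument.pdf"));
    }
}
```

With either seed form the helper should produce http://ridder.uio.no/dokument.pdf. Whether the unpatched resolve() call yields the malformed http://ridder.uio.nodokument.pdf seems to depend on the JDK version, so the sketch normalizes the parent up front instead of catching the exception and retrying.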

On Thu, Feb 3, 2011 at 4:57 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
> Honestly, I haven't modified any crawler code at all. Are you sure you
> entered a url with a trailing slash in the seed list? I tried to skip that
> slash, and then the crawler began to act strangely. I cannot reproduce your
> results.
>
> These are my settings:
> Seed: http://ridder.uio.no/
> Inclusions: ^http://ridder.uio.no/.* (marked "include only host matching
> ...")
>
> Everything works like a dream. The only problem I have with the PDF document
> is that it does not parse the Norwegian characters correctly, but this can
> be a Tika bug since all other document formats are parsed correctly.
>
> BTW: I did a svn update, ant clean -> build, and now the document with the
> noindex rule is skipped. Great. Thanks a zillion!
>
> And regarding the Solr trick with the jar files I had to move manually since
> they were excluded from solr.jar (my last home lesson):
> - When Solr is running in a servlet container such as Resin, you have to
> move the following jars manually into the <solr.home>/lib directory in order
> to enable the ExtractingRequestHandler:
>  - apache-solr-cell-*.jar
>  - the other Tika jars
>
> You will find the same information in the following file:
> solr_trunk/solr/contrib/extraction/CHANGES.txt.
>
> Erlend
>
>
> On 02.02.11 17.34, Karl Wright wrote:
>>
>> Turns out Java doesn't like the form of those URLs; it doesn't think
>> they're proper:
>>
>> WEB: Can't use url 'dokument.pdf' because it is badly formed: Relative
>> path in absolute URI: http://ridder.uio.nodokument.pdf
>> WEB: In html document 'http://ridder.uio.no', found an unincluded URL
>> 'dokument.pdf'
>>
>> This is the java.net.URI class:
>>
>>         java.net.URI parentURL = new java.net.URI(parentIdentifier);
>>         url = parentURL.resolve(rawURL);
>>
>> ... and this is throwing a java.net.URISyntaxException.
>>
>> I'm going to have to go look at the standards to figure out what we
>> should do here.  Perhaps the right approach is to note the exception
>> and retry with a "/" glommed on the front if we get it.
>>
>> But clearly you must have modified the web connector in order to get
>> it to crawl your stuff in the first place.
>>
>> Karl
>>
>> On Wed, Feb 2, 2011 at 11:08 AM, Karl Wright<daddywri@gmail.com>  wrote:
>>>
>>> Hmm.  I get 701 bytes from your seed, but no parseable links.
>>>  Investigating...
>>> Karl
>>>
>>> On Wed, Feb 2, 2011 at 10:45 AM, Erlend Garåsen<e.f.garasen@usit.uio.no>
>>>  wrote:
>>>>
>>>> On 28.01.11 14.32, Karl Wright wrote:
>>>>>
>>>>> Thanks.  I tested my changes enough so that I was confident in
>>>>> committing the patch, so the changes are in trunk.
>>>>
>>>> I'm afraid that it doesn't work properly. I downloaded the latest
>>>> version from trunk and started the crawler.
>>>>
>>>> Try to use the following address in your seed list and the following
>>>> rule in the includes list:
>>>> ^http://ridder.uio.no/.*
>>>>
>>>> The following document was fetched and sent to Solr for indexing even
>>>> though it includes a robots noindex rule:
>>>> http://ridder.uio.no/test_closed/
>>>>
>>>> Here's the line from the history telling me that Solr should index it:
>>>> 02-02-2011 16:12:33.283   document ingest (Solr)   http://ridder.uio.no/test_closed/   200
>>>>
>>>> I can try to modify the code you have added in order to get around this
>>>> tomorrow. I guess I can find the relevant check somewhere in the
>>>> following folder?
>>>>
>>>> mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler
>>>>
>>>> Erlend
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>> 31050
>>>>
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
