manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler doesn't extract links
Date Sun, 06 Dec 2015 17:55:08 GMT
I've created a ticket (https://issues.apache.org/jira/browse/CONNECTORS-1264)
and attached a patch.

Karl


On Sun, Dec 6, 2015 at 9:22 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Issei,
>
> MCF's html parser handles unquoted attribute values, but there are limits
> to what characters you can put in an unquoted attribute value according to
> HTML4.  It's not clear that "/" is in fact an allowed character, but if
> you believe that it is, then please open a ticket and I will fix the
> problem.
>
> Thanks,
> Karl
>
>
> On Sun, Dec 6, 2015 at 9:11 AM, Issei Nishigata <duo.2029@gmail.com>
> wrote:
>
>> I'm using MCF 2.2.
>> When I crawl pages containing links whose href attribute values are
>> unquoted, as below, MCF can't extract the links properly.
>>
>> <a href=/sample/Mainservlet?sample=000 >sample</a>
>> # The attribute value is not enclosed in double quotes.
>> # MCF extracted only "/sample".
>>
>> HTML4 does not always require quotes around attribute values,
>> whereas XHTML does.
>> Is MCF compliant with HTML4?
>>
>>
>> Thanks,
>> Issei
>>
>
>
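
For readers who want to reproduce the behaviour described in this thread, here is a minimal sketch in Python. It is not MCF's parser and says nothing about how the attached patch works; it only shows a tolerant parser (the standard library's html.parser) reading an unquoted href up to the next whitespace or ">", which matches how HTML5 parsers treat unquoted values. HTML 4.01 itself restricts unquoted values to letters, digits, hyphens, periods, underscores and colons, so accepting "/" and "?" here is leniency rather than strict HTML4 behaviour. The names LinkExtractor and extract_links are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (illustrative only, not MCF code)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute has no "=value" part at all.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return every href found on <a> tags in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# The sample markup from the original report, with an unquoted href.
print(extract_links("<a href=/sample/Mainservlet?sample=000 >sample</a>"))
```

A crawler that stops at "/sample" for this input is cutting the unquoted value short at a character a lenient parser would accept.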
