manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler doesn't extract links
Date Sun, 06 Dec 2015 14:22:57 GMT
Hi Issei,

MCF's HTML parser handles unquoted attribute values, but HTML4 limits which
characters an unquoted attribute value may contain.  It's not clear that "/"
is in fact an allowed character, but if you believe that it is, then please
open a ticket and I will fix the problem.
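
For reference, here is a rough sketch of the two rules in question. It is not
MCF's parser and does not try to reproduce its behavior; the names
is_valid_html4_unquoted and extract_hrefs are made up for illustration. The
first check follows the wording of HTML 4.01 section 3.2.2 (letters, digits,
hyphens, periods, underscores and colons only), and the second assumes a
lenient, browser-style reading where an unquoted value runs up to the next
whitespace or ">".

```python
import re

# HTML 4.01, section 3.2.2: an unquoted attribute value may contain only
# letters, digits, hyphens, periods, underscores and colons.
HTML4_UNQUOTED = re.compile(r"[A-Za-z0-9._:-]+")

# Lenient rule (an assumption, not MCF's implementation): accept a
# double-quoted value, a single-quoted value, or an unquoted value that
# runs until the next whitespace character or ">".
LENIENT_HREF = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def is_valid_html4_unquoted(value):
    """Return True if value may legally appear unquoted under HTML 4.01."""
    return HTML4_UNQUOTED.fullmatch(value) is not None


def extract_hrefs(html):
    """Return every href value found in html using the lenient rule."""
    hrefs = []
    for match in LENIENT_HREF.finditer(html):
        # Exactly one of the three groups matched; pick that one.
        hrefs.append(next(g for g in match.groups() if g is not None))
    return hrefs


sample = "<a href=/sample/Mainservlet?sample=000 >sample</a>"
print(is_valid_html4_unquoted("/sample/Mainservlet?sample=000"))
print(extract_hrefs(sample))
```

Under the strict HTML 4.01 wording the sample value would need quotes, while
the lenient rule still recovers the whole link, which is the trade-off a
crawler has to decide on.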

Thanks,
Karl


On Sun, Dec 6, 2015 at 9:11 AM, Issei Nishigata <duo.2029@gmail.com> wrote:

> I'm using MCF 2.2.
> When I crawl pages whose href attribute values look like the one below,
> MCF can't extract the links properly.
>
> <a href=/sample/Mainservlet?sample=000 >sample</a>
> # The attribute value is not enclosed in double quotes.
> # I got "/sample".
>
> HTML4 does not always require quotes around attribute values.
> XHTML does require quotes around attribute values.
> Is MCF compliant with HTML4?
>
>
> Thanks,
> Issei
>
