manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Webconnector: Comparison operator '<' in the body of a script tag
Date Wed, 24 Jun 2015 15:03:33 GMT
Hi Brad,

I've attached a patch to the ticket:
https://issues.apache.org/jira/browse/CONNECTORS-1215 .  This patch merely
tightens what the fuzzyml parser regards as a valid tag start, to adhere to
the w3c specification.  I don't know whether browsers do it that way or
not, but it should fix the specific page you included in your post.
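
For illustration only, here is a minimal Java sketch of what a stricter
tag-start test could look like.  It is not the patch attached to
CONNECTORS-1215, and the names TagStartCheck and looksLikeTagStart are
hypothetical rather than part of the fuzzyml parser.  It assumes the rule
that '<' opens markup only when it is immediately followed by an ASCII
letter, '!', '?', or '/' plus an ASCII letter.

```java
// Hypothetical sketch, not ManifoldCF code: decide whether a '<' should be
// treated as the start of markup or as ordinary text.
public final class TagStartCheck {
    private TagStartCheck() {}

    // Assumption: tag names must begin with an ASCII letter.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /**
     * Returns true if the character at pos is '<' and what follows it looks
     * like a start tag, an end tag, or a comment/doctype/processing
     * instruction under the assumed rule.
     */
    public static boolean looksLikeTagStart(CharSequence text, int pos) {
        if (pos < 0 || pos + 1 >= text.length() || text.charAt(pos) != '<') {
            return false;
        }
        char next = text.charAt(pos + 1);
        if (isAsciiLetter(next)) {
            return true;                       // e.g. <script
        }
        if (next == '!' || next == '?') {
            return true;                       // e.g. <!-- or <?xml
        }
        if (next == '/') {
            // An end tag needs a letter right after "</".
            return pos + 2 < text.length() && isAsciiLetter(text.charAt(pos + 2));
        }
        return false;                          // e.g. "< ", "<=", "<5"
    }

    public static void main(String[] args) {
        String js = "for (i = 0; i < n; i++) {}";
        System.out.println(looksLikeTagStart(js, js.indexOf('<')));  // false
        System.out.println(looksLikeTagStart("</script>", 0));       // true
    }
}
```

Under this assumed rule a '<' followed by whitespace or '=' is treated as
text, but a comparison written without a space, such as x<y, would still
look like a tag start.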

Please let me know if you run into further difficulties with other pages;
we can look at them one at a time.

Karl


On Wed, Jun 24, 2015 at 10:49 AM, Brad Dennis <Brad.Dennis@directsupply.com>
wrote:

> Karl,
>
> Thank you for investigating the issue.  My concern is that I expect it's
> fairly common to use '<' in embedded, uncommented, Javascript and this bug
> excludes any content that appears after one and before a second end script
> tag from being crawled with ManifoldCF.  Unfortunately, I don't have any
> suggestions other than using a stack to push open tags onto and pop off
> when an end tag is seen.  I believe that would satisfy your example, but
> who knows what other problems a stack brings.
>
> Do you have any suggestions for workarounds I could implement locally?
>
> Thanks,
> Brad
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Wednesday, June 24, 2015 9:33 AM
> To: dev
> Subject: Re: Webconnector: Comparison operator '<' in the body of a script
> tag
>
> Brad,
>
> The issue is complex because according to spec the code is doing the right
> thing.  Typically, <script> blocks look something like this:
>
> <script ...>
> <!--
>
> ...
>
> //-->
> </script>
>
> The reason for the comment area is because without it, tags within the
> script block are supposed to be recognized as such, even if they are
> ignored.  Within comments, this does not happen, of course, which is why
> comments are used.
>
> I don't believe it is a real standard, but some browsers try to interpret
> script blocks differently even when no comment is given.  We can try to
> emulate that behavior but it is likely that our emulation will not work for
> all web pages, since it's not a standard.  Exploring how this works on
> various browsers would be the first step.  Specifically, if you do
> something like this:
>
> <script ...>
>
> foo = "<script></script>";
> bar = "hello";
>
> </script>
>
> ... what happens?  Does the script end at the first </script>, or the
> second?  And, in what browsers?
>
> Until we get more clarity it's going to be hard to do a feature that
> actually helps rather than hurts...
>
> Karl
>
>
> On Wed, Jun 24, 2015 at 10:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> > Hi Brad,
> >
> > I've created a ticket: CONNECTORS-1215.  Looking into this now.
> >
> > Karl
> >
> >
> > On Wed, Jun 24, 2015 at 9:45 AM, Brad Dennis
> > <Brad.Dennis@directsupply.com> wrote:
> >
> >> Hi,
> >>
> >> There appears to be a bug in the TagParseState when the comparison
> >> operator '<' is encountered in the body of a script tag.  It
> >> appears to get flagged as an open tag and then the next '</' closes
> >> it.  In my case, the next '</' is the script tag.  The
> >> ScriptParseState chomps everything until it encounters a second
> >> </script> tag.
> >>
> >> A live link that demonstrates this bug is here:
> >>
> >> http://www.prnewswire.com/search-results/news/Google%252C%2520Inc.-30-days-page-1-pagesize-20
> >>
> >> The '<' near line 2826 in the script body that begins near line 2759
> >> begins a new tag 'arraykeywords.length' which gets closed by the '</'
> >> in the closing script tag.  The ScriptParseState chomps all the html
> >> until it sees the end script tag near line 3385.
> >>
> >> At the moment, I'm not sure of a solution other than pushing the
> >> script tag handling up to the TagParseState and treating it like CDATA
> >> is.
> >>
> >>
> >> Thanks,
> >>
> >> Brad Dennis
> >>
> >>
> >>
> >
>
