manifoldcf-dev mailing list archives

From Brad Dennis <Brad.Den...@directsupply.com>
Subject RE: Webconnector: Comparison operator '<' in the body of a script tag
Date Wed, 24 Jun 2015 15:16:21 GMT
Karl,

The patch is working.  Thank you very much!  Also, thank you for your clarification on the
behavior of the parser.  It's pretty complex.

Brad

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com] 
Sent: Wednesday, June 24, 2015 10:04 AM
To: dev
Subject: Re: Webconnector: Comparison operator '<' in the body of a script tag

Hi Brad,

I've attached a patch to the ticket:
https://issues.apache.org/jira/browse/CONNECTORS-1215 .  This patch merely tightens what the
fuzzyml parser regards as a valid tag start, to adhere to the w3c specification.  I don't
know whether browsers do it that way or not, but it should fix the specific page you included
in your post.
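
[Illustration, not part of the original message.] As a rough idea of what "tightening the valid tag start" can mean, here is a minimal Python sketch. It is not the CONNECTORS-1215 patch and does not use fuzzyml code; the names `is_tag_start` and `find_tag_starts` are hypothetical. It assumes the rule is that a '<' opens markup only when the very next character is an ASCII letter, '/', '!' or '?', which is how the HTML tokenizer rules treat a '<' in ordinary content.

```python
import string
from typing import List

# Characters that may directly follow '<' when it opens markup:
# a letter (start tag), '/' (end tag), '!' (comment or doctype), '?'.
_TAG_START_FOLLOWERS = set(string.ascii_letters) | {"/", "!", "?"}


def is_tag_start(text: str, pos: int) -> bool:
    """Return True if text[pos] is '<' immediately followed by a
    character that can begin a tag or markup declaration."""
    if pos < 0 or pos >= len(text) or text[pos] != "<":
        return False
    following = text[pos + 1:pos + 2]  # empty string at end of input
    return following in _TAG_START_FOLLOWERS


def find_tag_starts(text: str) -> List[int]:
    """Return the offsets of every '<' that qualifies as a tag start."""
    return [i for i in range(len(text)) if is_tag_start(text, i)]
```

Under this assumed rule, the '<' in `i < n` is plain text because a space follows it, while the '<' in `<b>` still opens a tag. A comparison written without a space, such as `i<n`, would still be taken as a tag start, so a check like this narrows the problem rather than removing it.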

Please let me know if you run into further difficulties with other pages; we can look at them
one at a time.

Karl


On Wed, Jun 24, 2015 at 10:49 AM, Brad Dennis <Brad.Dennis@directsupply.com>
wrote:

> Karl,
>
> Thank you for investigating the issue.  My concern is that I expect 
> it's fairly common to use '<' in embedded, uncommented, Javascript and 
> this bug excludes any content that appears after one and before a 
> second end script tag from being crawled with ManifoldCF.  
> Unfortunately, I don't have any suggestions other than using a stack 
> to push open tags onto and pop off when an end tag is seen.  I believe 
> that would satisfy your example, but who knows what other problems a stack brings.
>
> Do you have any suggestions for workarounds I could implement locally?
>
> Thanks,
> Brad
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: Wednesday, June 24, 2015 9:33 AM
> To: dev
> Subject: Re: Webconnector: Comparison operator '<' in the body of a 
> script tag
>
> Brad,
>
> The issue is complex because according to spec the code is doing the 
> right thing.  Typically, <script> blocks look something like this:
>
> <script ...>
> <!--
>
> ...
>
> //-->
> </script>
>
> The comment area is there because, without it, tags within the 
> script block are supposed to be recognized as such, even if they are 
> ignored.  Within comments, this does not happen, of course, which is 
> why comments are used.
>
> I don't believe it is a real standard, but some browsers try to 
> interpret script blocks differently even when no comment is given.  We 
> can try to emulate that behavior but it is likely that our emulation 
> will not work for all web pages, since it's not a standard.  Exploring 
> how this works on various browsers would be the first step.  
> Specifically, if you do something like this:
>
> <script ...>
>
> foo = "<script></script>";
> bar = "hello";
>
> </script>
>
> ... what happens?  Does the script end at the first </script>, or the 
> second?  And, in what browsers?
>
> Until we get more clarity it's going to be hard to do a feature that 
> actually helps rather than hurts...
>
> Karl
>
>
> On Wed, Jun 24, 2015 at 10:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> > Hi Brad,
> >
> > I've created a ticket: CONNECTORS-1215.  Looking into this now.
> >
> > Karl
> >
> >
> > On Wed, Jun 24, 2015 at 9:45 AM, Brad Dennis <Brad.Dennis@directsupply.com> wrote:
> >
> >> Hi,
> >>
> >> There appears to be a bug in the TagParseState when the comparison 
> >> operator '<' is encountered in the body of a script tag.  It 
> >> appears to get flagged as an open tag and then the next '</' closes 
> >> it.  In my case, the next '</' is the script tag.  The 
> >> ScriptParseState chomps everything until it encounters a second
> >> </script> tag.
> >>
> >> A live link that demonstrates this bug is here:
> >>
> >> http://www.prnewswire.com/search-results/news/Google%252C%2520Inc.-30-days-page-1-pagesize-20
> >>
> >> The '<' near line 2826 in the script body that begins near line 2759
> >> begins a new tag 'arraykeywords.length' which gets closed by the '</'
> >> in the closing script tag.  The ScriptParseState chomps all the 
> >> html until it sees the end script tag near line 3385.
> >>
> >> At the moment, I'm not sure of a solution other than pushing the 
> >> script tag handling up to the TagParseState and treating it like 
> >> CDATA is.
> >>
> >>
> >> Thanks,
> >>
> >> Brad Dennis
> >>
> >>
> >>
> >
>
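
[Illustration, not part of the original thread.] On the question raised in the thread of whether a script ends at the first or the second </script>: under the HTML tokenizer rules, when the script body contains no `<!--`, the body ends at the first `</script` that is followed by whitespace, '/' or '>', even if that text sits inside a JavaScript string. A '<' used as a comparison operator has no effect on where the body ends. The Python sketch below shows only that simple case. The function name `script_body_end` is hypothetical, it is not ManifoldCF code, and it deliberately ignores the special handling that applies once `<!--` appears inside the script.

```python
import re

# '</script' in any letter case, followed by whitespace, '/' or '>'.
_SCRIPT_END = re.compile(r"</script[\s/>]", re.IGNORECASE)


def script_body_end(html: str, body_start: int) -> int:
    """Return the offset of the '<' of the first closing script tag at or
    after body_start, or len(html) if the script is never closed."""
    match = _SCRIPT_END.search(html, body_start)
    return match.start() if match else len(html)


if __name__ == "__main__":
    page = (
        "<script>\n"
        'foo = "<script></script>";\n'
        'bar = "hello";\n'
        "</script><p>after</p>"
    )
    start = page.index(">") + 1
    # The body stops at the </script> inside the string literal.
    print(repr(page[start:script_body_end(page, start)]))
```

Treating the script body as raw text in this way is close to the suggestion in the original report of handling script content the way CDATA is handled: nothing inside the body is parsed as a tag, and only the closing script tag is searched for.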