lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-10337) HTMLStripCharFilterFactory does not seem to handle <script> section inside a <body> section
Date Tue, 21 Mar 2017 23:58:41 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935563#comment-15935563 ]

Steve Rowe commented on SOLR-10337:
-----------------------------------

I can't reproduce the problem; here's the test I added to {{TestHTMLStripCharFilterFactory.java}}
(a copy of your failing content):

{noformat}
  
  public void testScript() throws Exception {
    final String text = "<body>\n" +
        "<script>\n" +
        "function myFunctionInsideBody() {\n" +
        "   document.getElementById(\"demo_body\").innerHTML = \"Paragraph changed.\";\n" +
        "}\n" +
        "</script>\n" +
        "word\n" +
        "</body>\n";
    Reader cs = charFilterFactory("HTMLStrip").create(new StringReader(text));
    TokenStream ts = whitespaceMockTokenizer(cs);
    assertTokenStreamContents(ts, new String[] { "..." });
  }
{noformat}


> HTMLStripCharFilterFactory does not seem to handle <script> section inside a <body> section
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10337
>                 URL: https://issues.apache.org/jira/browse/SOLR-10337
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 6.4.1
>         Environment: Windows 7 Professional 64-bit (current patch/release)
>            Reporter: NW Brad
>
> HTMLStripCharFilterFactory does not remove <script> sections from the <body> section of an HTML document, but works fine in the <head> section.
> Fails to remove <script> section content (removes tags, leaves content):
> <body>
> <script>
> function myFunctionInsideBody() {
>     document.getElementById("demo_body").innerHTML = "Paragraph changed.";
> }
> </script>
> ...
> </body>
> Works - removes entire <script> section:
> <head>
> <script>
> function myFunctionInsideHead() {
>     document.getElementById("demo_head").innerHTML = "Paragraph changed.";
> }
> </script>
> ...
> </head>
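
For reference, the result the report expects (the whole <script> element dropped, content included, whether it sits in <head> or <body>) can be illustrated with a naive regex stand-in. This is only a sketch of the expected outcome: the class ScriptStripSketch and its methods are hypothetical names, it uses plain java.util.regex rather than Lucene, and it says nothing about what HTMLStripCharFilter actually emits for this input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; NOT Lucene's HTMLStripCharFilter.
class ScriptStripSketch {
    // Matches a whole <script>...</script> element, including its content.
    private static final Pattern SCRIPT = Pattern.compile(
        "<script\\b[^>]*>.*?</script\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag such as <body> or </head>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Drops script elements first, then replaces the remaining tags with spaces.
    static String strip(String html) {
        String withoutScripts = SCRIPT.matcher(html).replaceAll(" ");
        return TAG.matcher(withoutScripts).replaceAll(" ");
    }

    // Splits the stripped text on whitespace, roughly like a whitespace tokenizer.
    static List<String> tokens(String html) {
        List<String> out = new ArrayList<>();
        for (String t : strip(html).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String body = "<body>\n<script>\nfunction myFunctionInsideBody() {\n"
            + "    document.getElementById(\"demo_body\").innerHTML = \"Paragraph changed.\";\n"
            + "}\n</script>\nword\n</body>\n";
        System.out.println(tokens(body)); // prints [word]
    }
}
```

With this stand-in, the same input wrapped in <head> instead of <body> yields the same single token, which is the symmetry the report asks for.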



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


