lucene-solr-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-678) HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html tags correctly.
Date Thu, 26 Nov 2009 04:06:39 GMT

    [ https://issues.apache.org/jira/browse/SOLR-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782741#action_12782741 ]

Lance Norskog commented on SOLR-678:
------------------------------------

HTMLStripStandardTokenizerFactory and HTMLStripWhitespaceTokenizerFactory are deprecated.
I recommend closing this issue.

> HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html tags correctly.
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-678
>                 URL: https://issues.apache.org/jira/browse/SOLR-678
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.2
>         Environment: Mac OS X 10.5.4, java version "1.5.0_13"
>            Reporter: Matt Connolly
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The HTMLStripStandardTokenizerFactory filter does not place word boundaries on HTML tags
> as it should.
> For example, indexing the text "<h2>title</h2><p>some comment</p>" results in two words
> being indexed, "titlesome" and "comment", when there should be three: "title", "some",
> and "comment".
> Not all tags need this. For example, it may be perfectly reasonable for "<b>sub</b>script"
> to be indexed as "subscript", since <b> is interpreted as inline, not block.
> I would suggest that all block or paragraph tags be translated into spaces so that text on
> either side of the tag is treated as separate tokens, e.g. p div h1 h2 h3 h4 h5 h6 br hr pre
> (etc.)
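
The behaviour suggested in the issue description can be sketched as follows. This is a minimal illustration in Python, not Solr's implementation: the names BLOCK_TAGS, TAG_RE, strip_html and tokenize are hypothetical, the tag list contains only the tags named in the report, and whitespace splitting stands in for a real tokenizer.

```python
import re

# Assumption: only the block tags named in the report; the "(etc)" is left open.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "br", "hr", "pre"}

# Matches an opening, closing, or self-closing tag and captures the tag name.
TAG_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text):
    """Replace block-level tags with a space and drop all other tags."""
    def replace(match):
        name = match.group(1).lower()
        # Block tags become a word boundary; inline tags vanish entirely.
        return " " if name in BLOCK_TAGS else ""
    return TAG_RE.sub(replace, text)


def tokenize(text):
    """Strip tags, then split on whitespace (a stand-in for a real tokenizer)."""
    return strip_html(text).split()


if __name__ == "__main__":
    print(tokenize("<h2>title</h2><p>some comment</p>"))
    print(tokenize("<b>sub</b>script"))
```

With this sketch, the first example from the report yields three tokens ("title", "some", "comment"), while "<b>sub</b>script" stays a single token because <b> is not in the block list.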

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

