Solr has an HTMLStripReader, used by two different tokenizers, that does
the basics of ignoring tags when reading text ... it has one known bug
when dealing with highlighting...
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HTMLStripReader.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.html
http://issues.apache.org/jira/browse/SOLR-42
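
For a feel of the general idea without pulling in Solr, here is a minimal
sketch using only the JDK. To be clear, this is NOT Solr's HTMLStripReader:
TagSkippingReader and TagStripSketch are hypothetical names made up for
illustration, and the sample markup is assumed since the tags in the original
message did not survive. It blanks out everything between '<' and '>' and
splits the rest on whitespace. It does not handle entities, comments,
script blocks, or a '>' inside an attribute value.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not Solr's HTMLStripReader.
class TagSkippingReader extends FilterReader {
    private boolean inTag = false; // true while between '<' and '>'

    TagSkippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == -1) return -1;
        if (c == '<') {
            inTag = true;
            return ' ';
        }
        if (inTag) {
            if (c == '>') inTag = false;
            return ' '; // tag characters become spaces, so text keeps its position
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }
}

class TagStripSketch {
    // Returns the whitespace-separated words of the input with tags ignored.
    static List<String> keywords(String html) {
        StringBuilder text = new StringBuilder();
        try (Reader r = new TagSkippingReader(new StringReader(html))) {
            int c;
            while ((c = r.read()) != -1) text.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        for (String tok : text.toString().split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample markup, not the original poster's document.
        System.out.println(keywords("<p>You are welcome</p><b>Testing text</b>"));
    }
}
```

Replacing tag characters with spaces instead of dropping them is a choice
made in this sketch so that each word stays at the same character position
as in the raw input, which may matter for offset-based features such as
highlighting. In a real index you would wrap the reader around the field's
input before handing it to a tokenizer, or just use the Solr classes linked
above.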
: Date: Wed, 7 Feb 2007 17:04:54 -0800 (PST)
: From: Joe Tang
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: How to not tokenize HTML tag from input string
:
:
: My work is to index keywords from a document. In my case, the document
: contains HTML tags, which I don't want to index.
:
: For example:
: Input Document:
:
:
: You are welcome
:
:
: Testing text
:
:
:
: Expected Keywords:
: keywords:You
: keywords:are
: keywords:welcome
: keywords:Testing
: keywords:text
:
: Is there any way I can keep the tags from becoming keywords?
: --
: View this message in context: http://www.nabble.com/How-to-not-tokenize-HTML-tag-from-input-string-tf3190778.html#a8857789
: Sent from the Lucene - Java Users mailing list archive at Nabble.com.
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:
-Hoss