lucene-dev mailing list archives

From Joe Tang <joe.t...@workmetro.com>
Subject How to not tokenize HTML tag from input string
Date Thu, 08 Feb 2007 00:10:02 GMT

My task is to index the keywords of a document. In my case, the document
contains HTML tags, which I don't want to index.

For example: 
Input Document: 
<div id="tp-wrapper">
<span id="tp-top-right">You are welcome</span> 
<div id="tp-tab"> 
<h1>Testing text</h1>
/images/gui/tab_grey_bkg_lftend.gif 
</div>
</div>

Expected Keywords:
keywords:You
keywords:are
keywords:welcome
keywords:Testing
keywords:text

Is there any way to keep the HTML tags out of the keywords?
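
To make the goal concrete, here is a minimal sketch of one possible approach: strip the markup before the text reaches the tokenizer, then split what is left on whitespace. It uses only the Java standard library. The class name HtmlKeywordSketch and the method keywords are made up for illustration and are not part of Lucene, and the sample input is an abridged copy of the document above. The regular expression is a simplification that assumes well-formed tags; it does not handle comments, script blocks, or character entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Lucene class.
final class HtmlKeywordSketch {
    // Matches anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
    // Assumption: tags are well formed and '>' never appears inside an attribute value.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the whitespace-separated words left after the tags are removed.
    static List<String> keywords(String html) {
        // Replace each tag with a space so words on either side of a tag stay separate.
        String text = TAG.matcher(html).replaceAll(" ");
        List<String> words = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Abridged version of the input document from this message.
        String html = "<div id=\"tp-wrapper\">\n"
                + "<span id=\"tp-top-right\">You are welcome</span>\n"
                + "<div id=\"tp-tab\">\n"
                + "<h1>Testing text</h1>\n"
                + "</div>\n"
                + "</div>";
        for (String word : keywords(html)) {
            System.out.println("keywords:" + word);
        }
    }
}
```

The cleaned text could then be handed to whatever analyzer indexes the field, so the tag names and attributes never become tokens. A real HTML parser would be a more robust choice than a regular expression for arbitrary pages.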
-- 
View this message in context: http://www.nabble.com/How-to-not-tokenize-HTML-tag-from-input-string-tf3190611.html#a8857238
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



