lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: How to not tokenize HTML tag from input string
Date Thu, 08 Feb 2007 02:18:37 GMT
Sure, just don't index the HTML tags in the first place. Of course, that
means you need to parse the document before indexing it. Here's a parser
that was mentioned on the thread a while ago:

http://sourceforge.net/projects/mozillaparser

There may very well be others....
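
As a rough sketch of the parse-first approach, here is one way to pull only
the text out of a document using the HTML parser that ships with the JDK
(javax.swing.text.html), rather than the mozillaparser project linked above.
The class and method names (HtmlTextExtractor, extractWords) are made up for
illustration, and the JDK parser is lenient and fairly old, so treat this as
a starting point, not a definitive implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the words found in text nodes, ignoring tags.
final class HtmlTextExtractor {

    static List<String> extractWords(String html) {
        final List<String> words = new ArrayList<>();

        // The parser calls handleText only for character data, never for tags
        // or attributes, so element names and ids do not end up in the list.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                for (String word : new String(data).trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<div id=\"tp-wrapper\">\n"
                + "<span id=\"tp-top-right\">You are welcome</span>\n"
                + "<div id=\"tp-tab\">\n"
                + "<h1>Testing text</h1>\n"
                + "</div>\n"
                + "</div>";
        for (String word : extractWords(html)) {
            System.out.println("keywords:" + word);
        }
    }
}
```

The extracted words (or the text joined back into one string) are what you
would then hand to the analyzer, instead of the raw HTML.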

Depending on how sophisticated you need to be, you might be able to use a
regular expression to strip all the HTML tags before indexing.
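
A minimal sketch of the regular-expression route, assuming simple markup
like the sample quoted below. The names HtmlTagStripper and stripTags are
hypothetical. The pattern treats anything between '<' and the next '>' as a
tag, so it will misbehave on a '>' inside an attribute value, on comments,
and on script or style blocks, and it does not decode entities such as
&amp;. For anything beyond simple input, a real parser is the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes tags with a regex and normalizes whitespace.
final class HtmlTagStripper {

    // Matches '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not get glued together.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the leftover runs of whitespace and trim the ends.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<div id=\"tp-wrapper\">\n"
                + "<span id=\"tp-top-right\">You are welcome</span>\n"
                + "<div id=\"tp-tab\">\n"
                + "<h1>Testing text</h1>\n"
                + "</div>\n"
                + "</div>";
        // Prints: You are welcome Testing text
        System.out.println(stripTags(html));
    }
}
```

The cleaned string is what you would pass to your analyzer, so only the
visible words get tokenized.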

Best
Erick

On 2/7/07, Joe Tang <joe.tang@workmetro.com> wrote:
>
>
> My work is to index the keywords in a document. In my case, the document
> contains HTML tags, which I don't want to index.
>
> For example:
> Input Document:
> <div id="tp-wrapper">
> <span id="tp-top-right">You are welcome</span>
> <div id="tp-tab">
> <h1>Testing text</h1>
> </div>
> </div>
>
> Expected Keywords:
> keywords:You
> keywords:are
> keywords:welcome
> keywords:Testing
> keywords:text
>
> Is there any way I can keep the tags from being indexed as keywords?
