lucene-java-user mailing list archives

From Ahmet Arslan
Subject Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?
Date Mon, 07 Jun 2010 21:09:00 GMT
> I need to index HTML documents and one of the requirements is to highlight
> documents while maintaining all of the original formatting. The documents
> are relatively simple HTML, meaning no JavaScript code that changes elements
> at runtime or too fancy CSS styling.
>
> I think it should be possible to write a tokenizer that strips out the HTML
> tags but maintains the original offsets within the HTML document so they
> can be used for highlighting the original HTML document, not just the
> text representation.
>
> Does anybody know any tokenizers that can do this? It seems it's something
> other people may need too.
>
> I am fairly new to Lucene so I may have chosen the wrong terminology but I
> hope this makes sense.

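You can use org.apache.solr.analysis.HTMLStripCharFilter. An analyzer can apply one or more
org.apache.lucene.analysis.CharFilter instances to the input before it reaches the tokenizer.
The char filter strips the markup, and the token offsets are corrected so that they refer to
positions in the original HTML, which is what you need for highlighting.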
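
To make the offset idea concrete, here is a minimal sketch in plain Java. It is not Lucene or Solr code and does not use HTMLStripCharFilter; the class HtmlOffsetSketch and its methods tokenize and highlight are hypothetical names made up for illustration. It assumes simple, well-formed HTML: no entities such as &amp;, no comments or scripts, and no '>' inside attribute values. It only shows the principle of skipping tags while keeping each token's position in the original document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a toy tag-skipping tokenizer, not Lucene's implementation.
final class HtmlOffsetSketch {

    /** A token plus its offsets in the ORIGINAL HTML string (start inclusive, end exclusive). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits the text content into letter/digit tokens; tags are skipped and act as boundaries. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        boolean inTag = false;
        int tokenStart = -1; // -1 means "no token in progress"
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            boolean wordChar = !inTag && c != '<' && Character.isLetterOrDigit(c);
            if (wordChar) {
                if (tokenStart < 0) {
                    tokenStart = i; // offset into the original HTML, not the stripped text
                }
            } else if (tokenStart >= 0) {
                tokens.add(new Token(html.substring(tokenStart, i), tokenStart, i));
                tokenStart = -1;
            }
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            }
        }
        if (tokenStart >= 0) {
            tokens.add(new Token(html.substring(tokenStart), tokenStart, html.length()));
        }
        return tokens;
    }

    /** Wraps every token equal to term (ignoring case) in <em> tags inside the original HTML. */
    static String highlight(String html, String term) {
        StringBuilder out = new StringBuilder(html);
        List<Token> tokens = tokenize(html);
        // Work from the last token to the first so earlier offsets stay valid after each insert.
        for (int i = tokens.size() - 1; i >= 0; i--) {
            Token t = tokens.get(i);
            if (t.text.equalsIgnoreCase(term)) {
                out.insert(t.end, "</em>");
                out.insert(t.start, "<em>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("<p>Hello <b>world</b></p>", "world"));
    }
}
```

Because the offsets point into the original markup, the highlighted result keeps all of the original tags. In a real Lucene analyzer the char filter and tokenizer do this bookkeeping for you, and the highlighter reads the corrected offsets from the token stream.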

