lucene-java-user mailing list archives

From Hans Merkl <hme...@rightonpoint.us>
Subject Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?
Date Tue, 08 Jun 2010 11:41:57 GMT
Hi Ahmet,

I am using Lucene.NET with C#, so I can't test this quickly.
Will HTMLStripCharFilter maintain the character offsets into the original
HTML, or does it just extract the plain text?

Hans


> You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to
> add one or more org.apache.lucene.analysis.CharFilter(s) before tokenizer in
> your analyzer.
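
The offset question above can be illustrated with a small standalone sketch.
It is not the HTMLStripCharFilter implementation and uses no Lucene or Solr
API; the class TagStripper and its methods are hypothetical names made up for
this example. It assumes simple markup only (no entities, comments, scripts,
or '<' inside attribute values). The point is just that a stripper can record,
for every character it keeps, where that character sat in the original
document, so that token offsets computed on the stripped text can be mapped
back for highlighting.

```java
import java.util.Arrays;

// Hypothetical illustration only: strips <...> tags and remembers, for each
// kept character, its index in the original input.
final class TagStripper {
    final String text;               // input with tags removed
    private final int[] origOffset;  // origOffset[i] = original index of text.charAt(i)

    private TagStripper(String text, int[] origOffset) {
        this.text = text;
        this.origOffset = origOffset;
    }

    static TagStripper strip(String html) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                // Skip everything up to and including the closing '>'.
                if (c == '>') {
                    inTag = false;
                }
                continue;
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            map[out.length()] = i;  // remember where this character came from
            out.append(c);
        }
        return new TagStripper(out.toString(), Arrays.copyOf(map, out.length()));
    }

    // Maps a token's start offset in the stripped text to the original input.
    int originalStart(int strippedStart) {
        return origOffset[strippedStart];
    }

    // Maps a token's exclusive end offset (must be >= 1). Using the last
    // character of the token keeps trailing tags out of the highlighted span.
    int originalEnd(int strippedEnd) {
        return origOffset[strippedEnd - 1] + 1;
    }
}
```

For the input "<p>Hello <b>world</b></p>" the stripped text is "Hello world".
A tokenizer would report "world" at offsets 6 to 11 in that text, and
originalStart(6) and originalEnd(11) map those back to the span of "world" in
the original HTML, which is the range a highlighter would need.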
