lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Preserving original HTML file offsets for highlighting
Date Mon, 24 Jan 2011 13:47:26 GMT
You can use HTMLStripCharFilter, plugged into the analysis chain before the
Tokenizer. It strips all HTML markup but corrects the token offsets so that
they still refer to the original HTML, so you can later highlight using
those offsets.

This filter is currently released only through Apache Solr, but in Lucene
4.0 it's part of the analysis module.
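
To make the offset idea concrete, here is a minimal, self-contained sketch
in plain Java. It is not the Solr/Lucene implementation and uses no Lucene
API: the class OffsetPreservingStripper and its methods strip, locate and
highlight are hypothetical names made up for illustration. It assumes
well-formed tags, ignores entities, comments and scripts, and only handles
a term that does not span a tag boundary.

```java
import java.util.Arrays;

// Illustration only: a hand-rolled tag stripper that remembers where each
// surviving character came from. All names here are hypothetical.
final class OffsetPreservingStripper {

    /** Tag-free text plus, for each of its characters, that character's index in the original HTML. */
    static final class Stripped {
        final String text;
        final int[] originalIndex;

        Stripped(String text, int[] originalIndex) {
            this.text = text;
            this.originalIndex = originalIndex;
        }
    }

    /** Drops everything between '<' and '>' and records the original index of every kept character. */
    static Stripped strip(String html) {
        StringBuilder text = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                map[text.length()] = i;
                text.append(c);
            }
        }
        return new Stripped(text.toString(), Arrays.copyOf(map, text.length()));
    }

    /** Returns {start, end} (end exclusive) of the first hit in the ORIGINAL HTML, or null if there is none. */
    static int[] locate(String html, String term) {
        if (term.isEmpty()) {
            return null;
        }
        Stripped s = strip(html);
        int start = s.text.indexOf(term);
        if (start < 0) {
            return null;
        }
        int last = start + term.length() - 1;
        return new int[] { s.originalIndex[start], s.originalIndex[last] + 1 };
    }

    /** Wraps the first hit in <em>...</em> inside the original HTML. */
    static String highlight(String html, String term) {
        int[] span = locate(html, term);
        if (span == null) {
            return html;
        }
        return html.substring(0, span[0]) + "<em>" + html.substring(span[0], span[1])
                + "</em>" + html.substring(span[1]);
    }

    public static void main(String[] args) {
        // prints <p>Hello <b><em>world</em></b></p>
        System.out.println(highlight("<p>Hello <b>world</b></p>", "world"));
    }
}
```

The search runs on the stripped text, while the highlight markup is
inserted into the untouched original, which is the same division of labour
a char filter with corrected offsets gives you.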

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Karolina Bernat [mailto:karolina.bernat@googlemail.com]
> Sent: Monday, January 24, 2011 2:03 PM
> To: java-user@lucene.apache.org
> Subject: Preserving original HTML file offsets for highlighting
> 
> Hi all,
> 
> I'm new to Lucene and have a question about indexing/highlighting of HTML
> files with Lucene.
> 
> What I need to do is highlight the hits (terms) in the original HTML file
> (or get the positions of the terms/tokens in the original file).
> This problem has already been described by Fred Toth in this thread in
> 2005 (Preserving original HTML file offsets for highlighting, need
> HTMLTokenizer?):
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3C6.2.1.2.2.20050530134630.063ae978@fast.synernet.com%3E
> 
> I've searched the mailing list archives hoping for an answer, but I had
> no luck.
> 
> Does anyone know whether there is a solution to this problem? If you know
> that it's not possible with Lucene to highlight the hits in the original
> HTML file, that would also be helpful to know (I could stop looking for
> it...).
> 
> Many thanks in advance!
> Karo
> 
> P.S. Actually I wanted to answer the original thread/question from 2005 -
> is there a way to do this? How can I post an answer to an old thread/mail
> from the mailing list?



