lucene-java-user mailing list archives

From Oystein Reigem <>
Subject Highlighting of original documents
Date Tue, 13 Mar 2007 14:59:20 GMT

I want to implement fulltext search on a collection of documents. I am 
trying to figure out which system is the better choice: eXist, Lucene, or 
some combination of the two. I have some knowledge of eXist, but don't 
know much about Lucene.

I'd like to display the result of a search as a list of 
excerpts/snippets with highlighted search words. When the user clicks an 
item in the result list to bring up the document in full, I'd like to 
have search words highlighted in the full document as well.

The document collection is very diverse. There are pure text documents 
and well-formed XML and HTML documents, but unfortunately also HTML 
documents that are not quite well-formed, Word documents and PDFs. Many 
of the formats go beyond what eXist and Lucene can handle, and I realise 
some conversion, or text extraction, is necessary. As far as I know, 
Lucene can only index and search pure text (and fields), so the 
documents must first be run through appropriate filters that extract the 
text (and field values). After that, fulltext search is possible.
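
To illustrate the extraction step for the simplest case (well-formed XML), here is a minimal sketch using only the JDK's DOM parser. It is not part of Lucene or eXist; the class and method names (`XmlTextExtractor`, `extractText`) are made up for illustration, and Word, PDF and malformed HTML would need other filters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: reduces a well-formed XML document to pure text
// that could then be handed to an indexer.
class XmlTextExtractor {

    static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            // Collapse runs of whitespace left over from the markup layout.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk that appends every text node, separated by a space so
    // that words in adjacent elements do not run together. (Assumption: a
    // single word is never split by inline markup; if it is, it is broken up.)
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Fulltext search</title>"
                + "<p>Lucene indexes <b>plain</b> text.</p></doc>";
        System.out.println(extractText(xml));
    }
}
```

The extracted string keeps no record of where each word sat in the original markup, which is exactly why highlighting in the original document is the harder problem.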

But what about highlighting? I know it is possible to get highlighting 
in the pure text version, but what about the original document, when the 
original document is something other than pure text, e.g., a simple XML 
document? Is it at all possible to get the search words tagged in the 
XML document?

I assume not, but ask anyway. :-)
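
To make the question concrete, the kind of result meant here could be sketched as follows. This is plain DOM manipulation with the JDK only, using naive case-insensitive whole-word matching; it does not use Lucene's index or analyzers, so it says nothing about whether Lucene itself can do this. The wrapper element name `hit` and the class name `XmlHighlighter` are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper: wraps each occurrence of a search word inside the
// text nodes of an XML document in a <hit> element, leaving the rest of
// the markup as it was.
class XmlHighlighter {

    static String highlight(String xml, String word) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Pattern pattern = Pattern.compile(
                    "\\b" + Pattern.quote(word) + "\\b",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

            // Collect first, modify afterwards, so the tree walk is not
            // disturbed by the nodes being inserted.
            List<Text> texts = new ArrayList<>();
            collectTextNodes(doc.getDocumentElement(), texts);
            for (Text t : texts) {
                wrapMatches(doc, t, pattern);
            }

            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    private static void collectTextNodes(Node node, List<Text> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.add((Text) node);
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectTextNodes(c, out);
        }
    }

    // Replaces one text node by a sequence of text nodes and <hit> elements.
    private static void wrapMatches(Document doc, Text text, Pattern pattern) {
        Node parent = text.getParentNode();
        String s = text.getData();
        Matcher m = pattern.matcher(s);
        int last = 0;
        boolean found = false;
        while (m.find()) {
            found = true;
            if (m.start() > last) {
                parent.insertBefore(
                        doc.createTextNode(s.substring(last, m.start())), text);
            }
            Element hit = doc.createElement("hit");
            hit.setTextContent(m.group());
            parent.insertBefore(hit, text);
            last = m.end();
        }
        if (found) {
            if (last < s.length()) {
                parent.insertBefore(doc.createTextNode(s.substring(last)), text);
            }
            parent.removeChild(text);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                highlight("<p>Search the <i>search</i> index</p>", "search"));
    }
}
```

A sketch like this only matches literal words within a single text node; it would not find stemmed or otherwise analyzed forms that the index might match, nor words interrupted by inline markup.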


- Øystein -

Øystein Reigem, The department of culture, language and information technology (Aksis),
Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70.
Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64.
