lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oystein Reigem <oystein.rei...@aksis.uib.no>
Subject Re: Highlighting of original documents
Date Tue, 13 Mar 2007 16:21:55 GMT
Mark Miller wrote:

> Depends on the work you want to do. If you want to highlight a simple 
> XML doc the approach would be to extract all of the text elements and 
> run them through the highlighter and then correctly update them. That 
> would be mostly simple DOM manipulation.

OK.

I guess there will be some details that need special attention. One case 
that springs to mind is the occurrence of words that in the original 
document are broken up by encoding, like "en<hyphen/>coding" or 
"<em>mid</em>range".

> The same approach should work with any format but the difficulty in 
> modifying the text may increase. If you can pull the text out 
> appropriately it would seem you could put it back in though, or modify 
> it in place as you might with the DOM.

Do you know if tools (classes) for "appropriate" extraction from "my" 
file formats already exist in Lucene? I.e, something that not just 
extracts the text, but keeps track of its position in the original?

I saw POI <http://jakarta.apache.org/poi/> mentioned in a posting on 
this list. Perhaps a solution for Word documents can be based on POI.

- Øystein -

>
> - Mark
>
> Oystein Reigem wrote:
>
>> Hi,
>>
>> I want to implement fulltext search on a collection of documents. I 
>> try to figure out which system is the better choice - eXist, or 
>> Lucene, or some combination of the two. I have some knowledge of 
>> eXist, but don't know too much about Lucene.
>>
>> I'd like to display the result of a search as a list of 
>> excerpts/snippets with highlighted search words. When the user clicks 
>> an item in the result list to bring up the document in full, I'd like 
>> to have search words highlighted in the full document as well.
>>
>> The document collection is very diverse. There are pure text 
>> documents and well-formed XML and HTML documents, but unfortunately 
>> also HTML documents that are not quite well-formed, Word documents 
>> and PDFs. Many of the formats go beyond what eXist and Lucene can 
>> handle, and I realise some conversion, or text extraction, is 
>> necessary. As far as I know Lucene can only index and search pure 
>> text (and fields), so the documents must be run through appropriate 
>> filters extracting the text (and field values). Afterwards fulltext 
>> search is possible.
>>
>> But what about highlighting? I know it is possible to get 
>> highlighting in the pure text version, but what about the original 
>> document, when the original document is something else than pure 
>> text, e.g, a simple XML document? Is it at all possible to get the 
>> search words tagged in the XML document?
>>
>> I assume not, but ask anyway. :-)
>>
>> Cheers,
>>
>> - Øystein -
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


-- 
Øystein Reigem, The department of culture, language and information technology (Aksis), Allegt
27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70. E-mail: <oystein.reigem@aksis.uib.no>.
Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64. Home e-mail: <oreigem@broadpark.no>.
Aksis home page: <www.aksis.uib.no>.


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message