lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ulrich Mayring <>
Subject Re: Summarization; sentence-level and document-level filters.
Date Wed, 17 Dec 2003 09:10:08 GMT
Gregor Heinrich wrote:
> Yes, copying a summary from one field to an untokenized field was the plan.
> I identified DocumentWriter.invertDocument() to be a possible place for an
> addition of this document-level analysis. But I admit this appears way too
> low-level and inflexible for the overall design.
> So I'll make it "two-pass" indexing.

The way I did it: I'm indexing HTML documents, so before Lucene can do 
anything I need to run a HTML parser. This parser, while scanning the 
tags, builds two text strings at the same time: one that contains the 
document content for indexing and one that contains it for summarizing.

There are relevant differences between those two strings, for example in 
handling of headlines and punctuation. Lucene then gets to index the 
first string and the second is given to a summarizer. The summary it 
returns is added to a Lucene field.

This way I can do summarizing and indexing in one pass.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message