lucene-java-user mailing list archives

From Doug Cutting
Subject Re: Summarization; sentence-level and document-level filters.
Date Tue, 16 Dec 2003 17:57:00 GMT
It sounds like you want the value of a stored field (a summary) to be 
built from the tokens of another field of the same document.  Is that 
right?  This is not presently possible without tokenizing the field 
twice, once to produce its summary and once again when indexing.
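
A minimal sketch of the two-pass workaround, in plain Java rather than against
the Lucene API. Every name here (TwoPassSummary, tokenize, summarize) is
hypothetical, and the "summarizer" is only a placeholder that keeps the first
sentence. The point is the shape of the workaround: the same text is tokenized
once to build the value of the stored summary field and once more for indexing.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch; none of these names are Lucene classes. */
final class TwoPassSummary {

    /**
     * Hypothetical analysis step: lower-cased word tokens. When
     * keepPunctuation is true, '.' and ',' are emitted as tokens too.
     */
    static List<String> tokenize(String text, boolean keepPunctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
            } else {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (keepPunctuation && (c == '.' || c == ',')) {
                    tokens.add(String.valueOf(c));
                }
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    /** Placeholder summarizer (an assumption): words of the first sentence. */
    static String summarize(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (t.equals(".")) break;      // end of first sentence
            if (t.equals(",")) continue;   // drop commas from the summary
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = "Lucene indexes text. It does not summarize, though.";
        // Pass 1: tokenize to build the value of the stored summary field.
        String summary = summarize(tokenize(body, true));
        // Pass 2: tokenize the same text again, as indexing would.
        List<String> indexed = tokenize(body, false);
        System.out.println(summary);
        System.out.println(indexed);
    }
}
```

In a real application the summary string would go into a stored field and the
body into an indexed field of the same document; the cost is that the text is
analyzed twice.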


Gregor Heinrich wrote:
> Hi,
> is there any possibility to do sentence-level or document-level analysis
> with the current Analysis/TokenStream architecture? Or where else is the
> best place to plug in customised document-level and sentence-level analysis
> features? Is there any precedent?
> My technical problem:
> I'd like to include a summarization feature into my system, which should (1)
> best make use of the architecture already there in Lucene, and (2) should be
> able to trigger summarization on a per-document basis while requiring
> sentence-level information, such as full-stops and commas. To preserve this
> "punctuation", a special Tokenizer can be used that outputs such landmarks
> as tokens instead of filtering them out. The actual SummaryFilter then
> filters out the punctuation for its successors in the Analyzer's filter
> chain.
> The other, more complex thing is the document-level information: As Lucene's
> architecture uses a filter concept that does not know about the document the
> tokens are generated from (which is good abstraction), a document-specific
> operation like summarization is a bit of an awkward thing with this (and
> originally not intended, I guess). On the other hand, I'd like to have the
> existing filter structure in place for preprocessing of the input, because
> my raw texts are generated by converters from other formats that output
> unwanted chars (from figures, page numbers, etc.), which are filtered out
> anyway by my custom Analyzer.
> Any idea how to solve this second problem? Is there any support for such
> document / sentence structure analysis planned?
> Thanks and regards,
> Gregor
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
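
The filter described in the quoted message (punctuation arrives as tokens, and
a summary filter records sentences while hiding the punctuation from its
successors) can be sketched in plain Java. This is not a Lucene TokenFilter:
SummaryFilterSketch and sentences() are hypothetical names, and a plain
Iterator<String> stands in for the token stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-Java sketch of a pass-through filter; not a Lucene TokenFilter. */
final class SummaryFilterSketch implements Iterator<String> {
    private final Iterator<String> input;                 // upstream tokens
    private final List<String> sentences = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private String pending;                               // next token to hand downstream

    SummaryFilterSketch(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        while (pending == null && input.hasNext()) {
            String t = input.next();
            if (t.equals(".")) {
                flushSentence();                          // sentence boundary, not passed on
            } else if (!t.equals(",")) {
                if (current.length() > 0) current.append(' ');
                current.append(t);
                pending = t;                              // word tokens pass through
            }
        }
        if (pending == null) {
            flushSentence();                              // input exhausted
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String t = pending;
        pending = null;
        return t;
    }

    private void flushSentence() {
        if (current.length() > 0) {
            sentences.add(current.toString());
            current.setLength(0);
        }
    }

    /** Sentences seen so far; complete only once the input is consumed. */
    List<String> sentences() {
        return sentences;
    }

    public static void main(String[] args) {
        Iterator<String> tokens = Arrays.asList(
                "lucene", "indexes", "text", ".",
                "it", "does", "not", "summarize", ",", "though", ".").iterator();
        SummaryFilterSketch filter = new SummaryFilterSketch(tokens);
        List<String> downstream = new ArrayList<>();
        while (filter.hasNext()) {
            downstream.add(filter.next());
        }
        System.out.println(downstream);
        System.out.println(filter.sentences());
    }
}
```

The sketch only shows the filter's bookkeeping. The sentence list is complete
only after the whole stream has been consumed, and nothing here addresses how
it would reach a stored field of the same document, which is the part the
reply says is not presently possible without tokenizing twice.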
