lucene-java-user mailing list archives

From "Gregor Heinrich" <>
Subject Summarization; sentence-level and document-level filters.
Date Mon, 15 Dec 2003 18:41:10 GMT

Is it possible to do sentence-level or document-level analysis
with the current Analysis/TokenStream architecture? If not, where is the
best place to plug in customised document-level and sentence-level analysis
features? Is there any precedent for this?

My technical problem:

I'd like to add a summarization feature to my system. It should (1)
make the best use of the architecture already present in Lucene, and (2) be
able to trigger summarization on a per-document basis while still having
access to sentence-level information, such as full stops and commas. To
preserve this punctuation, a special Tokenizer can be used that emits such
landmarks as tokens instead of filtering them out. The actual SummaryFilter
then removes the punctuation tokens before passing the stream on to its
successors in the Analyzer's filter chain.
The other, more complex issue is the document-level information. Lucene's
architecture uses a filter concept that does not know which document the
tokens come from (which is a good abstraction), so a document-specific
operation like summarization fits awkwardly into it (and was probably not
intended originally). On the other hand, I'd like to keep the existing
filter structure for preprocessing the input, because my raw texts are
produced by converters from other formats that emit unwanted characters
(from figures, page numbers, etc.), and my custom Analyzer already filters
these out.
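
One possible shape for the document-level part, again as a plain-Java sketch under stated assumptions rather than anything Lucene provides: the existing cleanup is modelled as a simple function from text to text, the document identity is supplied by the caller instead of by the filter chain, and "summary" just means the first few sentences. PerDocumentSummaries and its methods are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; no Lucene classes are used or implied.
final class PerDocumentSummaries {
    private final UnaryOperator<String> cleaner;  // stands in for the existing filter chain
    private final int maxSentences;
    private final Map<String, List<String>> summaries = new LinkedHashMap<>();

    PerDocumentSummaries(UnaryOperator<String> cleaner, int maxSentences) {
        this.cleaner = cleaner;
        this.maxSentences = maxSentences;
    }

    /** Cleans one document, splits it into sentences, and stores the leading ones under its id. */
    void add(String docId, String rawText) {
        String cleaned = cleaner.apply(rawText);
        List<String> picked = new ArrayList<>();
        // Split after '.', '!' or '?' followed by whitespace (a deliberately naive rule).
        for (String part : cleaned.split("(?<=[.!?])\\s+")) {
            String sentence = part.trim();
            if (sentence.isEmpty()) {
                continue;
            }
            if (picked.size() == maxSentences) {
                break;
            }
            picked.add(sentence);
        }
        summaries.put(docId, picked);
    }

    /** Returns the stored summary, or an empty list for an unknown document id. */
    List<String> summaryOf(String docId) {
        return summaries.getOrDefault(docId, Collections.emptyList());
    }
}
```

The point of the design is only that the document boundary lives outside the token filters: the caller knows which document it is feeding in, so the per-document state can be kept there while the cleanup step stays document-agnostic.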

Any idea how to solve this second problem? Is there any support for such
document / sentence structure analysis planned?

Thanks and regards,

