lucene-java-user mailing list archives

From "Gregor Heinrich" <>
Subject RE: Summarization; sentence-level and document-level filters.
Date Tue, 16 Dec 2003 20:31:42 GMT
Yes, copying a summary from one field to an untokenized field was the plan.

I identified DocumentWriter.invertDocument() as a possible place to add this
document-level analysis, but I admit that it is far too low-level and
inflexible for the overall design.

So I'll make it "two-pass" indexing.
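To make the "two-pass" idea concrete, here is a minimal plain-Java sketch:
pass one tokenizes the raw text once and builds a summary from it; pass two
would then hand both values to Lucene. The class and method names and the toy
sentence splitter are purely illustrative; the commented-out calls refer to
the Lucene 1.x Field factory methods.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPassIndexing {
    // Pass one: tokenize the raw text outside the index and build a
    // summary from it (here crudely: the first maxSentences sentences,
    // split on sentence-final punctuation followed by whitespace).
    static String summarize(String text, int maxSentences) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (sentences.size() < maxSentences) {
                sentences.add(s.trim());
            }
        }
        return String.join(" ", sentences);
    }

    public static void main(String[] args) {
        String body = "Lucene is a search library. It indexes documents. "
                    + "Filters see one token at a time.";
        String summary = summarize(body, 2);
        // Pass two would add both fields to the same Document, roughly:
        //   doc.add(Field.Text("body", body));            // tokenized, indexed
        //   doc.add(Field.UnIndexed("summary", summary)); // stored untokenized
        System.out.println(summary);
    }
}
```

The point of the sketch is only the ordering: the summary exists before the
Document is handed to the IndexWriter, so no change to
DocumentWriter.invertDocument() is needed.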

Thanks for the decision support,


-----Original Message-----
From: Doug Cutting []
Sent: Tuesday, December 16, 2003 6:57 PM
To: Lucene Users List
Subject: Re: Summarization; sentence-level and document-level filters.

It sounds like you want the value of a stored field (a summary) to be
built from the tokens of another field of the same document.  Is that
right?  This is not presently possible without tokenizing the field
twice, once to produce its summary and once again when indexing.


Gregor Heinrich wrote:
> Hi,
> Is there any way to do sentence-level or document-level analysis with the
> current Analysis/TokenStream architecture? If not, where is the best place
> to plug in customised document-level and sentence-level features? Is there
> any precedent?
> My technical problem:
> I'd like to include a summarization feature in my system, which should (1)
> make the best use of the architecture already in Lucene, and (2) be able to
> trigger summarization on a per-document basis while requiring
> sentence-level information, such as full stops and commas. To preserve
> punctuation, a special Tokenizer can be used that outputs such landmarks as
> tokens instead of filtering them out. The actual SummaryFilter then filters
> the punctuation out for its successors in the Analyzer's filter chain.
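The tokenizer-plus-filter chain described above can be sketched in plain Java.
This is not Lucene's actual TokenStream API; the class and method names are
illustrative only. The tokenizer emits sentence punctuation as its own tokens,
and the "SummaryFilter" uses those landmarks (here it just counts sentence
boundaries) before dropping them for the rest of the chain.

```java
import java.util.ArrayList;
import java.util.List;

public class SummaryChain {
    // Tokenizer stage: emit words, and sentence punctuation (. ! ?) as
    // separate landmark tokens instead of discarding it.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (t.matches(".*[.!?]$")) {
                tokens.add(t.substring(0, t.length() - 1));
                tokens.add(t.substring(t.length() - 1)); // punctuation token
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "SummaryFilter" stage: consume the landmarks (count sentence
    // boundaries) and pass only non-punctuation tokens downstream.
    static List<String> summaryFilter(List<String> in, int[] sentenceCount) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (t.matches("[.!?]")) {
                sentenceCount[0]++;
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

Downstream filters in the chain see a punctuation-free token stream, as in the
message above, while the summary stage still gets the sentence structure it
needs.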
> The other, more complex issue is the document-level information: as the
> architecture uses a filter concept that does not know about the document
> the tokens are generated from (which is good abstraction), a
> document-specific operation like summarization is awkward to fit in (and
> was probably not originally intended). On the other hand, I'd like to keep
> the existing filter structure in place for preprocessing the input, because
> my raw texts are generated by converters from other formats that emit
> unwanted characters (from figures, page numbers, etc.), which my custom
> Analyzer filters out anyway.
> Any idea how to solve this second problem? Is there any support for such
> document / sentence structure analysis planned?
> Thanks and regards,
> Gregor
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

