From: ambiesense@gmx.de
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Date: Tue, 16 Dec 2003 20:57:39 +0100 (MET)
Subject: Re: Summarization; sentence-level and document-level filters.

Hello Gregor and Maurits,

I am not quite sure what you want to do. As I understand it, you want to search the normal text but present the summarized text on screen, with the user able to get the full text on request. Is this the case?
If this is the case, then you could create a set of summarized texts from the full texts, create another index for them, and add an extra field holding a key to the text which is not summarized. You could use this field to find the summarized version of a full text, and to retrieve the full text from the summarized text in order to present it to the user. In this case you would put your summarizer before the analyzer (in terms of workflow), which would fit perfectly into the existing concept of Lucene.

I am not sure I have caught your idea, though. Please educate me further if I misunderstood something...

Cheers,
Ralf

> Hi Gregor,
>
> So far as I know there is no summarizer in the plans, but maybe I can help
> you along the way. Have a look at the Classifier4J project on SourceForge:
>
> http://classifier4j.sourceforge.net/
>
> Besides a Bayes classifier it has a small document summarizer, which might
> speed up your coding.
>
> On the level of Lucene, I have no idea. My gut feeling says that a summary
> should be built before the text is tokenized! The tokenizer can of course
> be used when analysing a document, but hooking into the Lucene indexing
> is a bad idea, I think.
>
> Does anyone else have any ideas?
>
> regards,
>
> Maurits
>
> ----- Original Message -----
> From: "Gregor Heinrich"
> To: "'Lucene Users List'"
> Sent: Monday, December 15, 2003 7:41 PM
> Subject: Summarization; sentence-level and document-level filters.
>
> > Hi,
> >
> > Is there any possibility to do sentence-level or document-level analysis
> > with the current Analysis/TokenStream architecture? Or where else is the
> > best place to plug in customised document-level and sentence-level
> > analysis features? Is there any "precedence case"?
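[A minimal sketch of the keyed-summary approach Ralf describes above, in plain Java rather than the Lucene API. The class and method names are illustrative, and the summarise() heuristic (keep the first n sentences) is only a stand-in for a real summarizer such as Classifier4J's:]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SummaryIndexSketch {

    // Naive extractive summary: keep the first n sentences.
    // A real system would plug in a proper summarizer here.
    static String summarise(String text, int n) {
        Matcher m = Pattern.compile("[^.!?]+[.!?]").matcher(text);
        StringBuilder sb = new StringBuilder();
        int count = 0;
        while (count < n && m.find()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(m.group().trim());
            count++;
        }
        return sb.toString();
    }

    // Stand-ins for the two indexes: the summary index is what gets searched
    // and displayed, and the shared key maps a summary hit back to its full text.
    static final Map<String, String> fullTexts = new HashMap<>();
    static final Map<String, String> summaries = new HashMap<>();

    static void addDocument(String key, String fullText) {
        fullTexts.put(key, fullText);                // full-text side
        summaries.put(key, summarise(fullText, 2));  // summary side, keyed field
    }

    // On user request, the key stored with the summary retrieves the full text.
    static String fullTextFor(String key) {
        return fullTexts.get(key);
    }
}
```

[In actual Lucene the key would be a stored, unanalysed field on the summary document; as Ralf notes, the summarizer runs before the Analyzer in the workflow.]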
> >
> > My technical problem:
> >
> > I'd like to include a summarization feature in my system, which should
> > (1) make best use of the architecture already there in Lucene, and (2)
> > be able to trigger summarization on a per-document basis while requiring
> > sentence-level information, such as full stops and commas. To preserve
> > this "punctuation", a special Tokenizer can be used that outputs such
> > landmarks as tokens instead of filtering them out. The actual
> > SummaryFilter then filters out the punctuation for its successors in the
> > Analyzer's filter chain.
> >
> > The other, more complex thing is the document-level information: as
> > Lucene's architecture uses a filter concept that does not know about the
> > document the tokens are generated from (which is good abstraction), a
> > document-specific operation like summarization is a bit awkward here
> > (and originally not intended, I guess). On the other hand, I'd like to
> > keep the existing filter structure in place for preprocessing of the
> > input, because my raw texts are generated by converters from other
> > formats that output unwanted chars (from figures, page numbers, etc.),
> > which are filtered out by my custom Analyzer anyway.
> >
> > Any idea how to solve this second problem? Is there any support for such
> > document / sentence structure analysis planned?
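[Gregor's two-stage idea — a Tokenizer that emits punctuation landmarks as tokens, followed by a SummaryFilter that consumes and then strips them — could be sketched outside the Lucene API roughly like this. Class and method names are illustrative, not Lucene's:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LandmarkPipeline {

    // Tokenizer stage: emit words AND sentence landmarks (".", ",", "!", "?")
    // as tokens, so downstream stages can still see sentence structure.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[.,!?]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Example landmark use: count sentences from the terminator tokens.
    static int sentenceCount(List<String> tokens) {
        int count = 0;
        for (String t : tokens) {
            if (t.equals(".") || t.equals("!") || t.equals("?")) count++;
        }
        return count;
    }

    // "SummaryFilter" stage: after using the landmarks, strip them so the
    // successors in the filter chain see ordinary word tokens again.
    static List<String> stripLandmarks(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!t.matches("[.,!?]")) out.add(t);
        }
        return out;
    }
}
```

[The document-level problem remains as Gregor states it: since each filter only sees a token stream, the summary stage would have to buffer the whole stream per document before emitting anything.]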
> >
> > Thanks and regards,
> >
> > Gregor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org