From: ambiesense@gmx.de
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Date: Tue, 16 Dec 2003 20:57:39 +0100 (MET)
Subject: Re: Summarization; sentence-level and document-level filters.

Hello Gregor and Maurits,

I am not quite sure what you want to do. As I understand it, you want to search the normal text but present the summarized text on screen, with the user able to get the full text on request. Is this the case?
If this is the case, then you could create a set of summarized texts from the full texts, create another index for them, and add an extra field holding a key to the text which is not summarized. You could use this field to find the summarized version of a full text, and to retrieve the full text from the summarized text in order to present it to the user. In this case you would put your summarizer before the analyzer (in terms of workflow), which would fit perfectly into the existing concept of Lucene.

I am not sure I have caught your idea, though. Please educate me further if I misunderstood something...

Cheers,
Ralf

> Hi Gregor,
>
> So far as I know there is no summarizer in the plans, but maybe I can help
> you along the way. Have a look at the Classifier4J project on SourceForge:
>
> http://classifier4j.sourceforge.net/
>
> Besides a Bayes classifier it has a small document summarizer, which might
> speed up your coding.
>
> On the level of Lucene, I have no idea. My gut feeling says that a summary
> should be built before the text is tokenized! The tokenizer can of course
> be used when analysing a document, but hooking into the Lucene indexing
> is a bad idea, I think.
>
> Does anyone else have any ideas?
>
> regards,
>
> Maurits
>
> ----- Original Message -----
> From: "Gregor Heinrich"
> To: "'Lucene Users List'"
> Sent: Monday, December 15, 2003 7:41 PM
> Subject: Summarization; sentence-level and document-level filters.
>
> > Hi,
> >
> > Is there any possibility to do sentence-level or document-level analysis
> > with the current Analysis/TokenStream architecture? Or where else is the
> > best place to plug in customised document-level and sentence-level
> > analysis features? Is there any "precedence case"?
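[A minimal sketch of the keyed-summary approach Ralf describes above, in plain Java rather than the Lucene API. The class and method names are illustrative, and the summarise() heuristic (keep the first n sentences) is only a stand-in for a real summarizer such as Classifier4J's:]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SummaryIndexSketch {

    // Naive extractive summary: keep the first n sentences.
    // A real system would plug in a proper summarizer here.
    static String summarise(String text, int n) {
        Matcher m = Pattern.compile("[^.!?]+[.!?]").matcher(text);
        StringBuilder sb = new StringBuilder();
        int count = 0;
        while (count < n && m.find()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(m.group().trim());
            count++;
        }
        return sb.toString();
    }

    // Stand-ins for the two indexes: the summary index is what gets searched
    // and displayed, and the shared key maps a summary hit back to its full text.
    static final Map<String, String> fullTexts = new HashMap<>();
    static final Map<String, String> summaries = new HashMap<>();

    static void addDocument(String key, String fullText) {
        fullTexts.put(key, fullText);                // full-text side
        summaries.put(key, summarise(fullText, 2));  // summary side, keyed field
    }

    // On user request, the key stored with the summary retrieves the full text.
    static String fullTextFor(String key) {
        return fullTexts.get(key);
    }
}
```

[In actual Lucene the key would be a stored, unanalysed field on the summary document; as Ralf notes, the summarizer runs before the Analyzer in the workflow.]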
> >
> > My technical problem:
> >
> > I'd like to include a summarization feature in my system, which should
> > (1) make best use of the architecture already there in Lucene, and (2)
> > be able to trigger summarization on a per-document basis while requiring
> > sentence-level information, such as full stops and commas. To preserve
> > this "punctuation", a special Tokenizer can be used that outputs such
> > landmarks as tokens instead of filtering them out. The actual
> > SummaryFilter then filters out the punctuation for its successors in the
> > Analyzer's filter chain.
> >
> > The other, more complex thing is the document-level information: as
> > Lucene's architecture uses a filter concept that does not know about the
> > document the tokens are generated from (which is good abstraction), a
> > document-specific operation like summarization is a bit awkward here
> > (and originally not intended, I guess). On the other hand, I'd like to
> > keep the existing filter structure in place for preprocessing of the
> > input, because my raw texts are generated by converters from other
> > formats that output unwanted chars (from figures, page numbers, etc.),
> > which are filtered out by my custom Analyzer anyway.
> >
> > Any idea how to solve this second problem? Is there any support for such
> > document / sentence structure analysis planned?
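[Gregor's two-stage idea — a Tokenizer that emits punctuation landmarks as tokens, followed by a SummaryFilter that consumes and then strips them — could be sketched outside the Lucene API roughly like this. Class and method names are illustrative, not Lucene's:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LandmarkPipeline {

    // Tokenizer stage: emit words AND sentence landmarks (".", ",", "!", "?")
    // as tokens, so downstream stages can still see sentence structure.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[.,!?]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Example landmark use: count sentences from the terminator tokens.
    static int sentenceCount(List<String> tokens) {
        int count = 0;
        for (String t : tokens) {
            if (t.equals(".") || t.equals("!") || t.equals("?")) count++;
        }
        return count;
    }

    // "SummaryFilter" stage: after using the landmarks, strip them so the
    // successors in the filter chain see ordinary word tokens again.
    static List<String> stripLandmarks(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!t.matches("[.,!?]")) out.add(t);
        }
        return out;
    }
}
```

[The document-level problem remains as Gregor states it: since each filter only sees a token stream, the summary stage would have to buffer the whole stream per document before emitting anything.]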
> >
> > Thanks and regards,
> >
> > Gregor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org