Subject: AW: Workspace.copy() Question ...
Date: Tue, 18 Nov 2008 13:09:30 +0100
From: KÖLL Claus

Hi guys,

for my understanding ...

> Probably not in the Lucene index files itself. Text extraction could be
> used without using the Lucene index, for example to display the text
> content of a PDF file. The text extraction module could store the
> DataIdentifier together with the extracted text ('payload'). The
> advantage to store this 'payload' near the actual binary is that the
> data is deleted when the binary is garbage collected. So maybe it
> actually is better to store the 'payload' (extracted text, virus
> scanner flag, thumbnail) near the binary, so it is automatically garbage
> collected when the binary is garbage collected. We would need to define
> an API and the behavior for this 'payload storage'. It probably doesn't
> need to be transactional, but it needs to be consistent (a checksum).
> Some kind of binary properties file maybe, with put(String key,
> InputStream payload), and InputStream get(String key).

...
Thomas, you are talking about a text extraction module, but I cannot quite follow you.

As far as I understand, you want to change the architecture towards modules such as a text extractor, virus scanner or thumbnail builder, so that they can store their results in the data store together with the DataIdentifier?

If the GC deletes an entry in the data store based on its DataIdentifier, will the "near" information also be deleted automatically?

As you wrote, the main problem is that we do not know whether we have already processed a binary. It would be fine if we internally created a DataIdentifier for a binary stream and passed it to modules such as the text extractor, so that they can look up a result that has already been processed and stored.

Is that also what you have in mind? ;-)
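
To make the question concrete, here is a rough Java sketch of how I picture the 'payload storage' and a module that checks it before doing any work. It is only an illustration under assumptions: PayloadStore, CachingTextExtractor and deleteBinary are hypothetical names, not existing Jackrabbit classes; the data identifier is a plain String; everything is kept in memory; CRC32 stands in for whatever checksum would really be used; and the "extraction" is a trivial placeholder.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical payload store: keeps derived data "near" a binary, keyed by its identifier. */
class PayloadStore {
    private static final class Entry {
        final byte[] data;
        final long checksum;
        Entry(byte[] data, long checksum) {
            this.data = data;
            this.checksum = checksum;
        }
    }

    // data identifier -> (payload key -> payload entry)
    private final Map<String, Map<String, Entry>> payloads = new HashMap<>();

    /** Stores a payload for the given binary, together with a checksum. */
    void put(String dataId, String key, InputStream payload) {
        byte[] bytes;
        try {
            bytes = payload.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        payloads.computeIfAbsent(dataId, id -> new HashMap<>())
                .put(key, new Entry(bytes, crc(bytes)));
    }

    /** Returns the payload, or null if nothing was stored for this binary and key. */
    InputStream get(String dataId, String key) {
        Map<String, Entry> byKey = payloads.get(dataId);
        Entry entry = (byKey == null) ? null : byKey.get(key);
        if (entry == null) {
            return null;
        }
        if (crc(entry.data) != entry.checksum) {
            throw new IllegalStateException("payload checksum mismatch for " + dataId);
        }
        return new ByteArrayInputStream(entry.data);
    }

    /** Called when the binary is garbage collected: all of its payloads go with it. */
    void deleteBinary(String dataId) {
        payloads.remove(dataId);
    }

    private static long crc(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }
}

/** Hypothetical module that looks for an earlier result before processing a binary. */
class CachingTextExtractor {
    private final PayloadStore store;
    int extractions = 0; // counts how often the (expensive) extraction really ran

    CachingTextExtractor(PayloadStore store) {
        this.store = store;
    }

    String extract(String dataId, byte[] binary) {
        InputStream cached = store.get(dataId, "text");
        if (cached != null) {
            return readUtf8(cached); // already processed, reuse the stored result
        }
        extractions++;
        // Placeholder for real text extraction (PDF parsing etc.).
        String text = new String(binary, StandardCharsets.UTF_8).trim();
        store.put(dataId, "text",
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        return text;
    }

    private static String readUtf8(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, calling extract twice with the same identifier runs the extraction only once, and deleteBinary removes the extracted text together with the binary's other payloads. Unlike the put(String key, InputStream payload) signature quoted above, the identifier is passed explicitly here; a per-binary store object would work just as well.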