From: Phillip Rhodes
To: users@jackrabbit.apache.org
Date: Wed, 2 May 2007 21:41:04 -0400 (EDT)
Subject: Re: JackRabbit Search Engine
 Questions

I am a newbie, but see my answers below.

----- Original Message -----
From: "Belinda Randolph"
To: users@jackrabbit.apache.org
Sent: Wednesday, May 2, 2007 4:42:23 PM (GMT-0500) America/New_York
Subject: JackRabbit Search Engine Questions

I am in the process of evaluating 10 repository solutions for my project. I have several questions to ask in order to make my decision.

1. Can I replace the JackRabbit search engine with my own?

Yes, you can, but why would you? I migrated to Jackrabbit just so I could retire all my own search code (written in Lucene). Of course, you could also write your own crawler/indexer that accesses the Jackrabbit repository.

2. Does your search engine look through actual document contents - as a background process or at the time of the actual user search?

The document is indexed when it is added to the repository. When the user searches, the search runs against that previously built index, so it is very fast.

3. What FORMATs of actual documents does your search engine look at? (Ascii, Microsoft, PDF, etc.)

All of those formats, and more. You can easily add support for new ones if you like; you will have to set the MIME type on the content you add so that your custom indexer runs against it.

4. When searching the contents of a PDF file, does the background process, using OCR, create an additional file in another format? What format?

When the PDF is added to Jackrabbit, text is extracted from the PDF and added to the search index. No OCR is involved, just text extraction. If the PDF contains only images, no OCR is run on those images.

5. Does your OCR routine search FORMATS other than PDF? If yes, what formats can the OCR search?

There is no OCR technology involved. Rather, the text in a Microsoft Word document, PDF, etc. is extracted from the file by a library that understands the binary file format, so no OCR is necessary.

6. What are the resolution requirements for your OCR routines?

No OCR is involved with Jackrabbit, but keep in mind that it uses libraries that understand MS Word, PDF, and similar formats and can extract the textual content of the files.

7. Can I change the GUI to a) add functionality or error checking and b) to look personalized with CSS?

Jackrabbit does not have a GUI, so you are in total control of it. Some folks (like me) have written application components that make it easy to create GUIs that read, write, and access Jackrabbit content.

8. Can the search engine search using both requested metadata element values and keywords from the document contents?

Yes.

9. Can I start with keywords from the document contents and then later filter the results using user inputted metadata element values?

Yes.

10. Can I start with user input metadata element values and then later filter down the results with document contents?

Yes.

11. After an initial search, can I refine my search by only looking at the results of the previous search?

I don't know.

Thanks,
Belinda
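
On questions 8 to 10, here is a rough sketch of how a keyword constraint and metadata constraints might be combined into a single JCR XPath statement. The class name JcrQuerySketch and its methods are made up for illustration, and the choice of nt:resource with jcr:mimeType as the metadata property is only an assumption about how the documents are stored. It only builds the query string; it does not touch a repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Jackrabbit: builds a JCR 1.0 XPath
// statement that mixes full-text keywords with property (metadata) tests.
public final class JcrQuerySketch {

    // Wrap a value in single quotes, doubling embedded quotes as XPath
    // string literals require. Note: this does not apply the extra
    // escaping rules of the full-text search expression syntax.
    public static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // keywords: words to look for in the indexed document text (may be empty).
    // properties: property name -> required value, e.g. jcr:mimeType.
    public static String build(String keywords, Map<String, String> properties) {
        StringBuilder predicate = new StringBuilder();
        if (keywords != null && !keywords.isEmpty()) {
            // jcr:contains is the full-text function defined by JCR XPath.
            predicate.append("jcr:contains(., ").append(quote(keywords)).append(")");
        }
        for (Map.Entry<String, String> e : properties.entrySet()) {
            if (predicate.length() > 0) {
                predicate.append(" and ");
            }
            predicate.append("@").append(e.getKey())
                     .append(" = ").append(quote(e.getValue()));
        }
        // Assumption: file content lives in nt:resource nodes.
        String base = "//element(*, nt:resource)";
        return predicate.length() == 0 ? base : base + "[" + predicate + "]";
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("jcr:mimeType", "application/pdf");
        // With a live session, the string would typically be passed to
        // QueryManager.createQuery(statement, Query.XPATH).
        System.out.println(build("budget", meta));
    }
}
```

Since both kinds of constraint sit in the same predicate, it should not matter whether the user supplies the keywords or the metadata values first; the application can simply rebuild the statement with whatever constraints it has collected so far.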