From: Erik Hatcher
To: general@lucene.apache.org
Subject: Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Date: Sun, 14 Dec 2008 09:55:48 -0500

The trunk of Solr with the new ExtractingRequestHandler (Tika) will
surely be the easiest way to get rolling. A simple script that recurses
your folders and issues a simple request posting each file in turn to
Solr will give you a full-text searchable index in no time (well, OK,
it'll take a little time, but it'll be as fast as anything else out
there).

	Erik

On Dec 14, 2008, at 9:15 AM, Veselin Kantsev wrote:

> Hello,
> first of all, thanks for these great projects.
> I discovered Lucene and its subprojects a day ago and all these seem
> amazing.
>
> My goal:
> --------
> A file server with numerous folders containing documents
> (pdf, doc, txt etc.)
> that need to be indexed and searchable via a web interface or similar.
> The number of files might be from 500 000 to 1 000 000 or so.
> Ideally the solution would be capable of handling a lot more than
> that, in case of future growth.
>
> My question:
> ------------
> Which of the projects (Lucene, Solr, Nutch) will be most suitable in
> my case?
>
> Thank you very much.
>
> --
> Veselin K
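
The script described in the reply (recurse the folders, post each file
in turn to Solr) might look roughly like the sketch below. It is not
from the original message. The endpoint URL, the `literal.id` parameter
used to set each document's id, the raw-body POST, and the extension
list are all assumptions that depend on the Solr version and
configuration in use; the function names are made up for illustration.

```python
import os
import urllib.parse
import urllib.request

# Assumed endpoint for the extracting handler; adjust host, port and path
# to match your own Solr configuration.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Assumed set of file types worth sending to Solr.
EXTENSIONS = {".pdf", ".doc", ".txt"}


def iter_documents(root, extensions=EXTENSIONS):
    """Yield paths of files under root whose extension is in extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in extensions:
                yield os.path.join(dirpath, name)


def build_extract_url(doc_id, base_url=SOLR_EXTRACT_URL, commit=False):
    """Build the request URL, passing the document id as a query parameter.

    The parameter name `literal.id` is an assumption; check the handler
    documentation for the version you are running.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urllib.parse.urlencode(params)


def post_file(path, base_url=SOLR_EXTRACT_URL):
    """POST one file's raw bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        build_extract_url(path, base_url),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_tree(root, base_url=SOLR_EXTRACT_URL):
    """Post every matching file under root; return how many were sent."""
    count = 0
    for path in iter_documents(root):
        post_file(path, base_url)
        count += 1
    return count
```

Calling `index_tree("/srv/files")` (a hypothetical path) would walk the
tree and send each matching file. The sketch uses the file path as the
document id and never issues a commit, so a commit would still be needed
before the documents become searchable. For a collection of several
hundred thousand files, error handling and retries would also be worth
adding.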