From: Ian Lea
Date: Wed, 7 Mar 2012 11:26:20 +0000
Subject: Re: Help on DOCX and XLSX
To: java-user@lucene.apache.org

So you want to index different fields and search on those fields, and
are asking whether you can do that in Lucene? The answer is yes.

I still think you should look at Solr, but if you are determined to use
Lucene, get hold of a copy of the second edition of Lucene in Action:
http://www.manning.com/hatcher3/

--
Ian.

On Wed, Mar 7, 2012 at 11:13 AM, Prasad KVSH wrote:
> Hi Ian,
>
> Thanks for your quick reply.
>
> Our documents will have the following common key information:
>
> 1. Document Type ID
> 2. Document Date
> 3. Document Author ID
> 4. Document Status
> 5. Document Group ID
>
> While creating the index, we would like to add the above key values
> alongside the content index, so that a search on Document Type ID or
> date range does not have to read the entire index. Can we implement
> this approach?
>
> Currently the text search is performed on the index, and then we
> filter the documents by reading each document's record from a database
> table for the above key values.
>
> Thanks
> Prasad
>
> -----Original Message-----
> From: Ian Lea [mailto:ian.lea@gmail.com]
> Sent: Wednesday, March 07, 2012 4:03 PM
> To: java-user@lucene.apache.org
> Subject: Re: Help on DOCX and XLSX
>
> You'll have to find something that parses the formats you are interested
> in and extracts the text you want. Apache Tika comes to mind.
>
> Why are you using such an old version of Lucene? Why aren't you using
> Solr? That might just work for you out of the box. See also
> http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika
>
> As for the size, I wouldn't worry about it. Disk space is cheap. If
> you really do care, scan the FAQ at
> http://wiki.apache.org/lucene-java/LuceneFAQ. Lots of useful info on
> all sorts of things.
>
> --
> Ian.
>
> On Wed, Mar 7, 2012 at 9:40 AM, Prasad KVSH wrote:
>> Dear All,
>>
>> We started using Lucene version 3.0.3. We have different types of
>> documents (PDF, XLS, XLSX, DOC, DOCX, TXT, etc.) in a specified
>> folder.
>>
>> We have created an index on these files (using IndexFiles.java). The
>> index took 17.2 MB for 69.4 MB of documents. It was created using
>> StandardAnalyzer with limited index fields. We are able to search for
>> a given text in PDF (text content only), *.doc and *.xls (MS Word
>> 1997-2003) files only.
>>
>> Now I need help with indexing .docx and .xlsx files. How can I run
>> indexing on these files?
>> These files are ignored when we do a string
>> search.
>>
>> The writer is defined as below:
>>
>> IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new
>> StandardAnalyzer(Version.LUCENE_CURRENT), true,
>> IndexWriter.MaxFieldLength.LIMITED);
>>
>> Another question is about the size of the index folder: can we
>> optimize the size?
>>
>> Thanks
>>
>> Prasad

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
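
On the .docx part of the original question: the thread points to Apache
Tika for extracting text. As a rough, dependency-free illustration of
what such a parser has to do (this is not Tika's API), note that a .docx
file is a ZIP package whose body text sits in <w:t> elements inside the
word/document.xml part. The sketch below uses only JDK classes to pull
that text out so it could be handed to an indexer. The class name
DocxTextSketch and the helper tinyDocx are made up for this example.
Headers, footers and footnotes (stored in other parts), tabs and explicit
line breaks are ignored, and .xlsx would need a similar pass over its own
XML parts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Illustrative only: extracts body text from a .docx package using JDK classes. */
final class DocxTextSketch {

    // WordprocessingML main namespace (the "w:" prefix in word/document.xml).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text, one line per paragraph, or "" if the main part is missing. */
    static String extractText(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    return readParagraphs(zip);
                }
            }
            return "";
        } catch (Exception ex) {
            // Wrapped as unchecked to keep the sketch easy to call.
            throw new IllegalStateException("Could not read .docx package", ex);
        }
    }

    /** Streams through the XML, keeping text inside w:t and ending a line at each w:p. */
    private static String readParagraphs(InputStream xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected here
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        StringBuilder out = new StringBuilder();
        boolean inText = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.CHARACTERS) {
                if (inText) {
                    out.append(reader.getText());
                }
            } else if (event == XMLStreamConstants.START_ELEMENT
                    || event == XMLStreamConstants.END_ELEMENT) {
                if (!W_NS.equals(reader.getNamespaceURI())) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("t".equals(name)) {
                    inText = (event == XMLStreamConstants.START_ELEMENT);
                } else if ("p".equals(name) && event == XMLStreamConstants.END_ELEMENT) {
                    out.append('\n');
                }
            }
        }
        return out.toString();
    }

    /**
     * Builds a tiny in-memory package holding only word/document.xml.
     * Demo input only (not a complete Word file); paragraphs must not contain XML markup.
     */
    static byte[] tinyDocx(String... paragraphs) {
        StringBuilder xml = new StringBuilder("<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = tinyDocx("Quarterly report", "Prepared by finance");
        System.out.print(extractText(new ByteArrayInputStream(pkg)));
    }
}
```

For a real file, extractText would be called with a FileInputStream on
the .docx, and the returned string passed to the indexer in place of the
raw file contents. For production use, a maintained parser such as Tika
is likely the better choice, as suggested in the thread.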