From: "Neeraj Gupta"
To: java-user@lucene.apache.org
Subject: Re: checking existing docs before indexing
Date: Fri, 13 Jul 2007 12:15:08 +0530
In-Reply-To: <469651EC.90701@lingway.com>

Yes, you need to store one untokenized field that identifies the exact document you want to update.

You can also check whether such a document exists in your index by using the deleteDocuments() method of IndexReader. It returns the number of documents deleted for the Term provided (note that the matching documents really are deleted by this call).

Cheers,
Neeraj

"Samuel LEMOINE"
07/12/2007 09:38 PM
Please respond to java-user@lucene.apache.org
To: java-user@lucene.apache.org
cc: heba.farouk@yahoo.com
Subject: Re: checking existing docs before indexing

Neeraj Gupta wrote:
> Hi,
>
> You can use the updateDocument() method of IndexWriter to update an
> existing document. It searches for documents matching the Term and, if
> any exist, deletes them. It then adds the provided document to the index,
> whether or not a matching document existed.
>
> Cheers,
> Neeraj
>
> "Heba Farouk"
> 07/12/2007 06:57 PM
> Please respond to java-user@lucene.apache.org, heba.farouk@yahoo.com
>
> To: java-user@lucene.apache.org
> Subject: checking existing docs before indexing
>
> Hello,
> I'm a newbie to the Lucene world and I hope you can help me.
> Is there any option in IndexWriter to check whether a document already
> exists before adding it to the index, or should I maintain that manually?
>
> Thanks in advance.
>
> Yours,
> Heba
>
> The information contained in this e-mail and any accompanying documents
> may contain information that is confidential or otherwise protected from
> disclosure. If you are not the intended recipient of this message, or if
> this message has been addressed to you in error, please immediately alert
> the sender by reply e-mail and then delete this message, including any
> attachments. Any dissemination, distribution or other use of the contents
> of this message by anyone other than the intended recipient is strictly
> prohibited.

I also used updateDocument() for this, but I ran into the issue that it takes a Term as its argument, so documents other than the intended one may be deleted by this method if they match that term. To avoid this, my conclusion was to store a stored, untokenized field that serves as a key identifying exactly one document, each document being identified by a string that distinguishes it from all others (such as a URL or file path).

Sam

PS: Here is the sample code I wrote during my internship; it is quite simple to grasp (I removed the original comments, as they were in French). The method that may interest you is addDocument(String). Hope it helps.

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    // The original post does not show its imports; log4j is assumed for Logger.
    import org.apache.log4j.Logger;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class Indexer {

        private static final Logger theLogger = Logger.getLogger(Indexer.class);

        private Analyzer theAnalyzer;
        private IndexWriter theIndexWriter;
        private Reader theReaderContent;
        private String theIndexPath;

        public Indexer(String anIndexPath) {
            theAnalyzer = new StandardAnalyzer();
            theIndexPath = anIndexPath;
        }

        // Indexes one file, replacing any document already stored under the same path.
        public void addDocument(String aFileName) {
            try {
                theIndexWriter = new IndexWriter(theIndexPath, theAnalyzer);
                try {
                    theReaderContent = new FileReader(aFileName);
                    TokenStream tokenStreamContent = new StandardTokenizer(theReaderContent);

                    Document doc = new Document();
                    // Stored, untokenized key: the whole path is indexed as a single term.
                    Field docPath = new Field("path", aFileName, Field.Store.YES,
                            Field.Index.UN_TOKENIZED);
                    Field docContent = new Field("content", tokenStreamContent);
                    doc.add(docPath);
                    doc.add(docContent);

                    // Deletes every document whose "path" term equals aFileName, then adds doc.
                    theIndexWriter.updateDocument(new Term("path", aFileName), doc);
                } finally {
                    // Close the writer even if reading or indexing the file failed.
                    theIndexWriter.close();
                }
            } catch (IOException e) {
                theLogger.error(e);
            }
        }

        // Despite its name, this optimizes the index; it does not sort anything.
        public void sort() {
            try {
                theIndexWriter = new IndexWriter(theIndexPath, theAnalyzer);
                try {
                    theIndexWriter.optimize();
                } finally {
                    theIndexWriter.close();
                }
            } catch (IOException e) {
                theLogger.error(e);
            }
        }

        public void addAllDocuments(String aDirectoryPath) {
            File directory = new File(aDirectoryPath);
            File[] subDirectory = directory.listFiles();
            if (subDirectory == null) {
                // listFiles() returns null when the path is not a readable directory.
                theLogger.error(aDirectoryPath + " is not a readable directory");
                return;
            }
            for (File file : subDirectory) {
                addDocument(file.getPath());
            }
            System.out.println(subDirectory.length + " files have been processed.");
            this.sort();
        }
    }
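The point made in this thread, that updateDocument() deletes by Term and therefore needs a key field whose whole value is a single term, can be illustrated with a small sketch. This is a plain-Java toy model and not Lucene code: ToyIndex, tokenized(), untokenized(), update() and count() are hypothetical names invented for the illustration, and the tokenizer is only a crude stand-in for an analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Toy stand-in for an index (NOT Lucene): each document is reduced to the
// set of terms that would be indexed for its "path" field.
class ToyIndex {
    private final List<Set<String>> docs = new ArrayList<Set<String>>();

    // Tokenized field: the value is split into several terms.
    static Set<String> tokenized(String value) {
        Set<String> terms = new HashSet<String>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Untokenized field: the whole value is kept as one term.
    static Set<String> untokenized(String value) {
        return new HashSet<String>(Arrays.asList(value));
    }

    // Read-only check: how many documents contain this term.
    int count(String term) {
        int n = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Mimics delete-by-term followed by add; returns the number of documents deleted.
    int update(String term, Set<String> newDocTerms) {
        int deleted = 0;
        for (Iterator<Set<String>> it = docs.iterator(); it.hasNext();) {
            if (it.next().contains(term)) {
                it.remove();
                deleted++;
            }
        }
        docs.add(newDocTerms);
        return deleted;
    }

    int size() {
        return docs.size();
    }
}

class KeyFieldDemo {
    public static void main(String[] args) {
        // Tokenized key: the full path is never a term, so re-indexing adds a duplicate.
        ToyIndex tok = new ToyIndex();
        tok.update("docs/a.txt", ToyIndex.tokenized("docs/a.txt"));
        tok.update("docs/b.txt", ToyIndex.tokenized("docs/b.txt"));
        tok.update("docs/a.txt", ToyIndex.tokenized("docs/a.txt"));
        System.out.println("tokenized key, documents after re-index: " + tok.size());

        // Untokenized key: the full path is the only term, so exactly one document is replaced.
        ToyIndex untok = new ToyIndex();
        untok.update("docs/a.txt", ToyIndex.untokenized("docs/a.txt"));
        untok.update("docs/b.txt", ToyIndex.untokenized("docs/b.txt"));
        untok.update("docs/a.txt", ToyIndex.untokenized("docs/a.txt"));
        System.out.println("untokenized key, documents after re-index: " + untok.size());
    }
}
```

In the toy model, count() answers the "does it already exist?" question without deleting anything. In Lucene itself, IndexReader.docFreq(Term) is a comparable read-only lookup, whereas deleteDocuments() removes what it finds; check the javadoc of your version for how each treats already-deleted documents.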