Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (herse.apache.org: local policy)
In-Reply-To: <469651EC.90701@lingway.com>
To: java-user@lucene.apache.org
Subject: Re: checking existing docs before indexing
MIME-Version: 1.0
Message-ID: 
 <OF358B6B1E.63D4FDCE-ON65257317.0024C442-65257317.0024F60F@hewitt.com>
From: "Neeraj Gupta" <neeraj.gupta.2@hewitt.com>
Date: Fri, 13 Jul 2007 12:15:08 +0530
Content-Type: multipart/alternative;
 boundary="=_alternative 0024F60B65257317_="

--=_alternative 0024F60B65257317_=
Content-Type: text/plain;
 charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Yes, you need to store one untokenized field that identifies the
exact document you want to update.

You can also check whether such a document exists in your index by
using the deleteDocuments() method of IndexReader, which returns the number
of documents deleted for the Term provided. Note that this removes the
matching documents, so it only suits the case where you re-add the
document afterwards.
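
As a rough illustration of why a delete-based existence check differs from a count-only one, here is a toy in-memory model in plain Java. It is not the Lucene API: TinyIndex, countMatches() and deleteMatches() are hypothetical names invented for this sketch, and documents are reduced to a single "path" field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for an index; NOT the Lucene API.
class TinyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Adds a document that carries only a "path" field.
    void add(String path) {
        Map<String, String> doc = new HashMap<>();
        doc.put("path", path);
        docs.add(doc);
    }

    // Non-destructive check: counts documents whose field equals value.
    int countMatches(String field, String value) {
        int n = 0;
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(field))) {
                n++;
            }
        }
        return n;
    }

    // Destructive check: removes the matches and returns how many were removed.
    int deleteMatches(String field, String value) {
        int before = docs.size();
        docs.removeIf(doc -> value.equals(doc.get(field)));
        return before - docs.size();
    }

    int size() {
        return docs.size();
    }
}

class ExistenceCheckSketch {
    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("/docs/a.txt");
        System.out.println(index.countMatches("path", "/docs/a.txt"));  // 1, document still indexed
        System.out.println(index.deleteMatches("path", "/docs/a.txt")); // 1, document is now gone
        System.out.println(index.size());                               // 0
    }
}
```

Both calls answer "does it exist?", but only the counting variant leaves the index unchanged.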

Cheers,
Neeraj


"Samuel LEMOINE" <samuel.lemoine@lingway.com> wrote on 07/12/2007 09:38 PM
(to java-user@lucene.apache.org, cc heba.farouk@yahoo.com,
subject "Re: checking existing docs before indexing"):

Neeraj Gupta wrote:
> Hi,
>
> You can use the updateDocument() method of IndexWriter to update any
> existing document. It searches for documents matching the Term and
> deletes any that exist. It then adds the provided document to the
> index, whether or not a matching document existed.
>
> Cheers,
> Neeraj
>
> "Heba Farouk" <heba.farouk@yahoo.com> wrote on 07/12/2007 06:57 PM
> (to java-user@lucene.apache.org, subject "checking existing docs
> before indexing"):
>
> Hello,
> I'm a newbie to the Lucene world and I hope that you can help me.
> Is there any option in IndexWriter to check whether a document
> already exists before adding it to the index, or should I maintain
> that manually?
>
> Thanks in advance.
>
> Yours,
> Heba
>
> The information contained in this e-mail and any accompanying documents
> may contain information that is confidential or otherwise protected from
> disclosure. If you are not the intended recipient of this message, or if
> this message has been addressed to you in error, please immediately alert
> the sender by reply e-mail and then delete this message, including any
> attachments. Any dissemination, distribution or other use of the contents
> of this message by anyone other than the intended recipient
> is strictly prohibited.
>
I also used updateDocument() for this, but I ran into the issue that it
takes a Term as its argument, so it may delete other documents that match
the same term. To avoid this, my solution was to give each document a
stored, untokenized field that serves as a key identifying that document
alone: a string that distinguishes it from all others (such as a URL or
file path).
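
The point about key uniqueness can be sketched with a small in-memory model of "delete every match, then add". This is plain Java rather than Lucene: UpdateSketch, doc() and update() are hypothetical names used only for illustration, and the "type" field is an invented example of a value shared by several documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "delete every match, then add"; NOT the Lucene API.
class UpdateSketch {
    final List<Map<String, String>> docs = new ArrayList<>();

    // Builds a document with a unique "path" and a possibly shared "type".
    static Map<String, String> doc(String path, String type) {
        Map<String, String> d = new HashMap<>();
        d.put("path", path);
        d.put("type", type);
        return d;
    }

    // Removes all documents whose field equals value, then adds newDoc.
    // Returns the number of documents that were removed.
    int update(String field, String value, Map<String, String> newDoc) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        int removed = before - docs.size();
        docs.add(newDoc);
        return removed;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.docs.add(doc("/a.txt", "report"));
        index.docs.add(doc("/b.txt", "report"));

        // Unique key: only the old copy of /a.txt is replaced.
        System.out.println(index.update("path", "/a.txt", doc("/a.txt", "report"))); // 1
        System.out.println(index.docs.size());                                       // 2

        // Shared value: both reports are removed, leaving only the new document.
        System.out.println(index.update("type", "report", doc("/c.txt", "report"))); // 2
        System.out.println(index.docs.size());                                       // 1
    }
}
```

Updating by a value that several documents share silently drops all of them, which is why the key field should hold something unique per document.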

Sam


PS: Here is the sample code I wrote during my internship; it is quite
simple to grasp.
(I removed the original comments, as they were in French.)
The method that may interest you is addDocument(String).
I hope this helps.

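import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.log4j.Logger; // assumption: the Logger used here looks like log4j's
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class Indexer {

    private static final Logger theLogger = Logger.getLogger(Indexer.class);

    private final Analyzer theAnalyzer;
    private final String theIndexPath;

    public Indexer(String anIndexPath) {
        theAnalyzer = new StandardAnalyzer();
        theIndexPath = anIndexPath;
    }

    // Indexes one file, replacing any document already stored under the same path.
    public void addDocument(String aFileName) {
        IndexWriter writer = null;
        Reader content = null;
        try {
            writer = new IndexWriter(theIndexPath, theAnalyzer);
            content = new FileReader(aFileName);

            Document doc = new Document();
            TokenStream tokenStreamContent = new StandardTokenizer(content);
            // Stored, untokenized path: the key that identifies this document.
            Field docPath = new Field("path", aFileName, Field.Store.YES,
                    Field.Index.UN_TOKENIZED);
            Field docContent = new Field("content", tokenStreamContent);
            doc.add(docPath);
            doc.add(docContent);

            // Deletes every document whose "path" term equals aFileName, then adds doc.
            writer.updateDocument(new Term("path", aFileName), doc);
        } catch (IOException e) {
            // Also covers FileNotFoundException thrown by FileReader.
            theLogger.error(e);
        } finally {
            close(writer, content);
        }
    }

    // Optimizes the index by calling IndexWriter.optimize().
    public void sort() {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(theIndexPath, theAnalyzer);
            writer.optimize();
        } catch (IOException e) {
            theLogger.error(e);
        } finally {
            close(writer, null);
        }
    }

    // Passes every entry of the directory to addDocument(), then optimizes the index.
    public void addAllDocuments(String aDirectoryPath) {
        File[] files = new File(aDirectoryPath).listFiles();
        if (files == null) { // not a directory, or it could not be read
            theLogger.error("Cannot list files in " + aDirectoryPath);
            return;
        }
        for (File file : files) {
            addDocument(file.getPath());
        }
        System.out.println(files.length + " files have been indexed.");
        this.sort();
    }

    // Closes the writer and the reader, logging (not propagating) failures.
    private void close(IndexWriter aWriter, Reader aReader) {
        try {
            if (aWriter != null) {
                aWriter.close();
            }
        } catch (IOException e) {
            theLogger.error(e);
        }
        try {
            if (aReader != null) {
                aReader.close();
            }
        } catch (IOException e) {
            theLogger.error(e);
        }
    }
}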






--=_alternative 0024F60B65257317_=--