lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Indexing only newly created files
Date Mon, 03 May 2010 07:32:56 GMT
Hey there,

you might have to implement a some kind of unique identifier using an
indexed lucene field. When you are indexing you should fire a query with the
uuid of your document (maybe the path to you pdf document) and check if the
document is in the index already. You could also do a boolean query
combining UUID, timestamp and / or a hash value to see if the document has
been changed. if so you can simply update the document by its UUID
(something like indexwriter.updateDocument(new Term("uuid",
value),document);)

Unfortunately you have to implement this yourself but it should not be that
much of a deal.

simon

On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
vijay.raghavan08@gmail.com> wrote:

> Dear all,
> I am using lucene 3.0 to index the pdf reports that I generate
> dynamically. I index the pdf file name (without extension), file path
> and its absolute path as fields. I search with the file name without
> extension; it retrieves a list, as usually 2 or more files are present
> in the same name in different sub directories. As I create the index
> for the first time it updates, assuming 100 pdf files in different
> directories, the files meta info. If again I do indexing, while my
> report generator scheduler has the produced 500 more pdf files
> totaling to 600 files in different directories, I wish to index only
> the new files to the index. But presently it’s doing the whole thing
> again (600 files). How to implement this functionality? Think of the
> thousands of pdf files created on each run.
>
> P.S: I cannot keep the meta-info of generated pdf files in the java
> memory, as it exceeds thousands in a single run, and update the index
> looping this list.
>
> new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer(
>                                        Version.LUCENE_CURRENT), true,
>                                        IndexWriter.MaxFieldLength.LIMITED);
>
> is the boolean parameter is for this purpose? Please guide me.
>
> --
> Thanks
> Vijay Veeraraghavan
>
>
>
> --
> Thanks & Regards
> Vijay Veeraraghavan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message