lucene-java-user mailing list archives

From Vijay Veeraraghavan <vijay.raghava...@gmail.com>
Subject Re: Indexing only newly created files
Date Mon, 03 May 2010 08:02:25 GMT
Dear Simon,
Thanks for your reply; I found it very useful.
I have another question. I create the index in a clustered environment (2
physical systems and 2 virtual). The index will be created on a system
shared among the nodes. The scheduler runs on another remote system, which
creates the PDFs and saves them to a remote file server. Is it advisable to
create a local index and add it to the main index shared by the nodes? Or
should I create a local index and copy it to the nodes?

Thanks
Vijay

On 5/3/10, Simon Willnauer <simon.willnauer@googlemail.com> wrote:
> Hey there,
>
> you might have to implement some kind of unique identifier using an
> indexed Lucene field. When you are indexing, you should fire a query with the
> UUID of your document (maybe the path to your PDF document) and check if the
> document is in the index already. You could also do a boolean query
> combining UUID, timestamp and/or a hash value to see if the document has
> been changed. If so, you can simply update the document by its UUID
> (something like indexWriter.updateDocument(new Term("uuid",
> value), document);)
>
> Unfortunately you have to implement this yourself, but it should not be that
> much of a deal.
>
> simon
>
> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
> vijay.raghavan08@gmail.com> wrote:
>
>> Dear all,
>> I am using Lucene 3.0 to index the PDF reports that I generate
>> dynamically. I index the PDF file name (without extension), file path
>> and its absolute path as fields. I search by the file name without
>> extension; this retrieves a list, as usually 2 or more files with the
>> same name are present in different subdirectories. When I create the
>> index for the first time, assuming 100 PDF files in different
>> directories, it stores the meta info of those files. If I index again
>> after my report generator scheduler has produced 500 more PDF files,
>> totaling 600 files in different directories, I wish to add only
>> the new files to the index. But presently it's doing the whole thing
>> again (600 files). How can I implement this functionality? Think of the
>> thousands of PDF files created on each run.
>>
>> P.S.: I cannot keep the meta info of the generated PDF files in Java
>> memory and update the index by looping over this list, as it exceeds
>> thousands of entries in a single run.
>>
>> new IndexWriter(FSDirectory.open(this.indexDir),
>>                 new StandardAnalyzer(Version.LUCENE_CURRENT), true,
>>                 IndexWriter.MaxFieldLength.LIMITED);
>>
>> Is the boolean parameter for this purpose? Please guide me.
>>
>> --
>> Thanks
>> Vijay Veeraraghavan
>>
>>
>>
>> --
>> Thanks & Regards
>> Vijay Veeraraghavan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
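
The check Simon describes in the quoted mail (look the document up by its UUID, then compare a hash value to see whether it changed) might be sketched roughly as below. This is only a minimal illustration in plain Java, not Lucene code: the names ChangeDetector, Action, decide and sha256Hex are hypothetical, and it assumes the stored hash is read back from a field of the document found under the "uuid" term, with null meaning no such document exists yet.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: decides what to do with one PDF, given the hash
// stored in the index for its UUID and the hash of the file on disk.
final class ChangeDetector {

    enum Action { ADD, UPDATE, SKIP }

    // Hex-encoded SHA-256 of the file content; the hash choice is an assumption.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // storedHash is null when no document with this UUID is in the index.
    static Action decide(String storedHash, String currentHash) {
        if (storedHash == null) {
            return Action.ADD;      // new file: index it
        }
        if (storedHash.equals(currentHash)) {
            return Action.SKIP;     // unchanged: leave the index alone
        }
        return Action.UPDATE;       // changed: replace the old document
    }
}
```

With something like this, only one file is examined at a time, so the meta info of all generated PDFs never has to sit in memory. ADD would map to indexWriter.addDocument(...) and UPDATE to the indexWriter.updateDocument(...) call mentioned in the quoted mail.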


-- 
Thanks & Regards
Vijay Veeraraghavan


