lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vijay Veeraraghavan <>
Subject Using IndexReader in the web environment
Date Mon, 03 May 2010 10:21:12 GMT
Hi all,

In a clustered environment I search the index from the web
application. In the web application I am creating IndexReader on each
request. is it expensive to do like this? I read somewhere in the web
that try using the same reader as much as possible. Can i keep the
initially created IndexReader in the session/application scopes and
use the same for each request? Any other idea?


On 5/3/10, Vijay Veeraraghavan <> wrote:
> dear all,
> as replied below, does searching again for the document in the index
> and if found skip the indexing else index it, is this not similar to
> indexing all pdf documents once again, is not this overhead? As I am
> not going to index the details of the pdf (so if an indexed pdf was
> recreated i need not reindex it) but just the paths of the documents.
> Vijay
>>> Hey there,
>>> you might have to implement a some kind of unique identifier using an
>>> indexed lucene field. When you are indexing you should fire a query with
>>> the
>>> uuid of your document (maybe the path to you pdf document) and check if
>>> the
>>> document is in the index already. You could also do a boolean query
>>> combining UUID, timestamp and / or a hash value to see if the document
>>> has
>>> been changed. if so you can simply update the document by its UUID
>>> (something like indexwriter.updateDocument(new Term("uuid",
>>> value),document);)
>>> Unfortunately you have to implement this yourself but it should not be
>>> that
>>> much of a deal.
>>> simon
>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>>> wrote:
>>>> Dear all,
>>>> I am using lucene 3.0 to index the pdf reports that I generate
>>>> dynamically. I index the pdf file name (without extension), file path
>>>> and its absolute path as fields. I search with the file name without
>>>> extension; it retrieves a list, as usually 2 or more files are present
>>>> in the same name in different sub directories. As I create the index
>>>> for the first time it updates, assuming 100 pdf files in different
>>>> directories, the files meta info. If again I do indexing, while my
>>>> report generator scheduler has the produced 500 more pdf files
>>>> totaling to 600 files in different directories, I wish to index only
>>>> the new files to the index. But presently it’s doing the whole thing
>>>> again (600 files). How to implement this functionality? Think of the
>>>> thousands of pdf files created on each run.
>>>> P.S: I cannot keep the meta-info of generated pdf files in the java
>>>> memory, as it exceeds thousands in a single run, and update the index
>>>> looping this list.
>>>> new IndexWriter(, new StandardAnalyzer(
>>>>                                        Version.LUCENE_CURRENT), true,
>>>> IndexWriter.MaxFieldLength.LIMITED);
>>>> is the boolean parameter is for this purpose? Please guide me.
>>>> --
>>>> Thanks
>>>> Vijay Veeraraghavan
>>>> --
>>>> Thanks & Regards
>>>> Vijay Veeraraghavan
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>> --
>> Thanks & Regards
>> Vijay Veeraraghavan
> --
> Thanks & Regards
> Vijay Veeraraghavan

Thanks & Regards
Vijay Veeraraghavan

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message