lucene-java-user mailing list archives

From Anuj Bhatt <anuj.bh...@gmail.com>
Subject Re: Doc IDs via IndexReader?
Date Thu, 23 Jul 2009 19:45:30 GMT
Hi,

Thanks Shai and Mike for your suggestions. I went with Shai's second
approach. However, I'm confronted with this now:

After deleting a document from the index, I also delete it from a
copy of the directory that contained the original documents. I
expected that afterwards neither the directory nor the index would
contain the document. To check this, I take each document remaining
in the updated directory, convert it to a query, send the query to
the index via IndexSearcher, and examine the hits. For some reason,
the hits include a document which I had deleted from the index (via
IndexReader). Is there any valid explanation for this? How can I be
assured that the index will not contain that document? Here's the
code snippet I am experimenting with (hopefully things are
self-explanatory):


        System.out.println("Documents which are in the whitelist : " + docsEncounteredNames.toString());
        IndexReader reader = IndexReader.open(indexDir);

        for (int doc_itr = 0; doc_itr < reader.maxDoc(); doc_itr++)
        {
            if (docsEncountered.contains(doc_itr))
            {
                // skip if I encountered this document
                continue;
            }
            else if (!reader.isDeleted(doc_itr))
            {
                System.out.println("Deleting document with name: " + reader.document(doc_itr).get("filename"));
                File docToDelete = new File(orgDocsDir + "/" + reader.document(doc_itr).get("filename"));
                reader.deleteDocument(doc_itr);
                System.out.println("Also deleting original document " + docToDelete.getCanonicalPath());
                docToDelete.delete();
            }
        }
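
The selection logic in the loop above can be checked separately from
Lucene. What follows is only a minimal sketch under stated assumptions:
the whitelist is a Set<Integer> of doc IDs, a java.util.BitSet stands in
for reader.isDeleted(), and the class and method names (DocsToDelete,
idsToDelete) are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DocsToDelete {

    // Returns the doc IDs in [0, maxDoc) that are neither whitelisted
    // nor already marked as deleted. The BitSet is a stand-in for
    // IndexReader.isDeleted(); this method does not touch any index.
    public static List<Integer> idsToDelete(int maxDoc, Set<Integer> keep, BitSet alreadyDeleted) {
        List<Integer> result = new ArrayList<Integer>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (keep.contains(docId)) {
                continue; // whitelisted: some query returned this doc
            }
            if (alreadyDeleted.get(docId)) {
                continue; // nothing left to delete for this ID
            }
            result.add(docId);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> keep = new HashSet<Integer>(Arrays.asList(0, 2));
        BitSet alreadyDeleted = new BitSet();
        alreadyDeleted.set(3);
        System.out.println(idsToDelete(5, keep, alreadyDeleted)); // prints [1, 4]
    }
}
```

If this list matches what the loop prints, the selection itself is
probably fine. One assumption worth checking, which nothing in this
thread confirms: in Lucene 2.4.x, deletions made through an IndexReader
are written to the index only when that reader is closed, and an
IndexSearcher opened before that keeps searching its older view of the
index.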

Best,
Anuj


On Thu, Jul 23, 2009 at 6:24 AM, Michael
McCandless<lucene@mikemccandless.com> wrote:
> I think you could also delete by Query (using IndexWriter), concocting
> a single large query that's something like MatchAllDocsQuery AND NOT
> (Q1 OR Q2 OR Q3...) where Q1, Q2, Q3 are the queries that identify the
> docs you want to keep.
>
> Mike
>
> On Wed, Jul 22, 2009 at 10:58 PM, Anuj Bhatt<anuj.bhatt@gmail.com> wrote:
>> Hi,
>>
>> I'm relatively new to Lucene. I have the following case: I have
>> indexed a bunch of documents. I then query the index using
>> IndexSearcher and retrieve the documents using Hits (I do know this is
>> deprecated -- I'm using v 2.4.1). I do this for a set of queries
>> and keep track of which documents are returned for each one. At the
>> end of it all, I have a list of documents (more specifically, the
>> hits.id(some_iterator_int) associated with each doc). Now, I wish to
>> delete from the index the documents which have not been returned for
>> any query. How can I do this?
>>
>> My initial assumption was that I could retrieve all the doc ids from
>> IndexReader and traverse them: if an id is in the list that I have
>> maintained, I don't delete it; otherwise I do. Looking around
>> didn't yield anything, hence the mail.
>>
>>
>> Any suggestions?
>>
>>
>> Regards,
>> Anuj
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


