hadoop-common-user mailing list archives

From Hrishikesh Agashe <hrishikesh_aga...@persistent.co.in>
Subject Lucene + Hadoop
Date Tue, 10 Nov 2009 09:56:33 GMT

I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based
on the contents of the files (i.e. if the author is "hrishikesh", the file should be added to an
index for "hrishikesh"; there has to be a separate index for every author). For this, I keep one
IndexWriter open per author and maintain them in a HashMap in the map() function. I parse each
incoming file, and if its author is one for which I have already opened an IndexWriter, I
just add the file to that index; otherwise I create a new IndexWriter for the new author. As authors
might run into the thousands, I close all open IndexWriters and clear the HashMap once it reaches a
certain threshold, and then start all over again. There is no reduce function.
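
Roughly, the bookkeeping looks like the sketch below. It is a simplified illustration, not the actual job code: the names AuthorWriterPool, Writer, Opener and maxOpen are made up for this example, and Writer is only a stand-in for Lucene's IndexWriter so that the sketch runs without Lucene or Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: keeps at most maxOpen per-author writers open at once.
class AuthorWriterPool {

    // Stand-in for Lucene's IndexWriter (assumption made for this sketch).
    interface Writer {
        void add(String document);
        void close();
    }

    // Creates (or re-opens) the writer for one author. In a real job this
    // would have to open the author's existing index for appending rather
    // than overwrite it, because an author can reappear after a flush.
    interface Opener {
        Writer open(String author);
    }

    private final Map<String, Writer> open = new HashMap<>();
    private final int maxOpen;
    private final Opener opener;

    AuthorWriterPool(int maxOpen, Opener opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
    }

    // Called once per parsed input file from map().
    void add(String author, String document) {
        Writer writer = open.get(author);
        if (writer == null) {
            // Threshold reached: close everything and start over.
            if (open.size() >= maxOpen) {
                closeAll();
            }
            writer = opener.open(author);
            open.put(author, writer);
        }
        writer.add(document);
    }

    // Closes every open writer and clears the map.
    void closeAll() {
        for (Writer writer : open.values()) {
            writer.close();
        }
        open.clear();
    }

    int openCount() {
        return open.size();
    }
}
```

In this sketch closeAll() would also have to be called once more when the map task finishes, otherwise the writers still held in the map are never closed.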

Does this logic sound correct? Is there any other way of implementing this requirement?


