lucene-java-user mailing list archives

From Shaun Senecal <shaun.sene...@lithium.com>
Subject manually merging Directories
Date Tue, 23 Dec 2014 22:55:20 GMT
Hi

I have a number of Directories stored in various paths on HDFS, and I would like
to merge them into a single index.  The obvious way to do this is to use IndexWriter.addIndexes(...);
however, I'm hoping I can do better.  Since I created each of the separate indexes using
Map/Reduce, I know that there are no deleted or duplicate documents and that the codecs are the
same.  Using addIndexes(...) will incur a lot of I/O as it copies from the source Directory
into the dest Directory, and this is the bit I would like to avoid.  Would it be possible
to simply move each of the segments from each path into a single path on HDFS using a mv/rename
operation instead?  Obviously I would need to take care of the naming to ensure the files
from one index don't overwrite another's, but it looks like naming is done with a counter of
some sort so that the latest segment can be found.  A potential complication is the segments_1
file, as I'm not sure what it is for or whether I can easily (re)construct it externally.

The end goal here is to index using Map/Reduce and then spit out a single index in the end
that has been merged down to a single segment, and to minimize I/O while doing it.  Once I
have the completed index in a single Directory, I can (optionally) perform the forced merge
(which will incur a huge I/O hit).  If the forced merge isn't performed on HDFS, it could be
done on the search nodes before the active searcher is switched.  This may be better if, for
example, you know all of your search nodes have SSDs and I/O to spare.

Just in case my explanation above wasn't clear enough, here is a picture:

What I have:

/user/username/MR_output/0
  _0.fdt
  _0.fdx
  _0.fnm
  _0.si
  ...
  segments_1

/user/username/MR_output/1
  _0.fdt
  _0.fdx
  _0.fnm
  _0.si
  ...
  segments_1


What I want (using simple mv/rename):

/user/username/merged
  _0.fdt
  _0.fdx
  _0.fnm
  _0.si
  ...
  _1.fdt
  _1.fdx
  _1.fnm
  _1.si
  ...
  segments_1
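
To make the renaming part concrete, here is a rough sketch of how the move plan
could be computed.  It only works out source -> destination names; it does not
touch HDFS, and it does not build the merged segments_1 file, which is the part
I don't know how to do.  The class and method names (SegmentRenamePlan, plan,
segmentPrefix) are made up for illustration, and I am assuming segment names are
"_" followed by a base-36 counter.  I have also not checked whether any of the
files record their own segment name internally, in which case a plain rename
would not be enough.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: plans mv/rename operations only.
class SegmentRenamePlan {

    // Returns the segment prefix of a file name ("_0" for "_0.fdt"),
    // or null for non-segment files such as segments_1.
    static String segmentPrefix(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = 1;
        while (end < fileName.length()
                && Character.isLetterOrDigit(fileName.charAt(end))) {
            end++;
        }
        return end > 1 ? fileName.substring(0, end) : null;
    }

    // Maps each source path to a destination path under destDir, giving every
    // source segment a fresh name from one shared counter so that files from
    // different source directories cannot collide.
    static Map<String, String> plan(Map<String, List<String>> filesByDir,
                                    String destDir) {
        Map<String, String> moves = new LinkedHashMap<>();
        long counter = 0;
        for (Map.Entry<String, List<String>> dir : filesByDir.entrySet()) {
            // Old prefix -> new prefix, scoped to this source directory.
            Map<String, String> renamed = new LinkedHashMap<>();
            for (String file : dir.getValue()) {
                String oldPrefix = segmentPrefix(file);
                if (oldPrefix == null) {
                    continue; // skip segments_N and anything else unrecognised
                }
                String newPrefix = renamed.get(oldPrefix);
                if (newPrefix == null) {
                    // Assumption: names are "_" + counter in base 36.
                    newPrefix = "_" + Long.toString(counter++, Character.MAX_RADIX);
                    renamed.put(oldPrefix, newPrefix);
                }
                String suffix = file.substring(oldPrefix.length());
                moves.put(dir.getKey() + "/" + file,
                          destDir + "/" + newPrefix + suffix);
            }
        }
        return moves;
    }
}
```

With the two directories above as input and /user/username/merged as destDir,
the files under MR_output/0 keep the _0 prefix and the files under MR_output/1
become _1.*, which matches the layout I am after.  Each entry in the returned
map would then be one rename call on HDFS.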




Thanks,

Shaun

