lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: manually merging Directories
Date Tue, 30 Dec 2014 13:19:52 GMT
Hi Shaun,

you can actually do this relatively simply. In fact, most of the files are indeed copied as-is,
so you can theoretically change the logic to do a simple rename instead. Files that cannot be
copied unmodified and need to be changed by IndexWriter will be handled as usual.

You don't need to patch Lucene for this: IndexWriter calls Directory#copy(Directory to, String
src, String dest, IOContext context) for those files that can be copied unmodified. What you
need to do is just create an oal.store.FilterDirectory that wraps the original FSDirectory and
override this copy method on it to do a rename instead, like:

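import java.io.IOException;
import java.nio.file.Files;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;

public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
  public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
    super(dir);
  }

  // Called by IndexWriter for files that can be taken over unmodified.
  @Override
  public void copy(Directory to, String src, String dest, IOContext context) throws IOException {
    if (!(to instanceof FSDirectory)) {
      throw new IOException("This only works for target FSDirectories");
    }
    final FSDirectory fromFS = (FSDirectory) this.getDelegate(), toFS = (FSDirectory) to;
    // Rename (move) the file instead of copying its bytes.
    Files.move(fromFS.getDirectory().resolve(src), toFS.getDirectory().resolve(dest));
  }
}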

Please be aware that you have to wrap the "source" directory, because IndexWriter's copySegmentAsIs()
calls this method on the directory that's passed to addIndexes(Directory). Something like:

writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));

After that, all files that could not be copied unmodified remain in the source directory,
but all those that are copied as-is will have been moved and will disappear from the source directory.
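
To show just the file-system side of this in isolation, here is a minimal sketch that uses only
java.nio.file and no Lucene at all. The class name RenameDemo, its helper methods and the choice
of which file gets moved are made up for illustration; in the real case IndexWriter decides which
files are taken over as-is. Note that Files.move is only a cheap rename when source and target
are on the same file system; otherwise the JDK may fall back to copying and deleting.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RenameDemo {
  /** Moves one file between directories; throws if the destination already exists. */
  public static void moveFile(Path fromDir, Path toDir, String src, String dest)
      throws IOException {
    Files.move(fromDir.resolve(src), toDir.resolve(dest));
  }

  /** Returns the sorted file names found directly inside a directory. */
  public static List<String> names(Path dir) throws IOException {
    try (Stream<Path> files = Files.list(dir)) {
      return files.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical stand-ins for the source and target index directories.
    Path source = Files.createTempDirectory("source");
    Path target = Files.createTempDirectory("target");
    Files.write(source.resolve("_0.fdt"), new byte[] {1, 2, 3});
    Files.write(source.resolve("_0.si"), new byte[] {4});

    // Arbitrary split for the demo: one file is moved under a new name, the other stays.
    moveFile(source, target, "_0.fdt", "_1.fdt");

    System.out.println("source: " + names(source)); // prints "source: [_0.si]"
    System.out.println("target: " + names(target)); // prints "target: [_1.fdt]"
  }
}
```

The moved file is gone from the source directory afterwards, so the source index is no longer
usable on its own once addIndexes has run through such a wrapper.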

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Shaun Senecal [mailto:shaun.senecal@lithium.com]
> Sent: Tuesday, December 30, 2014 12:37 AM
> To: Lucene Users
> Subject: Re: manually merging Directories
> 
> Hi Mike
> 
> That's actually what I was looking at doing, I was just hoping there was a way
> to avoid the "copySegmentAsIs" step and simply replace it with a "rename"
> operation on the file system.  It seemed like low hanging fruit, but Uwe and
> Erick have now told me that the segments have dependencies embedded in
> them somehow, so a simple rename operation wouldn't accomplish the
> same thing.  In the end, it may not be a big deal anyway.
> 
> 
> Thanks
> 
> Shaun
> 
> 
> ________________________________________
> From: Michael McCandless <lucene@mikemccandless.com>
> Sent: December 29, 2014 2:43 PM
> To: Lucene Users
> Subject: Re: manually merging Directories
> 
> Why not use IW.addIndexes(Directory[])?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <uwe@thetaphi.de>
> wrote:
> > Hi,
> >
> > Why not simply leave each index directory on the searcher nodes as is:
> > Move all index directories (as mentioned by you) to a local disk and access
> > them using a MultiReader - there is no need to merge them if you do not have
> > enough resources. If you have enough CPU and IO power, just merge them
> > as usual with IndexWriter.addIndexes(). But I don't understand your
> > argument with I/O: If you copy the index files from HDFS to local disks
> > already, how can this work without I/O? So you can merge them anyways.
> >
> > Merging index files, simply by copying them all in one directory, is
> > impossible, because the files reference each other by segment name
> > (segments_n refers to them, also the segment ids are used all over). So you
> > would need to change some index files already for merge to make the
> > SegmentInfos structures use the correct names, so you can do a real merge
> > anyways.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> >> -----Original Message-----
> >> From: Shaun Senecal [mailto:shaun.senecal@lithium.com]
> >> Sent: Monday, December 29, 2014 6:34 PM
> >> To: java-user
> >> Subject: Re: manually merging Directories
> >>
> >> I'm not worried about the I/O right now, I'm "hoping I can do
> >> better", that's all.  It sounds like the only actual complication
> >> here is building the segments_N file, which would list all of the
> >> newly renamed segments, so perhaps this isn't impossible.  That said,
> >> you're absolutely right about the possibility of complications, so
> >> it's debatable if doing something like this would be worth it in the
> >> end.  Thanks for the info
> >>
> >>
> >>
> >> Shaun
> >>
> >>
> >> ________________________________________
> >> From: Erick Erickson <erickerickson@gmail.com>
> >> Sent: December 23, 2014 5:55 PM
> >> To: java-user
> >> Subject: Re: manually merging Directories
> >>
> >> I doubt this is going to work. I have to ask why you're worried about
> >> the I/O; this smacks of premature optimization. Not only do the files
> >> have to be moved, but the right control structures need to be in
> >> place to inform Solr (well, Lucene) exactly what files are current.
> >> There's a lot of room for programming errors here....
> >>
> >> segments_n is the file that tells Lucene which segments are active.
> >> There can only be one that's active so you'd have to somehow combine
> >> them all.
> >>
> >> I think this is a dubious proposition at best, all to avoid some I/O.
> >> How much I/O are we talking here? If it's a huge amount, I'm not at
> >> all sure you'll be able to _use_ your merged index.
> >> How many docs are we talking about? 100M? 10B? I mean you used M/R
> >> on it in the first place for a reason....
> >>
> >> But this is what the --go-live option of the MapReduceIndexerTool
> >> already does for you. Admittedly, it copies things around the network
> >> to the final destination, personally I'd just use that.
> >>
> >> As you can tell, I don't know all the details to say it's impossible,
> >> but IMO this feels like wasted effort with lots of ways to
> >> get it wrong for little demonstrated benefit. You'd spend a lot more
> >> time trying to figure out the correct thing to do and then fixing
> >> bugs than you'll spend waiting for the copy, HDFS or not.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
> >> <shaun.senecal@lithium.com> wrote:
> >> > Hi
> >> >
> >> > I have a number of Directories which are stored in various paths on
> >> > HDFS, and I would like to merge them into a single index.  The obvious
> >> > way to do this is to use IndexWriter.addIndexes(...), however, I'm
> >> > hoping I can do better.  Since I have created each of the separate
> >> > indexes using Map/Reduce, I know that there are no deleted or duplicate
> >> > documents and the codecs are the same.  Using addIndexes(...) will
> >> > incur a lot of I/O as it copies from the source Directory into the
> >> > dest Directory, and this is the bit I would like to avoid.  Would it
> >> > be possible to simply move each of the segments from each
> >> > path into a single path on HDFS using a mv/rename operation instead?
> >> > Obviously I would need to take care of the naming to ensure the files
> >> > from one index don't overwrite another's, but it looks like this is
> >> > done with a counter of some sort so that the latest segment can be
> >> > found. A potential complication is the segments_1 file, as I'm not sure
> >> > what that is for or if I can easily (re)construct it externally.
> >> >
> >> > The end goal here is to index using Map/Reduce and then spit out a
> >> > single index in the end that has been merged down to a single segment,
> >> > and to minimize IO while doing it.  Once I have the completed index in a
> >> > single Directory, I can (optionally) perform the forced merge (which
> >> > will incur a huge IO hit).  If the forced merge isn't performed on
> >> > HDFS, it could be done on the search nodes before the active searcher
> >> > is switched.  This may be better if, for example, you know all of
> >> > your search nodes have SSDs and IO to spare.
> >> >
> >> > Just in case my explanation above wasn't clear enough, here is a
> >> > picture
> >> >
> >> > What I have:
> >> >
> >> > /user/username/MR_output/0
> >> >   _0.fdt
> >> >   _0.fdx
> >> >   _0.fnm
> >> >   _0.si
> >> >   ...
> >> >   segments_1
> >> >
> >> > /user/username/MR_output/1
> >> >   _0.fdt
> >> >   _0.fdx
> >> >   _0.fnm
> >> >   _0.si
> >> >   ...
> >> >   segments_1
> >> >
> >> >
> >> > What I want (using simple mv/rename):
> >> >
> >> > /user/username/merged
> >> >   _0.fdt
> >> >   _0.fdx
> >> >   _0.fnm
> >> >   _0.si
> >> >   ...
> >> >   _1.fdt
> >> >   _1.fdx
> >> >   _1.fnm
> >> >   _1.si
> >> >   ...
> >> >   segments_1
> >> >
> >> >
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Shaun
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org

