lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: manually merging Directories
Date Wed, 24 Dec 2014 01:55:03 GMT
I doubt this is going to work. I have to ask why you're
worried about the I/O; this smacks of premature
optimization. Not only do the files have to be moved, but
the right control structures need to be in place to inform
Solr (well, Lucene) exactly what files are current. There's
a lot of room for programming errors here....

segments_N is the file that tells Lucene which segments are active. There can
only be one active segments_N per index, so you'd have to somehow combine all
of them into one.
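
For illustration, here's a minimal sketch of what segments_N records, reading
the latest commit with SegmentInfos (recent Lucene API shown; the path is
hypothetical, and 4.x spells parts of this differently, e.g. getDocCount()
instead of maxDoc()):

    import org.apache.lucene.index.SegmentCommitInfo;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class ListSegments {
        public static void main(String[] args) throws Exception {
            // Parse the latest segments_N and print the segments it declares active.
            try (Directory dir = FSDirectory.open(Paths.get("/user/username/MR_output/0"))) {
                SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
                for (SegmentCommitInfo sci : infos) {
                    System.out.println(sci.info.name + " maxDoc=" + sci.info.maxDoc());
                }
            }
        }
    }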

I think this is a dubious proposition at best, all to avoid some
I/O. How much I/O are we talking here? If it's a huge amount,
I'm not at all sure you'll be able to _use_ your merged index.
How many docs are we talking about? 100M? 10B? I mean
you used M/R on it in the first place for a reason....

But this is what the --go-live option of the MapReduceIndexerTool already does
for you. Admittedly, it copies things around the network to the final
destination, but personally I'd just use that.
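
For reference, a --go-live run looks roughly like this (the jar name, paths,
and ZooKeeper address are illustrative and vary by Solr version and
distribution):

    hadoop jar solr-map-reduce-*.jar org.apache.solr.hadoop.MapReduceIndexerTool \
      --morphline-file morphline.conf \
      --output-dir hdfs://nameservice1/user/username/outdir \
      --zk-host zk1:2181/solr \
      --collection collection1 \
      --go-live \
      hdfs://nameservice1/user/username/indir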

As you can tell, I don't know all the details well enough to say it's
impossible, but IMO this feels like wasted effort with lots of possibilities to
get it wrong for little demonstrated benefit. You'd spend a lot more time
trying to figure out the correct thing to do, and then fixing bugs, than you'll
spend waiting for the copy, HDFS or no.
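
For comparison, the supported route is only a few lines; here's a minimal
sketch using IndexWriter.addIndexes (recent Lucene API shown; 4.x also wants a
Version in IndexWriterConfig, and on HDFS you'd open an HdfsDirectory rather
than an FSDirectory):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            // Paths are illustrative; they match the layout described below.
            try (Directory dest = FSDirectory.open(Paths.get("/user/username/merged"));
                 Directory src0 = FSDirectory.open(Paths.get("/user/username/MR_output/0"));
                 Directory src1 = FSDirectory.open(Paths.get("/user/username/MR_output/1"));
                 IndexWriter writer = new IndexWriter(dest,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                writer.addIndexes(src0, src1); // copies the source segment files into dest
                writer.commit();               // writes the single segments_N for the result
            }
        }
    }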

Best,
Erick

On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
<shaun.senecal@lithium.com> wrote:
> Hi
>
> I have a number of Directories which are stored in various paths on HDFS, and I would
> like to merge them into a single index.  The obvious way to do this is to use
> IndexWriter.addIndexes(...); however, I'm hoping I can do better.  Since I have created
> each of the separate indexes using Map/Reduce, I know that there are no deleted or
> duplicate documents and the codecs are the same.  Using addIndexes(...) will incur a lot
> of I/O as it copies from the source Directory into the dest Directory, and this is the
> bit I would like to avoid.  Would it instead be possible to simply move each of the
> segments from each path into a single path on HDFS using a mv/rename operation?
> Obviously I would need to take care of the naming to ensure the files from one index
> don't overwrite another's, but it looks like this is done with a counter of some sort so
> that the latest segment can be found.  A potential complication is the segments_1 file,
> as I'm not sure what that is for or whether I can easily (re)construct it externally.
>
> The end goal here is to index using Map/Reduce and then spit out a single index at the
> end that has been merged down to a single segment, while minimizing I/O along the way.
> Once I have the completed index in a single Directory, I can (optionally) perform the
> forced merge (which will incur a huge I/O hit).  If the forced merge isn't performed on
> HDFS, it could be done on the search nodes before the active searcher is switched.  This
> may be better if, for example, you know all of your search nodes have SSDs and I/O to
> spare.
>
> Just in case my explanation above wasn't clear enough, here is a picture
>
> What I have:
>
> /user/username/MR_output/0
>   _0.fdt
>   _0.fdx
>   _0.fnm
>   _0.si
>   ...
>   segments_1
>
> /user/username/MR_output/1
>   _0.fdt
>   _0.fdx
>   _0.fnm
>   _0.si
>   ...
>   segments_1
>
>
> What I want (using simple mv/rename):
>
> /user/username/merged
>   _0.fdt
>   _0.fdx
>   _0.fnm
>   _0.si
>   ...
>   _1.fdt
>   _1.fdx
>   _1.fnm
>   _1.si
>   ...
>   segments_1
>
>
>
>
> Thanks,
>
> Shaun
>
