lucene-java-user mailing list archives

From Shaun Senecal <shaun.sene...@lithium.com>
Subject Re: manually merging Directories
Date Mon, 29 Dec 2014 23:37:02 GMT
Hi Mike

That's actually what I was looking at doing. I was just hoping there was a way to avoid the
"copySegmentAsIs" step and simply replace it with a "rename" operation on the file system.
It seemed like low-hanging fruit, but Uwe and Erick have now told me that the segments have
dependencies embedded in them (the files reference each other by segment name), so a simple
rename operation wouldn't accomplish the same thing.  In the end, it may not be a big deal anyway.


Thanks

Shaun
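
To make the first obstacle to the mv/rename idea concrete, here is a minimal sketch in plain Java (no Lucene dependency). It takes the file listings from the original question quoted below and reports which names would clobber each other if the directories were moved into one path as-is. The class and method names are hypothetical and exist only for this illustration. Note that avoiding the collisions by renaming is not sufficient on its own: as Uwe explains in the quoted text, the files also reference each other by segment name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
public final class SegmentNameCollisions {
    private SegmentNameCollisions() {}

    /** Returns the file names that appear in more than one of the given directory listings. */
    public static SortedSet<String> collisions(List<List<String>> directoryListings) {
        Set<String> seen = new HashSet<>();
        SortedSet<String> clashes = new TreeSet<>();
        for (List<String> listing : directoryListings) {
            // De-duplicate within one directory so only cross-directory clashes count.
            for (String name : new HashSet<>(listing)) {
                if (!seen.add(name)) {
                    clashes.add(name);
                }
            }
        }
        return clashes;
    }

    public static void main(String[] args) {
        // Listings taken from the question: each Map/Reduce output starts at segment _0.
        List<String> output0 = Arrays.asList("_0.fdt", "_0.fdx", "_0.fnm", "_0.si", "segments_1");
        List<String> output1 = Arrays.asList("_0.fdt", "_0.fdx", "_0.fnm", "_0.si", "segments_1");
        System.out.println("Colliding names: " + collisions(Arrays.asList(output0, output1)));
    }
}
```

With the two listings above, every file name collides, including segments_1, which is the file Erick describes as telling Lucene which segments are active.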


________________________________________
From: Michael McCandless <lucene@mikemccandless.com>
Sent: December 29, 2014 2:43 PM
To: Lucene Users
Subject: Re: manually merging Directories

Why not use IndexWriter.addIndexes(Directory...)?

Mike McCandless

http://blog.mikemccandless.com
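
For readers following along, a minimal sketch of the two options raised in this thread: merging with IndexWriter.addIndexes(), and the MultiReader alternative that Uwe describes in the quoted text below. It assumes the Lucene 5.0 or later API (IndexWriterConfig built from an Analyzer alone, no-argument StandardAnalyzer). The class name MergeSketch and its method names are hypothetical, and how each Directory is opened (local disk, HDFS, or otherwise) is left to the caller.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.Directory;

// Hypothetical helper, for illustration only.
public final class MergeSketch {
    private MergeSketch() {}

    /** Adds every source index to dest; optionally merges the result down to one segment. */
    public static void mergeInto(Directory dest, boolean forceToOneSegment, Directory... sources)
            throws IOException {
        // No documents are analyzed here; the analyzer is only needed to build the config.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dest, config)) {
            writer.addIndexes(sources);   // copies the source segments into dest
            if (forceToOneSegment) {
                writer.forceMerge(1);     // the expensive step: rewrites everything
            }
            writer.commit();
        }
    }

    /** Opens the source indexes side by side without merging them. */
    public static IndexReader openCombined(Directory... sources) throws IOException {
        IndexReader[] readers = new IndexReader[sources.length];
        try {
            for (int i = 0; i < sources.length; i++) {
                readers[i] = DirectoryReader.open(sources[i]);
            }
            // Closing the returned MultiReader also closes the sub-readers.
            return new MultiReader(readers);
        } catch (IOException | RuntimeException e) {
            // Do not leak readers that were opened before the failure.
            for (IndexReader reader : readers) {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException suppressed) {
                        e.addSuppressed(suppressed);
                    }
                }
            }
            throw e;
        }
    }
}
```

The forced merge is kept behind a flag because, as the original question notes, it can be deferred to the search nodes if they have I/O to spare.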


On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi,
>
> Why not simply leave each index directory on the searcher nodes as is?
> Move all index directories (as you mentioned) to a local disk and access them using
> a MultiReader. There is no need to merge them if you do not have enough resources. If you have
> enough CPU and I/O power, just merge them as usual with IndexWriter.addIndexes(). But I don't
> understand your argument about I/O: if you already copy the index files from HDFS to local disks,
> how can this work without I/O? So you can merge them anyway.
>
> Merging index files simply by copying them all into one directory is impossible, because
> the files reference each other by segment name (segments_n refers to them, and the segment
> IDs are used all over). You would already need to change some index files for the merge to
> make the SegmentInfos structures use the correct names, so you can do a real merge anyway.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Shaun Senecal [mailto:shaun.senecal@lithium.com]
>> Sent: Monday, December 29, 2014 6:34 PM
>> To: java-user
>> Subject: Re: manually merging Directories
>>
>> I'm not worried about the I/O right now, I'm "hoping I can do better", that's
>> all.  It sounds like the only actual complication here is building the
>> segments_N file, which would list all of the newly renamed segments, so
>> perhaps this isn't impossible.  That said, you're absolutely right about the
>> possibility of complications, so it's debatable whether doing something like this
>> would be worth it in the end.  Thanks for the info.
>>
>>
>>
>> Shaun
>>
>>
>> ________________________________________
>> From: Erick Erickson <erickerickson@gmail.com>
>> Sent: December 23, 2014 5:55 PM
>> To: java-user
>> Subject: Re: manually merging Directories
>>
>> I doubt this is going to work. I have to ask why you're worried about the I/O;
>> this smacks of premature optimization. Not only do the files have to be
>> moved, but the right control structures need to be in place to inform Solr
>> (well, Lucene) exactly what files are current. There's a lot of room for
>> programming errors here....
>>
>> segments_n is the file that tells Lucene which segments are active. There can
>> only be one that's active so you'd have to somehow combine them all.
>>
>> I think this is a dubious proposition at best, all to avoid some I/O. How much
>> I/O are we talking here? If it's a huge amount, I'm not at all sure you'll be able
>> to _use_ your merged index.
>> How many docs are we talking about? 100M? 10B? I mean you used M/R on it
>> in the first place for a reason....
>>
>> But this is what the --go-live option of the MapReduceIndexerTool already
>> does for you. Admittedly, it copies things around the network to the final
>> destination, but personally I'd just use that.
>>
>> As you can tell, I don't know all the details well enough to say it's impossible, but IMO this
>> feels like wasted effort, with lots of ways to get it wrong, for little
>> demonstrated benefit. You'd spend a lot more time trying to figure out the
>> correct thing to do and then fixing bugs than you'll spend waiting for the copy,
>> HDFS or no.
>>
>> Best,
>> Erick
>>
>> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
>> <shaun.senecal@lithium.com> wrote:
>> > Hi
>> >
>> > I have a number of Directories which are stored in various paths on HDFS,
>> and I would like to merge them into a single index.  The obvious way to do
>> this is to use IndexWriter.addIndexes(...), however, I'm hoping I can do
>> better.  Since I have created each of the separate indexes using
>> Map/Reduce, I know that there are no deleted or duplicate documents and
>> the codecs are the same.  Using addIndexes(...) will incur a lot of I/O as it
>> copies from the source Directory into the dest Directory, and this is the bit I
>> would like to avoid.  Would it instead be possible to simply move each of the
>> segments from each path into a single path on HDFS using a mv/rename
>> operation instead?  Obviously I would need to take care of the naming to
>> ensure the files from one index don't overwrite another's, but it looks like
>> this is done with a counter of some sort so that the latest segment can be
>> found. A potential complication is the segments_1 file, as I'm not sure what
>> that is for or if I can easily (re)construct it externally.
>> >
>> > The end goal here is to index using Map/Reduce and then spit out a single
>> index in the end that has been merged down to a single segment, and to
>> minimize IO while doing it.  Once I have the completed index in a single
>> Directory, I can (optionally) perform the forced merge (which will incur a
>> huge IO hit).  If the forced merge isn't performed on HDFS, it could be done
>> on the search nodes before the active searcher is switched.  This may be
>> better if, for example, you know all of your search nodes have SSDs and IO to
>> spare.
>> >
>> > Just in case my explanation above wasn't clear enough, here is a
>> > picture:
>> >
>> > What I have:
>> >
>> > /user/username/MR_output/0
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   segments_1
>> >
>> > /user/username/MR_output/1
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   segments_1
>> >
>> >
>> > What I want (using simple mv/rename):
>> >
>> > /user/username/merged
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   _1.fdt
>> >   _1.fdx
>> >   _1.fnm
>> >   _1.si
>> >   ...
>> >   segments_1
>> >
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Shaun
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org