incubator-lucy-user mailing list archives

From goran kent <gorank...@gmail.com>
Subject [lucy-user] Merging indexes efficiently
Date Wed, 14 Sep 2011 06:37:59 GMT
Hello,

Accessing http://incubator.apache.org/lucy/docs/perl/ is spotty at the
moment, so I'm reading the man pages...

I take it $index->add_index($other_index) is the method to merge
multiple indexes?

I'm thinking of the most efficient way to merge a batch of thousands of indexes:

pseudo-code:

# TRY1 - hell for leather
$master_index = Lucy::Index::Indexer->new...
foreach $sub_index (...) {
    $master_index->add_index($sub_index);
}
$master_index->commit;

Now, I imagine this is no problem for a handful of sub_indexes, but
what are the risks when it involves thousands?  Are there any
limitations or pitfalls I should be aware of when doing this?

# TRY2 - tippy-toe
$cnt=0;
$MAX=1000;
foreach $sub_index (...) {
    $master_index->add_index($sub_index);
    if ($cnt++ > $MAX) {
        $cnt=0;
        $master_index->commit();
        $master_index = Lucy::Index::Indexer->new($master_index, create=>0,truncate=>0...);
    }
}
$master_index->commit unless $already_committed;

or,

# TRY3 - depending on whether I grok prepare_commit()
$cnt=0;
foreach $sub_index (...) {
    $master_index->add_index($sub_index);
    if ($cnt++ > $MAX) { $cnt=0; $master_index->prepare_commit(); }
}
$master_index->commit;


The question is also what the most efficient $MAX is (I imagine it
depends on RAM, if data is held in memory until a commit).  Or should
I not overcomplicate things, simply let Lucy worry about the
internals, and go with TRY1?  TRY2 gives me an opportunity to check
the on-disk size of $master_index after each commit (are the buffers
flushed and everything written to disk at commit, so that
qx(du -sh $master) reflects the actual size?).  I lean towards TRY2,
or will TRY3 also commit to disk?
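
Spelled out a little more, the TRY2 batching might look roughly like
the sketch below.  It is only an illustration: merge_in_batches and
the $new_indexer factory are hypothetical names of mine, not part of
Lucy, and the sketch assumes that an indexer should not be reused
after commit(), so a fresh one is requested for each batch.

```perl
use strict;
use warnings;

# Merge sub-indexes into a master index, committing after every
# $batch_size add_index() calls.
#
# $new_indexer is a caller-supplied coderef (hypothetical) returning a
# fresh indexer object for the master index.  The object only needs
# add_index() and commit() methods.
sub merge_in_batches {
    my ( $new_indexer, $sub_indexes, $batch_size ) = @_;
    my ( $indexer, $pending, $commits ) = ( undef, 0, 0 );

    for my $sub_index (@$sub_indexes) {
        # Create an indexer lazily, so none is left open without work.
        $indexer ||= $new_indexer->();
        $indexer->add_index($sub_index);

        if ( ++$pending >= $batch_size ) {
            $indexer->commit;
            $commits++;
            # Assumption: the indexer is finished after commit(),
            # so drop it and start a new one for the next batch.
            ( $indexer, $pending ) = ( undef, 0 );
        }
    }

    # Commit the final, partial batch (if any).
    if ($pending) {
        $indexer->commit;
        $commits++;
    }
    return $commits;    # number of commits performed
}
```

With Lucy, the factory would presumably wrap the Indexer constructor,
e.g. sub { Lucy::Index::Indexer->new( index => $master_path ) }, where
$master_path is a placeholder for the master index location.  The
on-disk size could then be checked between batches.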

Comments?

Thanks
