lucy-user mailing list archives

From "Gupta, Rajiv" <Rajiv.Gu...@netapp.com>
Subject [lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders
Date Wed, 14 Sep 2016 07:05:12 GMT
I'm creating indexes in multiple subfolders under one parent folder.

The indexes are created in separate folders because the files are created in parallel and I
want to avoid segment locking between multiple indexers.

One of my applications creates the directory structure with lots of log files within different
subfolders.

I'm indexing all those files in parallel as and when they are created.

The directory structure looks like this:
TopDir/00_log.log
      /01_log2.log
      /.lucyindexer/1/seg_1
                     /seg_2
      /03_log3.log
      /03_log3/log31.log
              /log32.log
              /.lucyindexer/1/seg_1
                             /seg_2
              /log32/log321.log
                    /log322.log
                    /.lucyindexer/1/seg_1
                                   /seg_2
                                 /2/seg_1



This works fine: while my application is running, all log files get indexed as well.
Search is a separate application that does the following:
1.    Scans all the directories down to .lucyindexer/1 and builds a list of all such folders.
I use File::Find<https://metacpan.org/pod/File::Find> to do that.
2.    Creates a searcher for each folder with Lucy::Search::IndexSearcher<https://metacpan.org/pod/Lucy::Search::IndexSearcher>
in a loop and adds all the searchers to a Lucy::Search::PolySearcher<https://metacpan.org/pod/Lucy::Search::PolySearcher>.
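
A minimal sketch of the directory scan in step 1, not the original application code. The sub name find_index_dirs is hypothetical, and the sketch assumes every index sits in a directory named .lucyindexer/1 (other numbered subdirectories are ignored). Setting $File::Find::prune stops File::Find from descending into the index directories themselves, which avoids walking every segment file.

```perl
use strict;
use warnings;
use File::Find ();

# Return every directory under $top whose path ends in ".lucyindexer/1".
# find_index_dirs is a hypothetical helper name used for illustration.
sub find_index_dirs {
    my ($top) = @_;
    my @index_dirs;

    File::Find::find(
        sub {
            # File::Find has chdir'ed into the parent, so $_ is a bare name.
            return unless -d $_ && $_ eq '.lucyindexer';

            # Record the index directory if it exists (assumed name: "1").
            push @index_dirs, "$File::Find::name/1" if -d "$_/1";

            # Do not walk the segment files inside the index directory.
            $File::Find::prune = 1;
        },
        $top,
    );

    my @sorted = sort @index_dirs;
    return @sorted;
}
```

Calling find_index_dirs('TopDir') would yield the list that the loop below iterates over as @all_dirs.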

My code looks like this:


my $schema;
my @searchers;

for my $index ( @all_dirs ) {
    chomp $index;
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    push @searchers, $searcher;
    $schema = $searcher->get_schema;
}

# PolySearcher is the only way to get all search results combined.
my $poly_searcher = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@searchers,
);

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $poly_searcher->get_schema,
    fields => ['title'],
);

# Build up a Query.
my $q = "1 2 3 4 5 6 7 11 12 13 14 18";

my $query = $query_parser->parse( $q );

# Execute the Query and get a Hits object.
my $hits = $poly_searcher->hits(
    query      => $query,
    num_wanted => -1,       # -1 equivalent to all results

    # sort_spec => $sort_spec,
);

while ( my $hit = $hits->next ) {

    ## Do some operation
}


This runs and returns the expected results. However, the performance is really bad when the
directory structure is deeply nested.
I did profiling using Devel::NYTProf<https://metacpan.org/pod/Devel::NYTProf> and found
two places where the maximum time was taken:
1.    Scanning the directory tree. (I will try to solve this by generating the list of index
directories while the application is generating the indexes.)
2.    Creating the searchers with Lucy::Search::IndexSearcher. This takes the most time
when run in a loop over all indexed directories.
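
The idea in item 1 could be sketched as a shared manifest file: each indexer appends its index directory when it creates one, and the search application reads the file instead of walking the tree. This is only an illustration under assumptions; record_index_dir, read_index_dirs, and the manifest path are hypothetical names, and it assumes all processes can write to one file on a filesystem where flock works.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Indexer side: append one index directory to the shared manifest.
# The exclusive lock is held only for the duration of a single write.
sub record_index_dir {
    my ( $manifest, $index_dir ) = @_;
    open my $fh, '>>', $manifest or die "Cannot open $manifest: $!";
    flock( $fh, LOCK_EX ) or die "Cannot lock $manifest: $!";
    print {$fh} "$index_dir\n";
    close $fh or die "Cannot close $manifest: $!";    # flushes and unlocks
    return;
}

# Searcher side: read the manifest back, dropping blank and duplicate lines.
sub read_index_dirs {
    my ($manifest) = @_;
    open my $fh, '<', $manifest or die "Cannot open $manifest: $!";
    flock( $fh, LOCK_SH ) or die "Cannot lock $manifest: $!";
    my ( %seen, @dirs );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line eq '' || $seen{$line}++;
        push @dirs, $line;
    }
    close $fh;
    return @dirs;
}
```

The search application would then use the result of read_index_dirs in place of the File::Find scan.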
To solve item #2, I tried to create the Lucy::Search::IndexSearcher objects for the different
index folders in parallel using Parallel::ForkManager<https://metacpan.org/pod/Parallel::ForkManager>,
but I got the following error:
The storable module was unable to store the child's data structure to the temp file "/tmp/Parallel-ForkManager-27339-27366.txt":
Storable serialization not implemented for Lucy::Search::IndexSearcher at /usr/software/lib/perl5/site_perl/5.14.0/x86_64-linux-thread-multi/Clownfish.pm
line 93
I used the following code:
my $pm = Parallel::ForkManager->new( $max_procs );

$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $index ) = @_;
        print Dumper $index;
        push( @searchers, $index );
    }
);

for my $index ( @all_dirs ) {
    chomp $index;
    my $forkpid = $pm->start( $index ) and next;    #fork
    my $indexer = Lucy::Search::IndexSearcher->new( index => $index );
    $pm->finish( 0, \$indexer );
}

$pm->wait_all_children;
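
The Storable error arises because Parallel::ForkManager passes a child's result to the parent by serializing it with Storable, and, per the message above, an IndexSearcher cannot be serialized. One possible workaround, sketched here under assumptions, is to keep each searcher inside the child that created it and send back only plain Perl data. The sketch uses core fork, pipe, and Storable instead of Parallel::ForkManager so that it runs without extra modules. The name map_in_children and the worker callback are hypothetical; in real use the worker would open the IndexSearcher, run the query, and return the hit fields as plain hashes. Note that it starts one child per item with no cap such as $max_procs.

```perl
use strict;
use warnings;
use POSIX ();
use Storable qw(nfreeze thaw);

# Run $worker->($item) in one child process per item and return the results
# in input order. Each worker must return a reference to plain Perl data.
# A child that fails contributes undef to the result list.
sub map_in_children {
    my ( $worker, @items ) = @_;
    my @children;

    for my $item (@items) {
        pipe( my $reader, my $writer ) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ( $pid == 0 ) {
            # Child: compute, serialize, write, and leave immediately.
            close $reader;
            binmode $writer;
            my $frozen = eval { nfreeze( $worker->($item) ) };
            print {$writer} $frozen if defined $frozen;
            close $writer;
            POSIX::_exit( defined $frozen ? 0 : 1 );
        }

        close $writer;    # the parent keeps only the read end
        push @children, [ $pid, $reader ];
    }

    my @results;
    for my $child (@children) {
        my ( $pid, $reader ) = @$child;
        binmode $reader;
        my $frozen = do { local $/; <$reader> };    # slurp until EOF
        close $reader;
        waitpid( $pid, 0 );
        push @results,
            ( defined $frozen && length $frozen ) ? thaw($frozen) : undef;
    }
    return @results;
}
```

With this shape, only the serialized hit data crosses the process boundary, so no IndexSearcher ever has to be stored.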
This whole process takes 60 to 120 seconds for a large log directory. At the end of the
whole process I create a nested JSON object from all search results to display using jQuery.
I'm looking for ideas to improve its performance. Is there a way to create multiple searchers
using Parallel::ForkManager or some other method, or another way to improve search performance?
Also, is there any way I can merge all the indexes in one place?
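
On the merging question, Lucy::Index::Indexer provides an add_index method that absorbs an existing index into the one being written, provided the schemas match. The following is a hedged sketch, not a tested recipe for this setup: merge_indexes is a hypothetical helper, it assumes every source index shares one schema and that no indexer is still writing to the sources during the merge, and it borrows the schema from the first source index.

```perl
use strict;
use warnings;

# Merge every index in @source_dirs into a single index at $target_dir.
# merge_indexes is a hypothetical helper; all sources must share a schema.
sub merge_indexes {
    my ( $target_dir, @source_dirs ) = @_;
    die "merge_indexes: no source indexes given\n" unless @source_dirs;

    require Lucy;    # loaded lazily so argument errors surface first

    # Borrow the schema from the first source index.
    my $schema = Lucy::Search::IndexSearcher->new( index => $source_dirs[0] )
        ->get_schema;

    my $indexer = Lucy::Index::Indexer->new(
        index  => $target_dir,
        schema => $schema,
        create => 1,
    );

    # Absorb each per-directory index, then write everything in one commit.
    $indexer->add_index($_) for @source_dirs;
    $indexer->commit;

    return $target_dir;
}
```

A single IndexSearcher opened on the merged directory could then replace the PolySearcher, at the cost of rerunning the merge whenever the source indexes change.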
Thanks,
Rajiv Gupta
