lucy-user mailing list archives

From "Gupta, Rajiv" <Rajiv.Gu...@netapp.com>
Subject RE: [lucy-user] LUCY_Folder_Open_Out_IMP at core/Lucy/Store/Folder.c line 119
Date Wed, 07 Dec 2016 12:23:00 GMT
Thanks Nick for your reply. Thanks Peter too. 

> This looks like two processes are writing to the index at once. This shouldn't happen unless
> something with our locking mechanism is broken. Do you have an unusual setup? Are you perhaps
> running on NFS?

Yes, I have an unusual setup. Let me try to explain it.

* My application is a test application that runs many test cases in parallel, which generates
a lot of log files. I'm using Lucy to index those log files for faster search, pagination, and
summary generation.
* From my application I kick off a lucyindexer script using Open3; it is primarily responsible
for indexing all the files while the tests are running. The output and error streams of
lucyindexer go to STDOUT, which is redirected to a log file (see the Open3 sketch after this list).
* My application generates log files from 4 different sources. Information about newly created
log files, and about files that have reached their end, is stored in 4 different tables in our
database.
* In my lucyindexer main module I use EV watchers. To monitor the tables I use EV::periodic
watchers for new entries (5 sec) and for file completion (10 sec), an EV::stat watcher (1 sec)
for file changes (effectively just another periodic check, since EV::stat can't get change
events on NFS), and an EV::io watcher to detect the broken pipe so that the indexer process
exits once my test application ends (see the EV watcher sketch after this list).
* With each watcher, when I get a new log file it follows this workflow: scan through the file
with a very limited set of keywords, opening it and reading line by line, and create a Lucy doc
based on defined regular expressions. If the end time is available from the DB, I insert a
special end-marker doc indicating the end of the file; the file is removed from my list after
the end marker is added. The end marker also stores the last line number and the last seek
pointer. If no end time is available for the file yet, I keep a 1-sec stat watch for new changes
and add Lucy docs incrementally. For every new file that comes in, I use a Lucy search to check
whether that file was opened previously; if I find the file name, I get its last line number and
seek pointer from the end marker, delete that doc (the end marker) using the Lucy indexer's
delete-by-query, and start reading the file from there for further changes. Once the file is
closed again, I insert the end marker again. Once I get the broken pipe from my parent test
application, I keep a 2-minute buffer to insert end markers for in-progress files (see the
end-marker sketch after this list).
* The index directory for all these files is under the same folder, named .lucyindexer/1 (that
name is fixed). Multiple files are indexed into the same folder, but it is rare (I have never
actually seen it) that they conflict while creating docs. I say this because one version of the
application is already out and generating docs; however, it has the problem that when the same
file is opened again it re-indexes it in full, which takes time and creates duplicates. That is
why I added the search before adding docs for those files. I could also keep the file list in
memory, but since the list sometimes grows to 100k files (for long-running tests), the system
runs out of memory and becomes very slow.
* The indexer and the log files are on an NFS mount.
* I have also observed that EV sometimes ends prematurely (without EV::break being called), but
I'm not sure whether that is caused by an indexer error; no error is reported at that time.
* In my viewer application I run forked Lucy searches to consolidate data from all the folders.
The list of folders sometimes reaches 1000. I tried PolySearcher but did not find it faster than
forking (see the PolySearcher sketch after this list).
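
In case it helps, the Open3 kickoff looks roughly like the sketch below. The command line, log
path, and the simple draining loop are illustrative only; a real version would multiplex the two
pipes with IO::Select rather than draining them one after the other.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use IPC::Open3;
    use Symbol 'gensym';

    # Illustrative command line; the real lucyindexer script takes its own options.
    my @cmd = ( 'perl', 'lucyindexer.pl', '--index', '.lucyindexer/1' );

    my $err = gensym;    # separate handle for the child's STDERR
    my $pid = open3( my $in, my $out, $err, @cmd );

    open my $log, '>>', 'lucyindexer.log' or die "open log: $!";

    # Naive draining: copy the child's STDOUT, then its STDERR, into the log.
    # A production version should use IO::Select so neither pipe blocks.
    while ( my $line = <$out> ) { print {$log} $line }
    while ( my $line = <$err> ) { print {$log} $line }

    waitpid( $pid, 0 );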
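
The EV watcher setup, in simplified form, is something like this. The poll_* callbacks stand in
for the real DB polling code, and the log path and intervals are illustrative.

    use strict;
    use warnings;
    use EV;

    # Placeholder callbacks standing in for the real DB polling code.
    sub poll_db_for_new_files { warn "checking DB for new log files\n" }
    sub poll_db_for_completed { warn "checking DB for completed files\n" }
    sub reindex_changed_file  { warn "re-scanning $_[0]\n" }

    my $w_new  = EV::periodic 0,  5, 0, sub { poll_db_for_new_files() };   # new entries, every 5s
    my $w_done = EV::periodic 0, 10, 0, sub { poll_db_for_completed() };   # file completion, every 10s

    # EV::stat falls back to polling on NFS, so this is effectively a 1-second periodic check.
    my $w_stat = EV::stat '/nfs/logs/test1.log', 1,
        sub { reindex_changed_file( $_[0]->path ) };

    # The pipe from the parent test application: read returning 0 means the parent
    # has gone away, so stop the loop (after flushing end markers in the real code).
    my $w_pipe = EV::io \*STDIN, EV::READ, sub {
        my $n = sysread STDIN, my $buf, 4096;
        EV::break unless $n;
    };

    EV::run;    # EV::loop in older EV releases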
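
The search-before-add handling of the end markers is, stripped down, roughly this. The index
path and the field names (file, type, last_line, last_seek) are just how I have it laid out and
are illustrative.

    use strict;
    use warnings;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::ANDQuery;
    use Lucy::Search::TermQuery;
    use Lucy::Index::Indexer;

    my $index = '.lucyindexer/1';    # illustrative path

    # 1. Look for an existing end marker for this file.
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    my $query    = Lucy::Search::ANDQuery->new( children => [
        Lucy::Search::TermQuery->new( field => 'file', term => 'test1.log' ),
        Lucy::Search::TermQuery->new( field => 'type', term => 'end_marker' ),
    ]);
    my $hits = $searcher->hits( query => $query, num_wanted => 1 );

    my ( $last_line, $last_seek ) = ( 0, 0 );
    if ( my $hit = $hits->next ) {
        ( $last_line, $last_seek ) = ( $hit->{last_line}, $hit->{last_seek} );
    }

    # 2. Drop the old end marker and resume indexing from the stored offset.
    my $indexer = Lucy::Index::Indexer->new( index => $index );
    $indexer->delete_by_query($query);
    # ... seek to $last_seek, read new lines, $indexer->add_doc(\%doc) for each ...
    $indexer->commit;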
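
For comparison, the PolySearcher variant of the consolidated search looks roughly like this
(folder glob and field names are illustrative):

    use strict;
    use warnings;
    use Lucy::Search::IndexSearcher;
    use Lucy::Search::PolySearcher;

    # Illustrative list of per-test index folders.
    my @folders   = glob '/nfs/results/*/.lucyindexer/1';
    my @searchers = map { Lucy::Search::IndexSearcher->new( index => $_ ) } @folders;

    my $poly = Lucy::Search::PolySearcher->new(
        schema    => $searchers[0]->get_schema,
        searchers => \@searchers,
    );

    my $hits = $poly->hits( query => 'error', num_wanted => 50 );
    while ( my $hit = $hits->next ) {
        print "$hit->{file}: $hit->{content}\n";    # field names are illustrative
    }

The forked variant simply runs one IndexSearcher per folder in child processes and merges the
hits in the parent, and in my runs it was at least as fast as the above.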


My Lucy library version is 0.4.2. I have asked my infra team to upgrade, which may take a month
or so.

Here is what is happening in parallel most of the time:

Search->Found->delete doc->add doc->commit
add doc->commit
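
One detail that may matter: my Indexer and IndexSearcher objects are created with just the index
path. If I read Lucy::Docs::FileLocking correctly, an index shared over NFS should be opened with
an IndexManager carrying a unique host id on both the indexing and the search side. A hedged
sketch of that setup (paths are illustrative):

    use strict;
    use warnings;
    use Sys::Hostname qw( hostname );
    use Lucy::Index::IndexManager;
    use Lucy::Index::Indexer;
    use Lucy::Index::IndexReader;
    use Lucy::Search::IndexSearcher;

    # A unique per-host id so lock files created over NFS can be told apart.
    my $manager = Lucy::Index::IndexManager->new( host => hostname() );

    # Index time: pass the manager to the Indexer.
    my $indexer = Lucy::Index::Indexer->new(
        index   => '.lucyindexer/1',    # illustrative path
        manager => $manager,
    );

    # Search time: open the reader with the same manager, then wrap it.
    my $reader = Lucy::Index::IndexReader->open(
        index   => '.lucyindexer/1',
        manager => $manager,
    );
    my $searcher = Lucy::Search::IndexSearcher->new( index => $reader );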

Thanks for reading this far. I'm open to any suggestions. I really like this framework and see a
big opportunity for my company's internal triaging strategy in linking it with product logs for
more effective results.

You guys rock!

Thanks,
Rajiv Gupta

-----Original Message-----
From: Nick Wellnhofer [mailto:wellnhofer@aevum.de] 
Sent: Wednesday, December 07, 2016 4:46 PM
To: user@lucy.apache.org
Subject: Re: [lucy-user] LUCY_Folder_Open_Out_IMP at core/Lucy/Store/Folder.c line 119

On 06/12/2016 17:17, Gupta, Rajiv wrote:
> Any idea why I'm getting this error.
>
> Error Invalid path: 'seg_9i/lextemp'
> 20161205 184114 [] [event_check_for_logfile_completion_in_db][FAILED at DB Query to check
> logfile completion][Error Invalid path: 'seg_9i/lextemp'
> 20161205 184114 []  LUCY_Folder_Open_Out_IMP at 
> core/Lucy/Store/Folder.c line 119
> 20161205 184114 []  S_lazy_init at core/Lucy/Index/PostingListWriter.c 
> line 92
>
>
> In another log file getting different error
>
> Error rename from '<Dir>/.lucyindex/1/schema.temp' to '<Dir> 
> /.lucyindex/1/schema_an.json' failed: Invalid argument
> 20161205 174146 []  LUCY_Schema_Write_IMP at core/Lucy/Plan/Schema.c 
> line 429
>
> When committing the indexer object.

This looks like two processes are writing to the index at once. This shouldn't happen unless
something with our locking mechanism is broken. Do you have an unusual setup? Are you perhaps
running on NFS?

> In both the case I'm seeing one common pattern that time is getting skewed up in the
> STDOUT log file by 5-6 hrs before starting the process on this file. In actual system time
> is not changed.

I don't fully understand this paragraph. Can you clarify?

Nick

