lucene-java-user mailing list archives

From "Rob Staveley (Tom)" <>
Subject RE: Compound / non-compound index files and SIGKILL
Date Tue, 06 Jun 2006 11:41:02 GMT
This is a good idea. I had been worried about the additional heap
requirement maintaining a queue, without being able to serialize/deserialize
Documents (i.e. a build up of Lucene Documents in RAM). I have been
marshalling addDocument() calls using a synchronized object; the same
threads have been taking responsibility for creating Documents
(unsynchronized) and adding them to the index writer (synchronized). I guess
I could have a one-Document queue feeding a single addDocument thread, which
would effectively be the same approach, but which would make it easier to
ensure that only the create Document thread is killed when I get a SIGTERM
and the addDocument thread is left to run its course (assuming it hasn't

Having said that, I'm not sure what I could do in a shutdown hook that
wouldn't already have been done by a SIGTERM to get the hung thread to
terminate. The reason for SIGKILL was that the daemon wouldn't be killed by
SIGTERM. I guess I'd feel more confident about using SIGKILL if I knew that
the uninterruptible hung thread was creating a Document, which I could
interrupt without corrupting the index, rather than adding the document to
the index, which is liable to result in orphaned files and/or a corrupted
index if killed.
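
The single-writer arrangement discussed here could be sketched roughly as follows. This is only an illustration under stated assumptions: `SingleWriterQueue` and its `sink` callback are hypothetical names, the sink stands in for whatever code actually calls addDocument(), and no Lucene API is used. The queue capacity of one means at most one finished Document waits in RAM.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: many threads create documents, one thread adds them. */
final class SingleWriterQueue<D> {
    // Capacity 1: producers block instead of piling documents up on the heap.
    private final BlockingQueue<D> queue = new ArrayBlockingQueue<>(1);
    private final Thread writer;
    private volatile boolean closed = false;

    /** sink is an assumption: it represents the code that adds a document to the index. */
    SingleWriterQueue(Consumer<D> sink) {
        writer = new Thread(() -> {
            try {
                // Run until close() has been requested and the queue is drained.
                while (!closed || !queue.isEmpty()) {
                    D doc = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (doc != null) {
                        sink.accept(doc);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "add-document");
        writer.start();
    }

    /** Called by document-creating threads; blocks while the writer is busy. */
    void submit(D doc) throws InterruptedException {
        if (closed) {
            throw new IllegalStateException("queue is closed");
        }
        queue.put(doc);
    }

    /**
     * Stops accepting work and waits for the writer thread to finish.
     * Producers are assumed to have been stopped before this is called.
     */
    void close() throws InterruptedException {
        closed = true;
        writer.join();
    }
}
```

A shutdown hook triggered by SIGTERM could stop the Document-creating threads and then call close(), leaving the adding thread to run its course. No code in the JVM gets a chance to run on SIGKILL, so this does not help in that case.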

-----Original Message-----
From: Volodymyr Bychkoviak [] 
Sent: 06 June 2006 10:54
Subject: Re: Compound / non-compound index files and SIGKILL

If your content handlers need to respond quickly, then you should move the
indexing process to a separate thread and keep the items in a queue.

Rob Staveley (Tom) wrote:
> This is a real eye-opener, Volodymyr. Many thanks. I guess that means 
> that my orphan-producing hangs must be addDocument() calls, and not in 
> the content handlers, as I'd previously assumed. I'll put some debug 
> before and after my addDocument() calls to confirm (and point my 
> writer's infoStream to System.out).
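
The before/after debug output mentioned here could look something like the sketch below. `CallTracer.traced` is a hypothetical helper, and the Runnable stands in for the real addDocument() call; a "before" line with no matching "after" line would point at the wrapped call as the place where the hang occurs.

```java
import java.io.PrintStream;

/** Hypothetical helper: prints a line before and after a call, with elapsed time. */
final class CallTracer {
    static void traced(String label, PrintStream out, Runnable call) {
        out.println("before " + label);
        long start = System.nanoTime();
        try {
            call.run();
        } finally {
            // Printed even if the call throws; never printed if the call hangs.
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            out.println("after " + label + " (" + elapsedMs + " ms)");
        }
    }
}
```

For example, `CallTracer.traced("addDocument", System.out, () -> { /* add the document here */ });` wraps a single call.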
> -----Original Message-----
> From: Volodymyr Bychkoviak []
> Sent: 05 June 2006 18:33
> To:
> Subject: Re: Compound / non-compound index files and SIGKILL
> Hi.
> My five cents :)
> It might be helpful to know how Lucene works with compound files.
> When a segment is flushed to disk it is written in non-compound form and
> is then merged into a single .cfs file. If you don't change the default
> setting for using compound files (which is on), this is the only place (I
> guess) where these files can appear.
> If you're working with large indexes, then merging segments can take a
> while (maybe this is your problem? :) ) (merging happens on the
> addDocument() call). If you kill the indexing process during such a
> merge you'll get many orphaned files...
> You can just run optimize on this index. You'll get three files:
> segments, deletable, <segment>.cfs; you can look up the segment name in
> the 'segments' file. Everything else is 'garbage' - you can delete it.
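
The cleanup rule in this message (after an optimize, keep `segments`, `deletable` and `<segment>.cfs`) could be checked with a sketch like the one below before deleting anything. `OrphanFinder` is a hypothetical name, and the set of live segment names is assumed to be supplied by the caller after looking it up in the 'segments' file; that file is not parsed here. The sketch only lists candidates, deletes nothing, and assumes no writer has the index open.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: lists files that the rule above would treat as garbage. */
final class OrphanFinder {
    static List<String> findOrphanCandidates(Path indexDir, Set<String> liveSegments)
            throws IOException {
        List<String> candidates = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String name = file.getFileName().toString();
                // The two bookkeeping files named in the message are always kept.
                if (name.equals("segments") || name.equals("deletable")) {
                    continue;
                }
                // A .cfs file is kept only if its segment name is in the live set.
                if (name.endsWith(".cfs")
                        && liveSegments.contains(name.substring(0, name.length() - 4))) {
                    continue;
                }
                candidates.add(name);
            }
        }
        Collections.sort(candidates);
        return candidates;
    }
}
```

Reviewing the returned list by hand before removing files seems prudent, since anything not matching the three expected names is reported, whatever its origin.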
> Rob Staveley (Tom) wrote:
>> I've been indexing live data into a compound index from an MTA. I'm 
>> resolving a bunch of problems unrelated to Lucene (disparate hangs in 
>> my content handlers). When I get a hang, I typically need to kill my 
>> daemon, alas more often than not using kill -9 (SIGKILL).
>> However, these SIGKILLs are leaving large temporary(?) files, which I
>> guess are non-compound index files transiently extracted from the working
>> .cfs files:
>> -rw-r--r--    1  373138432 Jun  2 13:42 _18hup.fdt
>> -rw-r--r--    1      5054464 Jun  2 13:42 _18hup.fdx
>> -rw-r--r--    1              426 Jun  2 13:42 _18hup.fnm
>> -rw-r--r--    1  457253888 Jun  2 09:22 _15djq.fdt
>> -rw-r--r--    1      6205440 Jun  2 09:22 _15djq.fdx
>> -rw-r--r--    1              426 Jun  2 09:21 _15djq.fnm
>> They are left intact after restarting my daemon. Presumably they are 
>> not treated as being part of the compound index. I see no 
>> corresponding .cfs file for them.
>> As a consequence of these - I suspect - I am getting a very large
>> overall disk requirement for my index, presumably because of
>> replicated field data.
>> My guess is that the field data in the orphaned .fdt files needs to
>> be regenerated.
>> In another index directory from a previous test run (again with
>> I have 98 GB of index files, with only 12 GB devoted to compound files
>> for the field index (.cfs). The rest of the disk space is used by orphaned
>> uncompounded index files; I see 51 GB devoted to uncompounded field
>> data (.fdt), 13 GB devoted to term positions (.prx) and 13 GB devoted
>> to term frequencies (.frq).
>> Here's my question:
>> How can I attempt to merge these orphaned files into the compound index,
>> using IndexWriter.addIndexes(), or would I be foolish attempting this?

Volodymyr Bychkoviak
