lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Volodymyr Bychkoviak <vbychkov...@i-hypergrid.com>
Subject Re: Compound / non-compound index files and SIGKILL
Date Tue, 06 Jun 2006 09:53:30 GMT
If your content handlers should respond quickly then you should move 
indexing process to separate thread and maintain items in queue.

Rob Staveley (Tom) wrote:
> This is a real eye-opener, Volodymyr. Many thanks. I guess that means that
> my orphan-producing hangs must be addDocument() calls, and not in the
> content handlers, as I'd previously assumed. I'll put some debug before and
> after my addDocument() calls to confirm (and point my writer's infoStream to
> System.out).
>
> -----Original Message-----
> From: Volodymyr Bychkoviak [mailto:vbychkoviak@i-hypergrid.com] 
> Sent: 05 June 2006 18:33
> To: java-user@lucene.apache.org
> Subject: Re: Compound / non-compound index files and SIGKILL
>
> Hi.
> My five cents :)
>
> It might be helpful to know how lucene is working with compound files. 
> When segment is flushed to disk it is written uncompound and after that is
> merged into single .cfs file. If you don't change default setting for using
> compound files (which is on) this is only place (I guess) for these files to
> appear.
>
> If you're working with large indexes, than merging segments can take a while
> (Maybe here is your problem? :) ) (merging happens on
> addDocument() call).  If you will kill indexing process during such merge
> you'll get many orphaned files...
>
> You can just run optimize on this index. You'll get three files: 
> segments, deletable, <segment>.cfs; you can look name of segment in
> 'segments' file. Everything else is 'garbage' - you can delete it.
>
>
> Rob Staveley (Tom) wrote:
>   
>> I've been indexing live data into a compound index from an MTA. I'm
>> resolving a bunch of problems unrelated to Lucene (disparate hangs in my
>> content handlers). When I get a hang, I typically need to kill my daemon,
>> alas more often than not using kill -9 (SIGKILL).
>>
>> However, these SIGKILLs are leaving large temporary(?) files, which I
>>     
> guess
>   
>> are non-compound index files transiently extracted from the working .cfs
>> files:
>>
>> -rw-r--r--    1  373138432 Jun  2 13:42 _18hup.fdt
>> -rw-r--r--    1      5054464 Jun  2 13:42 _18hup.fdx
>> -rw-r--r--    1              426 Jun  2 13:42 _18hup.fnm
>>
>> -rw-r--r--    1  457253888 Jun  2 09:22 _15djq.fdt
>> -rw-r--r--    1      6205440 Jun  2 09:22 _15djq.fdx
>> -rw-r--r--    1              426 Jun  2 09:21 _15djq.fnm
>>
>> They are left intact after restarting my daemon. Presumably they are not
>> treated as being part of the compound index. I see no corresponding .cfs
>> file for them. 
>>
>> As a consequence of these - I suspect - I am getting a very large overall
>> disk requirement for my index, presumably because of replicated field
>>     
> data.
>   
>> My guess is that the field data in the orphaned .fdt files needs to be
>> regenerated.
>>
>> In another index directory from a previous test run (again with SIGKILLs),
>>     
> I
>   
>> have 98 GB of index files, with only 12 BG devoted to compound files for
>>     
> the
>   
>> field index (.cfs). The rest of the disk space is used by orphaned
>> uncompounded index files; I see 51 GB devoted to uncompounded field data
>> (.fdt), 13 BG devoted to term positions (.prx) and 13 BG devoted to term
>> frequencies (.frq).
>>
>> Here's my question:
>>
>> How can I attempt to merge these orphaned into the compound index, using
>> IndexWriter.addIndexes(), or would I be foolish attempting this?
>>   
>>     
>
>   

-- 
regards,
Volodymyr Bychkoviak


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message