Date: Wed, 7 Jun 2006 12:17:12 -0700 (PDT)
From: Chris Hostetter
To: java-user@lucene.apache.org
Subject: RE: Compound / non-compound index files and SIGKILL

: However, I'm not sure what to make of:
: --------8<--------
: Thread 3740: (state = BLOCKED)
:  - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
:  - java.lang.Object.wait() @bci=2, line=474 (Compiled frame)
: Error occurred during stack walking:
: java.lang.NullPointerException
:         at sun.jvm.hotspot.runtime.Frame.addressOfStackSlot(Frame.java:214)
: --------8<--------

Crap. I don't know what that means, but it certainly looks bad ... if the
JVM can't make sense of your stack, I can't imagine how there would be any
hope of your program recovering.

: Does that mean that the PipedReader in the following might persist beyond
: the scope of this code, and be read from only when the Lucene document is
: added to the index?

I'm not sure what exactly your process method is doing, but it is
certainly true that Lucene won't do anything to read from that PipedReader
in this code ... all the new Field constructor will do is hang on to a
reference to that Reader, and all the Document.add call will do is hang on
to a reference to that Field. It isn't until you add that document using
IndexWriter.addDocument that anything is read from that Reader.

That's actually the whole point of making a Field from a Reader -- if
you're slurping a big chunk of data off disk, or a network connection, or
something like that, giving Lucene a Reader allows you to pipeline the
data all the way to the Analyzer without ever needing a full copy of it in
memory. Which means that if the thing producing your data and streaming it
to that PipedReader chokes and causes your app to crash, it will crash in
the middle of writing out your Document.
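To make that lifecycle concrete, here is a minimal sketch of both
approaches -- the lazy Reader-backed field just described, and the fully
buffered String field suggested below. It assumes the Lucene 1.9/2.0-era
API; the index path is hypothetical:

--------8<--------
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FieldLifecycleSketch {
    public static void main(String[] args) throws Exception {
        Document doc = new Document();

        // Lazy: the Field just hangs on to the Reader; nothing is read
        // here, and nothing is read by Document.add() either.
        Reader body = new StringReader("a big body of text, streamed later");
        doc.add(new Field("body", body));

        // Buffered: the whole value is already in memory, so nothing can
        // choke mid-stream while the document is being written out.
        doc.add(new Field("title", "a fully buffered value",
                          Field.Store.NO, Field.Index.TOKENIZED));

        IndexWriter writer = new IndexWriter("/tmp/demo-index",
                                             new StandardAnalyzer(), true);
        writer.addDocument(doc); // only now is the "body" Reader consumed
        writer.close();
    }
}
--------8<--------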
Try buffering all the data for each doc in memory as a String and building
a Field with that -- it may not prevent your app from crashing (that
sounds like a problem somewhere else in your code) but it may prevent your
index from getting corrupt when it does crash. I say it "may" prevent it
-- because if you are multithreading your app then there's no guarantee
that one thread won't cause a crash while another thread is in the middle
of writing data to your index.

: --------8<--------
: final PipedWriter pw = new PipedWriter();
: Thread t = new Thread() {
:     public void run() {
:         try {
:             // Index the body text in the Lucene document,
:             // but do not store it
:             doc.add(new Field("body", new PipedReader(pw)));
:         }
:         catch (IOException e) {
:             e.printStackTrace();
:         }
:     }
: };
: t.start();
:
: // Process an input stream for the content handler.
: // Tokens extracted from the stream are written to
: // the piped writer and hence made available to the
: // PipedReader used in the "body" field constructor.
: process(is, pw);
:
: // Close the output stream to get the PipedReader to see EOF
: pw.close();
:
: // Join the thread to wait for the field to be added to the document
: t.join();
:
: // Now go on to add other fields (metadata), and then add
: // the document to the index...
: --------8<--------
:
: -----Original Message-----
: From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
: Sent: 06 June 2006 20:13
: To: java-user@lucene.apache.org
: Subject: RE: Compound / non-compound index files and SIGKILL
:
: 1) have you tried forcing a threaddump of the JVM when it hangs to see
: what it's doing? (I don't remember which signal it is off the top of my
: head, but even if it's not responding to SIGTERM it might respond to
: that)
:
: : SIGTERM. I guess I'd feel more confident about using SIGKILL, if I
: : knew that the uninterruptible hanged thread was creating a Document,
: : which I could interrupt without corrupting the index, rather than
: : adding the document to the index, which is liable to result in
: : orphaned files and/or a corrupted index, if killed.
:
: 2) It's possible that the thread is doing both (creating and adding) at
: the same time ... if you are constructing documents using Fields that
: contain Readers you get back from convertors which stream data from
: complex documents as needed, then DocumentWriter may have started to
: write your document, gotten to a Field with a Reader, and then your
: convertor may be choking on something within the source document while
: it tries to stream data to that Reader.
:
: ...just a theory.
:
: : -----Original Message-----
: : From: Volodymyr Bychkoviak [mailto:vbychkoviak@i-hypergrid.com]
: : Sent: 06 June 2006 10:54
: : To: java-user@lucene.apache.org
: : Subject: Re: Compound / non-compound index files and SIGKILL
: :
: : If your content handlers should respond quickly then you should move
: : the indexing process to a separate thread and maintain items in a
: : queue.
: :
: : Rob Staveley (Tom) wrote:
: : > This is a real eye-opener, Volodymyr. Many thanks. I guess that
: : > means that my orphan-producing hangs must be addDocument() calls,
: : > and not in the content handlers, as I'd previously assumed. I'll put
: : > some debug before and after my addDocument() calls to confirm (and
: : > point my writer's infoStream to System.out).
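As an aside, here is a minimal sketch of the debug tracing Rob describes
-- bracketing each addDocument() call with markers and pointing the
writer's infoStream at System.out. It assumes the 1.9/2.0-era API (where
setInfoStream takes a PrintStream; older releases exposed infoStream as a
public field); the index path and document are hypothetical:

--------8<--------
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class AddDocumentTracing {
    static void addWithTracing(IndexWriter writer, Document doc)
            throws Exception {
        System.out.println("before addDocument()");
        writer.addDocument(doc); // a hang between the two markers points here
        System.out.println("after addDocument()");
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/demo-index",
                                             new StandardAnalyzer(), true);
        writer.setInfoStream(System.out); // logs segment merges as they run

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        addWithTracing(writer, doc);

        writer.close();
    }
}
--------8<--------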
: : > -----Original Message-----
: : > From: Volodymyr Bychkoviak [mailto:vbychkoviak@i-hypergrid.com]
: : > Sent: 05 June 2006 18:33
: : > To: java-user@lucene.apache.org
: : > Subject: Re: Compound / non-compound index files and SIGKILL
: : >
: : > Hi.
: : > My five cents :)
: : >
: : > It might be helpful to know how Lucene works with compound files.
: : > When a segment is flushed to disk it is written in non-compound form
: : > and after that is merged into a single .cfs file. If you don't
: : > change the default setting for using compound files (which is on),
: : > this is the only place (I guess) for these files to appear.
: : >
: : > If you're working with large indexes, then merging segments can take
: : > a while (maybe here is your problem? :) ) (merging happens on the
: : > addDocument() call). If you kill the indexing process during such a
: : > merge you'll get many orphaned files...
: : >
: : > You can just run optimize on this index. You'll get three files:
: : > segments, deletable, and the .cfs; you can look up the name of the
: : > segment in the 'segments' file. Everything else is 'garbage' -- you
: : > can delete it.
: : >
: : > Rob Staveley (Tom) wrote:
: : >
: : >> I've been indexing live data into a compound index from an MTA. I'm
: : >> resolving a bunch of problems unrelated to Lucene (disparate hangs
: : >> in my content handlers). When I get a hang, I typically need to
: : >> kill my daemon, alas more often than not using kill -9 (SIGKILL).
: : >>
: : >> However, these SIGKILLs are leaving large temporary(?) files, which
: : >> I guess are non-compound index files transiently extracted from the
: : >> working .cfs files:
: : >>
: : >> -rw-r--r-- 1 373138432 Jun 2 13:42 _18hup.fdt
: : >> -rw-r--r-- 1   5054464 Jun 2 13:42 _18hup.fdx
: : >> -rw-r--r-- 1       426 Jun 2 13:42 _18hup.fnm
: : >>
: : >> -rw-r--r-- 1 457253888 Jun 2 09:22 _15djq.fdt
: : >> -rw-r--r-- 1   6205440 Jun 2 09:22 _15djq.fdx
: : >> -rw-r--r-- 1       426 Jun 2 09:21 _15djq.fnm
: : >>
: : >> They are left intact after restarting my daemon. Presumably they
: : >> are not treated as being part of the compound index. I see no
: : >> corresponding .cfs file for them.
: : >>
: : >> As a consequence of these -- I suspect -- I am getting a very large
: : >> overall disk requirement for my index, presumably because of
: : >> replicated field data. My guess is that the field data in the
: : >> orphaned .fdt files needs to be regenerated.
: : >>
: : >> In another index directory from a previous test run (again with
: : >> SIGKILLs), I have 98 GB of index files, with only 12 GB devoted to
: : >> compound files for the field index (.cfs). The rest of the disk
: : >> space is used by orphaned uncompounded index files; I see 51 GB
: : >> devoted to uncompounded field data (.fdt), 13 GB devoted to term
: : >> positions (.prx) and 13 GB devoted to term frequencies (.frq).
: : >>
: : >> Here's my question:
: : >>
: : >> How can I attempt to merge these orphaned files into the compound
: : >> index, using IndexWriter.addIndexes(), or would I be foolish
: : >> attempting this?
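For reference, a minimal sketch of the optimize()-based cleanup Volodymyr
suggests above -- again assuming the 1.9/2.0-era API, with a hypothetical
index path. After it runs, anything other than the segments file, the
deletable file, and the surviving .cfs should be the orphaned garbage he
describes:

--------8<--------
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeCleanup {
    public static void main(String[] args) throws Exception {
        // Open the existing index (create=false) and collapse it into a
        // single segment; with compound files on (the default), that
        // leaves one .cfs plus the segments and deletable files.
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();
    }
}
--------8<--------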
: : --
: : regards,
: : Volodymyr Bychkoviak

: -Hoss

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org