From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Date: Tue, 3 Apr 2007 03:21:32 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Message-ID: <24163011.1175595692639.JavaMail.jira@brutus>
In-Reply-To: <16648050.1174583194037.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486293 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

To do
the benchmarking, I created a simple standalone tool
(demo/IndexLineFiles, in the last patch) that indexes one line at a
time from a large, previously created file, optionally using multiple
threads. I do it this way to minimize the IO cost of pulling the
document source, because I want to measure just the indexing time as
much as possible.

Each line is read and a doc is created with a field "contents" that is
not stored, is tokenized, and optionally has term vectors with
positions and offsets. I also optionally add two small stored-only
fields ("path" and "modified"). I think these are fairly trivial
documents compared to typical usage of Lucene.

For the corpus, I took Europarl's "en" content, stripped the tags, and
processed it into 3 plain-text files: one with 100 tokens per line
(= ~550 bytes), one with 1000 tokens per line (= ~5,500 bytes) and
one with 10000 tokens per line (= ~55,000 bytes).

All settings (mergeFactor, compound file, etc.) are left at their
defaults. I don't optimize the index at the end. I'm using my new
SimpleSpaceAnalyzer (it just splits tokens on the space character and
creates the token text as a slice into a char[] array instead of
new String(...)) to minimize the cost of tokenization.

I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
dual core) OS X box with 2 GB RAM. I give Java a 1 GB heap (-Xmx1024m).

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
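
For readers unfamiliar with the "slice into a char[] array" idea that the comment attributes to SimpleSpaceAnalyzer, here is a minimal sketch of that technique. It is not the code from the patch: the class name SpaceSliceTokenizer, the Slice type and the tokenize method are hypothetical names chosen for illustration, and the sketch deliberately avoids Lucene's Analyzer API. It only shows how tokens can be recorded as (offset, length) pairs over a shared buffer so that no String is allocated per token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the SimpleSpaceAnalyzer from the LUCENE-843 patch.
public class SpaceSliceTokenizer {

    /** A token expressed as a slice of a shared char[] buffer (no String allocated). */
    public static final class Slice {
        public final int offset;
        public final int length;

        Slice(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits buffer[0..len) on the space character, skipping empty tokens. */
    public static List<Slice> tokenize(char[] buffer, int len) {
        List<Slice> slices = new ArrayList<Slice>();
        int start = -1; // start of the current token, or -1 when between tokens
        for (int i = 0; i < len; i++) {
            if (buffer[i] == ' ') {
                if (start >= 0) {
                    slices.add(new Slice(start, i - start));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            // Last token runs to the end of the buffer.
            slices.add(new Slice(start, len - start));
        }
        return slices;
    }

    public static void main(String[] args) {
        char[] line = "resumption of the session".toCharArray();
        for (Slice s : tokenize(line, line.length)) {
            // A String is created here only for display.
            System.out.println(new String(line, s.offset, s.length));
        }
    }
}
```

A consumer that hashes or compares terms can work directly on buffer[offset .. offset+length), which is where the saving over new String(...) per token would come from.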