Date: Mon, 3 Jan 2011 08:35:36 -0800 (PST)
From: Alex vB
To: java-user@lucene.apache.org
Subject: Indexing large XML dumps

Hello everybody,

I am currently indexing Wikipedia dumps and building an index for versioned document collections. So far everything is working fine, but I never expected single Wikipedia articles to reach a size of around 2 GB! One article, for example, has 20000 versions with an average length of 60000 characters each (HUGE in memory!). As a result I need a heap of around 4 GB to perform the indexing, and I would like to reduce my memory consumption ;).

At the moment I load every Wikipedia article, with all of its versions, completely into memory. I then collect some statistical data about the article in order to store extra information about term occurrences, which is written into the index as payloads. These statistics are gathered in a separate tokenization run that happens before the document is written to the index, which means I am analyzing every document twice! :(

I know there is a CachingTokenFilter, but I haven't figured out how and where exactly to use it (I tried it in my Analyzer, but stream.reset() does not seem to work there). Does somebody have a nice example? A rough sketch of the pattern I have in mind is at the end of this mail.

1) Can I somehow avoid loading a complete article just to compute my statistics?
2) Is it possible to index large files without loading them completely into a field?
3) How can I avoid parsing an article twice?

Best regards
Alex

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-large-XML-dumps-tp2185926p2185926.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
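
P.S. Here is a minimal, untested sketch of the single-pass pattern I have in mind, written against the 3.0 API. The class name, the indexVersion() method and the collectStatistics() call are just placeholders of mine, and I may well be misreading how CachingTokenFilter is meant to be used:

import java.io.StringReader;

import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Version;

// Sketch only: analyze one article version once, cache the tokens, read them
// for my statistics, then rewind the cache and feed the same stream to the
// IndexWriter instead of analyzing the raw text a second time.
public class SingleCachedPassSketch {

    void indexVersion(IndexWriter writer, String versionText) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream stream = analyzer.tokenStream("content", new StringReader(versionText));
        CachingTokenFilter cachedStream = new CachingTokenFilter(stream);

        // First pass: consuming the filter fills its cache and lets me
        // build the term statistics for the payloads.
        TermAttribute term = cachedStream.addAttribute(TermAttribute.class);
        while (cachedStream.incrementToken()) {
            // collectStatistics(term.term());  // placeholder for my own bookkeeping
        }

        // Rewind to the start of the cache instead of re-analyzing the text.
        cachedStream.reset();

        // Hand the pre-analyzed, cached stream to a Field so the IndexWriter
        // consumes the cached tokens rather than tokenizing the text again.
        Document doc = new Document();
        doc.add(new Field("content", cachedStream));
        writer.addDocument(doc);
    }
}

Is this roughly how it is supposed to be wired up, or does the caching have to happen somewhere else entirely?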