From: Ian Lea <ian.lea@gmail.com>
Date: Thu, 21 Oct 2010 09:39:11 +0100
Subject: Re: how to index large number of files?
To: java-user@lucene.apache.org

Maybe mostly 1K, but you only need one very large doc to cause a problem.

I haven't been following this thread, so apologies if I've missed things,
but you seem to be having problems running what should be a simple job, of
the sort that lucene handles every day without breaking a sweat.

Does it always fail on the same doc, or after indexing the same number of
docs?  What exactly is the exception and stack trace?  Are you sure the
problem is in your lucene calls rather than something else you are doing in
the same program?  Cutting things down to the smallest, simplest possible
standalone program or test case that demonstrates the problem often helps.
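A stripped-down harness along these lines is usually a good start (a rough
sketch only - the index path, docs path and field name are placeholders,
not taken from your code). Printing progress as it goes also gives you the
numbers asked for below:

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MinimalIndexTest {
        public static void main(String[] args) throws Exception {
            // Placeholder paths: point these at a small sample of the real files.
            File docDir = new File("/tmp/sample-docs");
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/test-index")),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            int count = 0;
            for (File f : docDir.listFiles()) {
                Document doc = new Document();
                doc.add(new Field("contents", new FileReader(f)));
                writer.addDocument(doc);
                count++;
                // Progress report: pinpoints the failing doc and its size.
                if (count % 1000 == 0) {
                    System.out.println(count + " docs; last file " + f.getName()
                            + " (" + f.length() + " bytes)");
                }
            }
            writer.close();
            System.out.println("Done: " + count + " docs indexed.");
        }
    }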
If you do that and can't find the problem yourself, post the code here
along with the stack trace and details such as how many docs it processes
before failure and the size of the doc it fails on.


--
Ian.


On Thu, Oct 21, 2010 at 4:01 AM, Sahin Buyrukbilen wrote:
> by the way, file size is not big - mostly 1kB. I am working on wikipedia
> articles in txt format.
>
> On Wed, Oct 20, 2010 at 11:01 PM, Sahin Buyrukbilen <
> sahin.buyrukbilen@gmail.com> wrote:
>
>> Unfortunately neither method went through. I am getting the memory error
>> even at reading the directory contents.
>>
>> Now I am thinking this: what if I split the 4.5 million files into
>> directories of 100,000 files (or fewer, depending on the java error),
>> index each of them separately, and merge those indexes (if possible)?
>>
>> Any suggestions?
>>
>> On Wed, Oct 20, 2010 at 5:47 PM, Erick Erickson wrote:
>>
>>> My first guess is that you're accumulating too many documents in memory
>>> before the flush gets triggered. The quick-n-dirty way to test this is
>>> to do an IndexWriter.flush after every addDocument. This will slow
>>> down indexing, but it will also tell you whether this is the problem,
>>> and you can look for more elegant solutions...
>>>
>>> You can also get some stats via IndexWriter.getRAMBufferSizeMB, and
>>> can force automatic flushes at a given RAM buffer size via
>>> IndexWriter.setRAMBufferSizeMB.
>>>
>>> One thing I'd be interested in is how big your files are. Might it be
>>> that you're trying to process a humongous file when it blows up?
>>>
>>> And if none of that helps, please post your stack trace.
>>>
>>> Best
>>> Erick
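(For reference, Erick's buffer-size suggestion might look something like
the sketch below, using indexDir, analyzer and dir as defined in the code
further down; the 32 MB buffer and 10,000-doc commit interval are arbitrary
example values, not recommendations. Merging separately built indexes, as
suggested above, should also be possible in 3.x via
IndexWriter.addIndexesNoOptimize(Directory...).)

    // Sketch: bound memory via the RAM buffer, plus periodic commits.
    IndexWriter writer = new IndexWriter(indexDir, analyzer,
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(32.0);   // flush once buffered docs pass ~32 MB
    int count = 0;
    for (File file : dir.listFiles()) {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(file)));
        writer.addDocument(doc);
        if (++count % 10000 == 0) {
            writer.commit();           // make the buffered adds durable on disk
        }
    }
    writer.commit();
    writer.close();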
>>> On Wed, Oct 20, 2010 at 2:45 PM, Sahin Buyrukbilen <
>>> sahin.buyrukbilen@gmail.com> wrote:
>>>
>>> > with the different parameters I still got the same error. My code is
>>> > very simple; indeed, I am only concerned with creating the index, and
>>> > then I will do some private information retrieval experiments on the
>>> > inverted index file, which I created with the information extracted
>>> > from the index. That is why I didn't go over optimization until now.
>>> > The database size I had was very small compared to 4.5 million.
>>> >
>>> > My code is as follows:
>>> >
>>> > public static void createIndex() throws CorruptIndexException,
>>> >         LockObtainFailedException, IOException {
>>> >     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
>>> >     Directory indexDir = FSDirectory.open(new
>>> >             File("/media/work/WIKI/indexes/"));
>>> >     boolean recreateIndexIfExists = true;
>>> >     IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
>>> >             recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);
>>> >     indexWriter.setUseCompoundFile(false);
>>> >     File dir = new File(FILES_TO_INDEX_DIRECTORY);
>>> >     File[] files = dir.listFiles();
>>> >     for (File file : files) {
>>> >         Document document = new Document();
>>> >
>>> >         //String path = file.getCanonicalPath();
>>> >         //document.add(new Field(FIELD_PATH, path, Field.Store.YES,
>>> >         //        Field.Index.NOT_ANALYZED));
>>> >
>>> >         Reader reader = new FileReader(file);
>>> >         document.add(new Field(FIELD_CONTENTS, reader));
>>> >
>>> >         indexWriter.addDocument(document);
>>> >     }
>>> >     indexWriter.optimize();
>>> >     indexWriter.close();
>>> > }
>>> >
>>> > On Wed, Oct 20, 2010 at 2:39 PM, Qi Li wrote:
>>> >
>>> > > 1. What is the difference when you used different VM parameters?
>>> > > 2. What merge policy and optimization strategy did you use?
>>> > > 3. How did you use commit or flush?
>>> > >
>>> > > Qi
>>> > >
>>> > > On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen <
>>> > > sahin.buyrukbilen@gmail.com> wrote:
>>> > >
>>> > > > Thank you so much for this info. It looks pretty complicated for
>>> > > > me, but I will try.
>>> > > >
>>> > > > On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang <
>>> > > > johnbin.wang@gmail.com> wrote:
>>> > > >
>>> > > > > You can start a fixed thread pool to index all these files in a
>>> > > > > multithreaded way. Each thread executes an indexing task which
>>> > > > > indexes a part of all the files. In the indexing task, after
>>> > > > > every 10,000 files you need to execute the indexWriter.commit()
>>> > > > > method to flush all the pending adds to disk.
>>> > > > >
>>> > > > > If you need to index all these files into only one index, you
>>> > > > > need to share a single indexWriter instance among all the
>>> > > > > indexing threads.
>>> > > > >
>>> > > > > Hope it's helpful.
>>> > > > >
>>> > > > > On Wed, Oct 20, 2010 at 1:05 PM, Sahin Buyrukbilen <
>>> > > > > sahin.buyrukbilen@gmail.com> wrote:
>>> > > > >
>>> > > > > > Thank you Johnbin,
>>> > > > > > do you know which parameter I have to play with?
>>> > > > > >
>>> > > > > > On Wed, Oct 20, 2010 at 12:59 AM, Johnbin Wang <
>>> > > > > > johnbin.wang@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > I think you can write the index once every 10,000 files, or
>>> > > > > > > fewer, have been read.
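(A rough sketch of that thread-pool approach, sharing one IndexWriter -
which is fine, since IndexWriter is thread-safe - with arbitrary example
values for the pool size, commit interval and paths:)

    import java.io.File;
    import java.io.FileReader;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ThreadedIndexer {
        public static void main(String[] args) throws Exception {
            // One IndexWriter shared by every thread; IndexWriter is thread-safe.
            final IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/tmp/index")),      // placeholder
                    new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            final AtomicInteger count = new AtomicInteger();
            ExecutorService pool = Executors.newFixedThreadPool(4); // example size
            for (final File file : new File("/tmp/docs").listFiles()) { // placeholder
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            Document doc = new Document();
                            doc.add(new Field("contents", new FileReader(file)));
                            writer.addDocument(doc);
                            if (count.incrementAndGet() % 10000 == 0) {
                                writer.commit(); // periodic flush of pending adds
                            }
                        } catch (Exception e) {
                            System.err.println("Failed on " + file + ": " + e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
            writer.close(); // close() commits any remaining buffered docs
        }
    }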
>>> > > > > > > On Wed, Oct 20, 2010 at 12:11 PM, Sahin Buyrukbilen <
>>> > > > > > > sahin.buyrukbilen@gmail.com> wrote:
>>> > > > > > >
>>> > > > > > > > Hi all,
>>> > > > > > > >
>>> > > > > > > > I have to index about 4.5 million txt files. When I run my
>>> > > > > > > > indexing application through Eclipse, I get this error:
>>> > > > > > > > "Exception in thread "main" java.lang.OutOfMemoryError:
>>> > > > > > > > Java heap space"
>>> > > > > > > >
>>> > > > > > > > eclipse -vmargs -Xmx2000m -Xss8192k
>>> > > > > > > > eclipse -vmargs -Xms40M -Xmx2G
>>> > > > > > > >
>>> > > > > > > > I tried running Eclipse with the above memory parameters,
>>> > > > > > > > but still had the same error. My computer is an AMD X2
>>> > > > > > > > 64-bit 2GHz processor, Ubuntu 10.04 LTS 64-bit,
>>> > > > > > > > java-6-openjdk.
>>> > > > > > > >
>>> > > > > > > > Does anybody have a suggestion?
>>> > > > > > > >
>>> > > > > > > > thank you.
>>> > > > > > > > Sahin.
>>> > > > > > >
>>> > > > > > > --
>>> > > > > > > cheers,
>>> > > > > > > Johnbin Wang
>>> > > > >
>>> > > > > --
>>> > > > > cheers,
>>> > > > > Johnbin Wang

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org