From: Glen Newton
Date: Mon, 15 Aug 2011 19:08:44 -0400
Subject: Re: What kind of System Resources are required to index 625 million row table...???
To: java-user@lucene.apache.org

> We have increased the up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.

Please indicate whether you are speaking about the indexing part or the
searching part. There are times where it is unclear or ambiguous. :-)

The IBM Java VM has a limitation on the size of an NIO buffer. The
default is 64MB. This may be impacting your indexing and searching.
Consider setting this to a larger size (-XX:MaxDirectMemorySize=),
perhaps one similar to the RAMBuffer size in your IndexWriter (assuming
an NIOFSDirectory directory).
See https://www.ibm.com/developerworks/java/jdk/aix/j664/sdkguide.aix64.html

With regards to the machine, you didn't indicate how much swap you were using.

Heap: unless there are other things running, you could try up to 7GB of heap.

You should also consider using huge pages. PPC64 supports 4K (default)
and 16M pages (although this is more likely to speed things up than to
solve your heap problem...)
General info for AIX and PPC:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/large_page_ovw.htm

Java VM command line:
"-Xlp
AIX: Requests the JVM to allocate the Java heap (the heap from which
Java objects are allocated) with large (16 MB) pages, if a size is not
specified. If large pages are not available, the Java heap is
allocated with the next smaller page size that is supported by the
system. AIX requires special configuration to enable large pages. For
more information about configuring AIX support for large pages, see
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.prftungd/doc/prftungd/large_page_ovw.htm.
The SDK supports the use of large pages only to back the Java heap
shared memory segments. The JVM uses shmget() with the SHM_LGPG and
SHM_PIN flags to allocate large pages. The -Xlp option replaces the
environment variable IBM_JAVA_LARGE_PAGE_SIZE, which is now ignored if
set.
AIX, Linux, and Windows only: If a size is specified, the JVM attempts
to allocate the JIT code cache memory using pages of that size. If
unsuccessful, or if executable pages of that size are not supported,
the JIT code cache memory is allocated using the smallest available
executable page size."

General info on huge pages & Java, MySql, Linux, AIX:
http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
[my blog]

Consider some of the following Java VM command line options (some are
IBM VM specific):
- -Xgcpolicy:subpool
  "Uses an improved object allocation algorithm to achieve better
  performance when allocating objects on the heap. This option might
  improve performance on large SMP systems"
- -Xcompressedrefs
  "Use -Xcompressedrefs in any of these situations: When your Java
  applications does not need more than a 25 GB Java heap. When your
  application uses a lot of native memory and needs the JVM to run in
  a small footprint."
- -Xcompactexplicitgc
  "Enables full compaction each time System.gc() is called."
- -Xcompactgc
  "Compacts on all garbage collections (system and global)."
- -Xsoftrefthreshold
  "Sets the value used by the GC to determine the number of GCs after
  which a soft reference is cleared if its referent has not been
  marked. The default is 32, meaning that the soft reference is
  cleared after 32 * (percentage of free heap space) GC cycles where
  its referent was not marked."
  Reducing this will clear out soft references sooner. If any soft
  reference-based caching is being used, cache hits will go down but
  memory will be freed up faster. But this will not directly solve
  your OOM problem:
  "All soft references are guaranteed to have been cleared before the
  OutOfMemoryError is thrown. The default (no compaction option
  specified) makes the GC compact based on a series of triggers that
  attempt to compact only when it is beneficial to the future
  performance of the JVM."
- from https://www.ibm.com/developerworks/java/jdk/aix/j664/sdkguide.aix64.html

Very useful document on IBM Java VM: "Diagnostics Guide: IBM Developer
Kit and Runtime Environment, Java: Technology Edition, Version 6"
http://download.boulder.ibm.com/ibmdl/pub/software/dw/jdk/diagnosis/diag60.pdf
[page references refer to this document]

Relevant tips from this document on memory management:
- "Ensure that the heap never pages; that is, the maximum heap size
  must be able to be contained in physical memory." p.8
  Note that this is a performance tip, not an OOM tip.

You are using "-Xms4072m -Xmx4072m". The IBM documentation suggests
this is not a good choice:
"When you have established the maximum heap size that you need, you
might want to set the minimum heap size to the same value; for
example, -Xms512M -Xmx512M. However, using the same values is
typically not a good idea, because it delays the start of garbage
collection until the heap is full. Therefore, the first time that the
GC runs, the process can take longer.
Also, the heap is more likely to be fragmented and require a heap
compaction. You are advised to start your application with the minimum
heap size that your application requires. When the GC starts up, it
will run frequently and efficiently, because the heap is small." - p.43

AIX allows different malloc policies to be used in the underlying
system calls. Consider using the WATSON (!) malloc policy. See
p.134, 136 and
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm

Finally (or before doing all of this! :-) ), do some profiling, both
inside of Java and of the AIX native heap using svmon (see "Native
Heap Exhaustion", p.135).

-Glen Newton
http://zzzoot.blogspot.com/

On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony wrote:
> Thanks for the quick response.
>
> As to your questions:
>
>  Can you talk a bit more about what the search part of this is?
>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>  performance, memory, etc.
>
> We currently have a "exact match" search facility, which uses SQL.
> We would like to add "text search" capabilities...
> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
> A future enhancement would be to add a synonym list.
> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
> ...in the interest of full disclosure, the fields are:
>   - corp  - corporation that owns the document
>   - type  - document type
>   - tmst  - creation timestamp
>   - xmlid - xml namespace ID
>   - tag   - meta data qualifier
>   - data  - actual metadata  (example:  carton of red 3 ring binders )
>
>  Was this single threaded or multi-threaded?  How big was the resulting index?
>
> The search would be a threaded application.
>
>  How big was the resulting index?
>
> The index that was built was 70 GB in size.
>
>  Have you tried increasing the heap size?
>
> We have increased the up to 4 GB... on an 8 GB machine...
> That's why we'd like a methodology for calculating memory requirements
> to see if this application is even feasible.
>
> Thanks,
> -tony
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Monday, August 15, 2011 2:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
>
> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>
>> We are examining the possibility of using Lucene to provide Text Search
>> capabilities for a 625 million row DB2 table.
>>
>> The table has 6 fields, all which must be stored in the Lucene Index.
>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>
> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>
>>
>> We have written a test app on a development system (AIX 6.1),
>> and have successfully Indexed 625 million rows...
>> ...which took about 22 hours.
>
> Was this single threaded or multi-threaded?  How big was the resulting index?
>
>
>>
>> When writing the "search" application... we find a simple version works, however,
>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>
>
> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>
>
>> Before continuing our research, we'd like to find a way to determine
>> what system resources are required to run this kind of application...???
>
> I don't know that there is a straight forward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.  I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>
>> In other words, how do we calculate the memory needs...???
>>
>> Have others created a similar sized Index to run on a single "shared" server...???
>>
>
> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>
>>
>> Current Environment:
>>
>>       Lucene Version: 3.2
>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>                       (i.e. 64 bit Java 6)
>>       OS:             AIX 6.1
>>       Platform:       PPC  (IBM P520)
>>       cores:          2
>>       Memory:         8 GB
>>       jvm memory:     -Xms4072m -Xmx4072m
>>
>> Any guidance would be greatly appreciated.
>>
>> -tony
>
> --------------------------------------------
> Grant Ingersoll
> Lucid Imagination
> http://www.lucidimagination.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
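
On the quoted question of how to calculate the memory needs for
searching, here is a rough back-of-the-envelope sketch rather than a
definitive method. It assumes (this is not stated in the thread) that
sorting in Lucene 3.x goes through the FieldCache, which holds one
array slot per document for the sort field, and that a cached Filter
is held as a bit set with one bit per document. The class and method
names are hypothetical and only for illustration. The row count and
the -Xmx value are the ones given in the thread. Sorting on a String
field would need the 4-byte-per-document array plus the unique term
values, which this sketch does not count.

```java
// Hypothetical helper for rough arithmetic only; it does not call Lucene.
final class SortMemoryEstimate {

    /** Bytes for one per-document array: one fixed-width slot per document. */
    static long perDocArrayBytes(long maxDoc, int bytesPerSlot) {
        return maxDoc * bytesPerSlot;
    }

    /** Bytes for a bit set with one bit per document, rounded up to 64-bit words. */
    static long bitSetBytes(long maxDoc) {
        long words = (maxDoc + 63) / 64;
        return words * 8;
    }

    /** Converts a byte count to mebibytes for display. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long maxDoc = 625000000L;               // rows in the table, from the thread
        long heapBytes = 4072L * 1024L * 1024L; // -Xmx4072m, from the thread

        long longSort = perDocArrayBytes(maxDoc, 8); // assumed: sort on the 8 byte integer column
        long intSort = perDocArrayBytes(maxDoc, 4);  // assumed: sort on an int-sized value
        long filter = bitSetBytes(maxDoc);           // assumed: one cached filter

        System.out.printf("heap:          %.1f MiB%n", toMiB(heapBytes));
        System.out.printf("long sort:     %.1f MiB%n", toMiB(longSort));
        System.out.printf("int sort:      %.1f MiB%n", toMiB(intSort));
        System.out.printf("cached filter: %.1f MiB%n", toMiB(filter));
        System.out.println("long sort fits in heap: " + (longSort < heapBytes));
    }
}
```

Under these assumptions, a sort on the 8 byte integer column alone
needs 625,000,000 * 8 = 5,000,000,000 bytes, which is more than the
whole 4072 MB heap, so an out-of-memory error on sorting would be
expected whatever the GC settings are. A single cached filter is
comparatively small at 78,125,000 bytes.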