Subject: Re: What kind of System Resources are required to index 625 million row table...???
From: Mark Harwood
Date: Tue, 16 Aug 2011 18:11:58 +0100
To: java-user@lucene.apache.org

Check that "norms" are disabled on your fields, because they'll cost you 1 byte x NumberOfDocs x numberOfFieldsWithNormsEnabled.

On 16 Aug 2011, at 15:11, Bennett, Tony wrote:

> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, and we are
> storing it as "NumericField".
>
> We are sorting on timestamp, so that we can give our
> users the most "current" matches, since we are limiting
> the number of responses to about 1000. We are concerned
> that limiting the number of responses without sorting
> may give the user the "oldest" matches, which is not
> what they want.
>
> Your suggestion about reducing the granularity of the
> sort is interesting. We must "retain" the granularity
> of the "original" timestamp for Index maintenance purposes,
> but we could add another field, with a granularity of
> "date" instead of "date+time", which would be used for
> sorting only.
>
> -tony
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, August 16, 2011 5:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
> About your OOM. Grant asked a question that's pretty important:
> how many unique terms in the field(s) you sorted on? At a guess,
> you tried sorting on your timestamp and your timestamp has
> millisecond or less granularity, so there are 625M of them.
>
> Memory requirements for sorting grow as the number of *unique*
> terms. So you might be able to reduce the sorting requirements
> dramatically if you can use a coarser time granularity.
>
> And if you're storing your timestamp as a string type, that's
> even worse, there are 60 or so bytes of overhead for
> each string.... see NumericField....
>
> And if you can't reduce the granularity of the timestamp, there
> are some interesting techniques for reducing the memory
> requirements of timestamps that you sort on that we can discuss....
>
> Luke can answer these questions if you point it at your index,
> but it may take a while to examine your index, so be patient.
>
> Best
> Erick
>
> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony wrote:
>> Thanks for the quick response.
>>
>> As to your questions:
>>
>> Can you talk a bit more about what the search part of this is?
>> What are you hoping to get that you don't already have by adding in search? Choices for fields can have impact on
>> performance, memory, etc.
>>
>> We currently have an "exact match" search facility, which uses SQL.
>> We would like to add "text search" capabilities...
>> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
>> A future enhancement would be to add a synonym list.
>> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
>> ...in the interest of full disclosure, the fields are:
>> - corp - corporation that owns the document
>> - type - document type
>> - tmst - creation timestamp
>> - xmlid - xml namespace ID
>> - tag - meta data qualifier
>> - data - actual metadata (example: carton of red 3 ring binders)
>>
>>
>>
>> Was this single threaded or multi-threaded? How big was the resulting index?
>>
>> The search would be a threaded application.
>>
>> How big was the resulting index?
>>
>> The index that was built was 70 GB in size.
>>
>> Have you tried increasing the heap size?
>>
>> We have increased the heap up to 4 GB... on an 8 GB machine...
>> That's why we'd like a methodology for calculating memory requirements
>> to see if this application is even feasible.
>>
>> Thanks,
>> -tony
>>
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Monday, August 15, 2011 2:33 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>>
>>
>> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>>
>>> We are examining the possibility of using Lucene to provide Text Search
>>> capabilities for a 625 million row DB2 table.
>>>
>>> The table has 6 fields, all of which must be stored in the Lucene Index.
>>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>>
>> Can you talk a bit more about what the search part of this is? What are you hoping to get that you don't already have by adding in search? Choices for fields can have impact on performance, memory, etc.
>>
>>>
>>> We have written a test app on a development system (AIX 6.1),
>>> and have successfully Indexed 625 million rows...
>>> ...which took about 22 hours.
>>
>> Was this single threaded or multi-threaded? How big was the resulting index?
>>
>>
>>>
>>> When writing the "search" application... we find a simple version works, however,
>>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>>
>>
>> How many terms do you have in your index and in the field you are sorting/filtering on? Have you tried increasing the heap size?
>>
>>
>>> Before continuing our research, we'd like to find a way to determine
>>> what system resources are required to run this kind of application...???
>>
>> I don't know that there is a straightforward answer here with the information you've presented. It can depend on how you intend to search/sort/filter/facet, etc. General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary. I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>>
>>> In other words, how do we calculate the memory needs...???
>>>
>>> Have others created a similar sized Index to run on a single "shared" server...???
>>>
>>
>> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>>
>>>
>>> Current Environment:
>>>
>>> Lucene Version: 3.2
>>> Java Version: J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>> (i.e. 64 bit Java 6)
>>> OS: AIX 6.1
>>> Platform: PPC (IBM P520)
>>> cores: 2
>>> Memory: 8 GB
>>> jvm memory: -Xms4072m -Xmx4072m
>>>
>>> Any guidance would be greatly appreciated.
>>>
>>> -tony
>>
>> --------------------------------------------
>> Grant Ingersoll
>> Lucid Imagination
>> http://www.lucidimagination.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
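
For the "how do we calculate the memory needs" question in this thread, the rules of thumb can be turned into a back-of-envelope estimate. This is only a sketch, not a measurement. The class and method names (IndexMemoryEstimate, normsBytes, sortCacheBytes, toEpochDay) are made up for illustration and are not Lucene API. The norms formula is the one given at the top of this message. The sort-cache figure assumes that sorting on a numeric field keeps one fixed-width value per document on the heap, and toEpochDay assumes the timestamp is held as non-negative microseconds since 1970-01-01 UTC; neither assumption is confirmed in the thread.

```java
import java.util.Locale;

/** Back-of-envelope heap estimates for the index discussed in this thread. */
final class IndexMemoryEstimate {

    // Figure taken from the thread: 625 million rows, one document per row.
    static final long NUM_DOCS = 625000000L;

    // Microseconds in one day (24 * 60 * 60 * 1,000,000).
    static final long MICROS_PER_DAY = 86400000000L;

    /** Norms cost: 1 byte x number of docs x number of fields with norms enabled. */
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    /**
     * ASSUMPTION: a sort on a numeric field caches one fixed-width value
     * per document (8 bytes for a long, 4 bytes for an int).
     */
    static long sortCacheBytes(long numDocs, int bytesPerValue) {
        return numDocs * bytesPerValue;
    }

    /**
     * ASSUMPTION: the timestamp is non-negative microseconds since the epoch.
     * Returns a day-granularity key suitable for a separate "sort only" field.
     */
    static long toEpochDay(long epochMicros) {
        return epochMicros / MICROS_PER_DAY;
    }

    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Worst case for norms: all 6 fields keep them.
        System.out.printf(Locale.ROOT, "norms, 6 fields:        %.2f GiB%n",
                toGiB(normsBytes(NUM_DOCS, 6)));
        // Sorting on the full-resolution timestamp held as a long.
        System.out.printf(Locale.ROOT, "sort cache, 8-byte key: %.2f GiB%n",
                toGiB(sortCacheBytes(NUM_DOCS, 8)));
        // Sorting on a day-granularity value that fits in an int.
        System.out.printf(Locale.ROOT, "sort cache, 4-byte key: %.2f GiB%n",
                toGiB(sortCacheBytes(NUM_DOCS, 4)));
    }
}
```

With the thread's figures this comes to roughly 3.5 GiB of norms if all 6 fields keep them, and roughly 4.7 GiB for an 8-byte-per-document sort cache against about 2.3 GiB for a 4-byte day value. Those are the numbers to compare against the 4072m heap.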