Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1084)
Subject: Re: What kind of System Resources are required to index 625 million
 row table...???
From: Mark Harwood <markharw00d@yahoo.co.uk>
In-Reply-To: 
 <9E085D377965634187A85638358AE611018F663B92@DCXPRCL017.cnf.prod.cnf.com>
Date: Tue, 16 Aug 2011 18:11:58 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <7B161AC3-8452-482C-805E-1B0A1F29822C@yahoo.co.uk>
References: 
 <9E085D377965634187A85638358AE611018EEE4E04@DCXPRCL017.cnf.prod.cnf.com>
 <07B3DFA8-CE17-437F-8969-66233484CAD5@apache.org>
 <9E085D377965634187A85638358AE611018F6634E6@DCXPRCL017.cnf.prod.cnf.com>
 <CAN4YXvfz3aETrLX+Y0vh+nfsih_O-F_6=a23Hgymo_M+LPkYWQ@mail.gmail.com>
 <9E085D377965634187A85638358AE611018F663B92@DCXPRCL017.cnf.prod.cnf.com>
To: java-user@lucene.apache.org

Check that "norms" are disabled on your fields, because they will cost you
1 byte x numberOfDocs x numberOfFieldsWithNormsEnabled of heap.
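
To put numbers on that formula for this index, here is a rough sketch. The 625 million document count comes from the thread; how many of the six fields actually have norms enabled is not stated, so the loop just shows each possibility. The class and method names are made up for illustration.

```java
// Rough heap estimate for norms: 1 byte per document per field with norms enabled.
// NormsEstimate and normsBytes are hypothetical names, not part of Lucene.
final class NormsEstimate {

    // numDocs x fieldsWithNorms x 1 byte
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 625_000_000L; // row count quoted in the thread
        // ASSUMPTION: anywhere from 1 to 6 fields may have norms enabled.
        for (int fields = 1; fields <= 6; fields++) {
            long bytes = normsBytes(docs, fields);
            System.out.printf("%d field(s) with norms: %d bytes (~%.2f GiB)%n",
                    fields, bytes, bytes / (1024.0 * 1024 * 1024));
        }
    }
}
```

In Lucene 3.x, norms can be turned off per field with setOmitNorms(true) or one of the *_NO_NORMS index options.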


On 16 Aug 2011, at 15:11, Bennett, Tony wrote:

> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, and we are
> storing it as "NumericField".
>
> We are sorting on timestamp so that we can give our
> users the most "current" matches, since we are limiting
> the number of responses to about 1000.  We are concerned
> that limiting the number of responses without sorting
> may give the user the "oldest" matches, which is not
> what they want.
>
> Your suggestion about reducing the granularity of the
> sort is interesting.  We must "retain" the granularity
> of the "original" timestamp for Index maintenance purposes,
> but we could add another field, with a granularity of
> "date" instead of "date+time", which would be used for
> sorting only.
>
> -tony
>
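
The date-only sort field described above could be derived along these lines. This is a minimal sketch, assuming the 8 byte timestamp holds microseconds since the Unix epoch in UTC (the thread does not say what the epoch or time zone is). The class and method names are hypothetical, and the Lucene indexing call itself is left out.

```java
// Derives a day-granularity sort key from a microsecond timestamp.
// ASSUMPTION: the timestamp is microseconds since 1970-01-01T00:00:00Z.
final class CoarseTimestamp {

    static final long MICROS_PER_DAY = 86_400_000_000L; // 24 * 60 * 60 * 1_000_000

    // floorDiv keeps pre-1970 (negative) timestamps on the correct day.
    static int toEpochDay(long epochMicros) {
        return Math.toIntExact(Math.floorDiv(epochMicros, MICROS_PER_DAY));
    }

    public static void main(String[] args) {
        long micros = 1_313_514_718_000_000L; // an arbitrary example value
        // Keep the original value in its own field for index maintenance,
        // and index the day number in a second field used only for sorting.
        System.out.println("sort key (epoch day): " + toEpochDay(micros));
    }
}
```

Every row from the same day then shares one sort value, so matches within a day come back in no particular order relative to each other.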
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Tuesday, August 16, 2011 5:54 AM
> To: java-user@lucene.apache.org
> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>
> About your OOM: Grant asked a question that's pretty important,
> how many unique terms are in the field(s) you sorted on? At a guess,
> you tried sorting on your timestamp and your timestamp has
> millisecond or finer granularity, so there are 625M of them.
>
> Memory requirements for sorting grow with the number of *unique*
> terms. So you might be able to reduce the sorting requirements
> dramatically if you can use a coarser time granularity.
>
> And if you're storing your timestamp as a string type, that's
> even worse: there are 60 or so bytes of overhead for
> each string.... see NumericField....
>
> And if you can't reduce the granularity of the timestamp, there
> are some interesting techniques for reducing the memory
> requirements of timestamps that you sort on that we can discuss....
>
> Luke can answer these questions if you point it at your index,
> but it may take a while to examine your index, so be patient.
>
> Best
> Erick
>
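
As a back-of-envelope check on the sort memory for a numeric field, here is a sketch. It assumes the sort caches one primitive value per document in an array; that is an assumption about the cache layout, not something stated in this thread. The document count and the 4072 MB heap are the figures quoted in the thread, and the class and method names are hypothetical.

```java
// Estimates the heap needed to cache one sort value per document.
// ASSUMPTION: one primitive per document is held in memory for sorting.
final class SortCacheEstimate {

    static long cacheBytes(long numDocs, int bytesPerValue) {
        return numDocs * bytesPerValue;
    }

    // Compares against the whole heap; a real budget must also leave room
    // for everything else the application keeps in memory.
    static boolean fitsInHeap(long cacheBytes, long heapMiB) {
        return cacheBytes < heapMiB * 1024L * 1024L;
    }

    public static void main(String[] args) {
        long docs = 625_000_000L; // rows indexed, per the thread
        long heapMiB = 4072;      // from -Xms4072m -Xmx4072m

        long asLong = cacheBytes(docs, Long.BYTES);   // full microsecond timestamp
        long asInt = cacheBytes(docs, Integer.BYTES); // e.g. a day number
        System.out.println("long per doc: " + asLong + " bytes, fits: " + fitsInHeap(asLong, heapMiB));
        System.out.println("int per doc:  " + asInt + " bytes, fits: " + fitsInHeap(asInt, heapMiB));
    }
}
```

Under that assumption the 8 byte values alone come to more than the configured heap, which would be consistent with the OOM, while a 4 byte sort key would fit but leave limited headroom.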
> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony <Bennett.Tony@con-way.com> wrote:
>> Thanks for the quick response.
>>
>> As to your questions:
>>
>>  Can you talk a bit more about what the search part of this is?
>>  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on
>>  performance, memory, etc.
>>
>> We currently have an "exact match" search facility, which uses SQL.
>> We would like to add "text search" capabilities...
>> ...initially, having the ability to search the 229 character field for a given word, or phrase, instead of an exact match.
>> A future enhancement would be to add a synonym list.
>> As to "field choice", yes, it is possible that all fields would be involved in the "search"...
>> ...in the interest of full disclosure, the fields are:
>>   - corp  - corporation that owns the document
>>   - type  - document type
>>   - tmst  - creation timestamp
>>   - xmlid - xml namespace ID
>>   - tag   - meta data qualifier
>>   - data  - actual metadata  (example:  carton of red 3 ring binders)
>>
>>
>>
>>  Was this single threaded or multi-threaded?
>>
>> The search would be a threaded application.
>>
>>  How big was the resulting index?
>>
>> The index that was built was 70 GB in size.
>>
>>  Have you tried increasing the heap size?
>>
>> We have increased the heap up to 4 GB... on an 8 GB machine...
>> That's why we'd like a methodology for calculating memory requirements
>> to see if this application is even feasible.
>>
>> Thanks,
>> -tony
>>
>>
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Monday, August 15, 2011 2:33 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625 million row table...???
>>
>>
>> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>>
>>> We are examining the possibility of using Lucene to provide Text Search
>>> capabilities for a 625 million row DB2 table.
>>>
>>> The table has 6 fields, all of which must be stored in the Lucene Index.
>>> The largest column is 229 characters, the others are 8, 12, 30, and 1....
>>> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).
>>
>> Can you talk a bit more about what the search part of this is?  What are you hoping to get that you don't already have by adding in search?  Choices for fields can have impact on performance, memory, etc.
>>
>>>
>>> We have written a test app on a development system (AIX 6.1),
>>> and have successfully Indexed 625 million rows...
>>> ...which took about 22 hours.
>>
>> Was this single threaded or multi-threaded?  How big was the resulting index?
>>
>>
>>>
>>> When writing the "search" application... we find a simple version works, however,
>>> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
>>>
>>
>> How many terms do you have in your index and in the field you are sorting/filtering on?  Have you tried increasing the heap size?
>>
>>
>>> Before continuing our research, we'd like to find a way to determine
>>> what system resources are required to run this kind of application...???
>>
>> I don't know that there is a straightforward answer here with the information you've presented.  It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb is that when you get over 100M documents, you need to shard, but you also have pretty small documents so your mileage may vary.   I've seen indexes in your range on a single machine (for small docs) with low search volumes, but that isn't to say it will work for you without more insight into your documents, etc.
>>
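
Taking the 100M documents per shard rule of thumb above at face value, the shard count for this table works out as in the sketch below. The per-shard figure is only that rule of thumb, not a hard limit, and the class and method names are hypothetical.

```java
// Number of shards needed if each shard holds at most docsPerShard documents.
final class ShardEstimate {

    // Integer ceiling division: rounds up so no shard exceeds the limit.
    static long shardCount(long totalDocs, long docsPerShard) {
        return (totalDocs + docsPerShard - 1) / docsPerShard;
    }

    public static void main(String[] args) {
        long totalDocs = 625_000_000L;    // rows in the DB2 table, per the thread
        long docsPerShard = 100_000_000L; // ASSUMPTION: the rule-of-thumb ceiling
        System.out.println("shards: " + shardCount(totalDocs, docsPerShard));
    }
}
```

With small documents and low search volume the practical per-shard ceiling may well be higher, which would lower the count.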
>>> In other words, how do we calculate the memory needs...???
>>>
>>> Have others created a similar sized Index to run on a single "shared" server...???
>>>
>>
>> Off the cuff, I think you are pushing the capabilities of doing this on a single machine, especially the one you have spec'd out below.
>>
>>>
>>> Current Environment:
>>>
>>>       Lucene Version: 3.2
>>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>>>                        (i.e. 64 bit Java 6)
>>>       OS:             AIX 6.1
>>>       Platform:       PPC  (IBM P520)
>>>       cores:          2
>>>       Memory:         8 GB
>>>       jvm memory:     -Xms4072m -Xmx4072m
>>>
>>> Any guidance would be greatly appreciated.
>>>
>>> -tony
>>
>> --------------------------------------------
>> Grant Ingersoll
>> Lucid Imagination
>> http://www.lucidimagination.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org

