lucene-java-user mailing list archives

From "Bennett, Tony" <Bennett.T...@con-way.com>
Subject RE: What kind of System Resources are required to index 625 million row table...???
Date Mon, 15 Aug 2011 21:55:35 GMT
Thanks for the quick response.

As to your questions:

  Can you talk a bit more about what the search part of this is?
  What are you hoping to get that you don't already have by adding in search?
  Choices for fields can have impact on performance, memory, etc.

We currently have an "exact match" search facility, which uses SQL.
We would like to add "text search" capabilities...
...initially, the ability to search the 229-character field for a given word or phrase,
instead of an exact match.
A future enhancement would be to add a synonym list.
As to "field choice", yes, it is possible that all fields would be involved in the "search"...
...in the interest of full disclosure, the fields are:
   - corp  - corporation that owns the document
   - type  - document type
   - tmst  - creation timestamp
   - xmlid - xml namespace ID
   - tag   - meta data qualifier
   - data  - actual metadata  (example:  carton of red 3 ring binders )



  Was this single threaded or multi-threaded?

The search would be a threaded application.

  How big was the resulting index?

The index that was built was 70 GB in size.
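
To put those figures in perspective, here is a small sketch that derives the average index footprint per row and the indexing rate from the numbers in this thread. It assumes "70 GB" means 70 * 10^9 bytes and treats "about 22 hours" as exactly 22; the class and method names are made up for illustration.

```java
/** Rough index statistics; names are illustrative only. */
final class IndexStats {

    /** Average on-disk index bytes per indexed row (integer division). */
    static long bytesPerDoc(long indexBytes, long docs) {
        return indexBytes / docs;
    }

    /** Average rows indexed per second over a run of the given length. */
    static long docsPerSecond(long docs, long hours) {
        return docs / (hours * 3600L);
    }

    public static void main(String[] args) {
        long docs = 625000000L;          // rows in the DB2 table
        long indexBytes = 70000000000L;  // assumption: 70 GB = 70 * 10^9 bytes
        System.out.println("bytes per doc: " + bytesPerDoc(indexBytes, docs));
        System.out.println("docs per second: " + docsPerSecond(docs, 22L));
    }
}
```

Under those assumptions the index averages 112 bytes per row, and the build ran at roughly 7,900 rows per second.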

  Have you tried increasing the heap size?

We have increased the heap up to 4 GB... on an 8 GB machine...
That's why we'd like a methodology for calculating memory requirements
to see if this application is even feasible.
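
As a rough starting point for such a calculation, the sketch below estimates heap cost from the document count alone. It assumes Lucene 3.x behavior where sorting on a field fills a FieldCache array with one entry per document (8 bytes for a long, 4 bytes for an int or a string ordinal), and where a bit-set-backed filter holds one bit per document. The class and method names are hypothetical, and the estimate ignores the term dictionary, the unique string values of a sorted text field, and per-query overhead.

```java
/** Back-of-the-envelope heap estimates; these names are not Lucene API. */
final class HeapEstimate {

    /** Sort on a long field: assumes one 8-byte array entry per document. */
    static long longSortBytes(long maxDoc) {
        return maxDoc * 8L;
    }

    /** Sort on an int field or on string ordinals: assumes 4 bytes per document. */
    static long intSortBytes(long maxDoc) {
        return maxDoc * 4L;
    }

    /** Bit-set filter: one bit per document, rounded up to whole 64-bit words. */
    static long bitSetFilterBytes(long maxDoc) {
        return ((maxDoc + 63L) / 64L) * 8L;
    }

    public static void main(String[] args) {
        long maxDoc = 625000000L;              // documents in the index
        long heap = 4072L * 1024L * 1024L;     // -Xmx4072m
        System.out.println("heap bytes:        " + heap);
        System.out.println("long sort bytes:   " + longSortBytes(maxDoc));
        System.out.println("int sort bytes:    " + intSortBytes(maxDoc));
        System.out.println("filter bytes each: " + bitSetFilterBytes(maxDoc));
    }
}
```

For 625 million documents this gives 5,000,000,000 bytes for a single long sort array, which is already larger than the 4072 MB heap (4,269,801,472 bytes). An int-sized array needs 2,500,000,000 bytes, and each cached bit-set filter about 78 MB.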

Thanks,
-tony 


-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Monday, August 15, 2011 2:33 PM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million row table...???


On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:

> We are examining the possibility of using Lucene to provide Text Search 
> capabilities for a 625 million row DB2 table.
> 
> The table has 6 fields, all of which must be stored in the Lucene Index.  
> The largest column is 229 characters, the others are 8, 12, 30, and 1....
> ...with an additional column that is an 8 byte integer (i.e. a 'C' long long).

Can you talk a bit more about what the search part of this is?  What are you hoping to get
that you don't already have by adding in search?  Choices for fields can have impact on performance,
memory, etc.

> 
> We have written a test app on a development system (AIX 6.1),
> and have successfully Indexed 625 million rows...
> ...which took about 22 hours.

Was this single threaded or multi-threaded?  How big was the resulting index?


> 
> When writing the "search" application... we find a simple version works, however,
> if we add a Filter or a "sort" to it... we get an "out of memory" exception.
> 

How many terms do you have in your index and in the field you are sorting/filtering on?  Have
you tried increasing the heap size?


> Before continuing our research, we'd like to find a way to determine 
> what system resources are required to run this kind of application...???

I don't know that there is a straightforward answer here with the information you've presented.
It can depend on how you intend to search/sort/filter/facet, etc.  General rule of thumb
is that when you get over 100M documents, you need to shard, but you also have pretty small
documents so your mileage may vary.   I've seen indexes in your range on a single machine
(for small docs) with low search volumes, but that isn't to say it will work for you without
more insight into your documents, etc.
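
Applied to the numbers in this thread, that rule of thumb works out as in the sketch below. The 100M figure is only the rough guideline quoted above, not a hard limit, and the class and method names are hypothetical.

```java
/** Shard-count arithmetic for a documents-per-shard guideline; illustrative names. */
final class ShardEstimate {

    /** Smallest number of shards such that none holds more than docsPerShard documents. */
    static int shardsNeeded(long totalDocs, long docsPerShard) {
        if (totalDocs < 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and docsPerShard > 0");
        }
        // Ceiling division without floating point.
        return (int) ((totalDocs + docsPerShard - 1) / docsPerShard);
    }

    public static void main(String[] args) {
        // 625 million rows against a 100 million documents-per-shard guideline.
        System.out.println(shardsNeeded(625000000L, 100000000L));
    }
}
```

With 625 million rows and a 100 million per-shard guideline this comes to 7 shards. Small documents may allow more per shard, as noted above.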

> In other words, how do we calculate the memory needs...???
> 
> Have others created a similar sized Index to run on a single "shared" server...???
> 

Off the cuff, I think you are pushing the capabilities of doing this on a single machine,
especially the one you have spec'd out below.

> 
> Current Environment:
> 
> 	Lucene Version:	3.2
> 	Java Version:	J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build jvmap6460-20090215_29883
>                        (i.e. 64 bit Java 6)
> 	OS:			AIX 6.1
> 	Platform:		PPC  (IBM P520)
> 	cores:		2
> 	Memory:		8 GB
> 	jvm memory:		-Xms4072m -Xmx4072m
> 
> Any guidance would be greatly appreciated.
> 
> -tony

--------------------------------------------
Grant Ingersoll
Lucid Imagination
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



