lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: What kind of System Resources are required to index 625 million row table...???
Date Thu, 18 Aug 2011 12:48:39 GMT
Uwe:

Thanks, I guess my mind is still stuck on the really old versions of Solr!

Quick clarification, which part "won't work"? I'm assuming it's the splitting
up of the dates into year, month, and date. Or are you talking about
indexing the dates with coarser granularity? Or both?

Thanks again,
Erick

On Tue, Aug 16, 2011 at 3:46 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi Erick,
>
> This is only true if you have string fields. Once you have the long values
> in FieldCache they will always use exactly the same space. Having more
> fields will, in contrast, blow up your IndexReader, as it needs much more
> RAM to hold an even larger terms index spanning the additional fields.
>
> The user said he is using NumericField, so uniqueness is irrelevant
> here; strings are never used. To make the terms index smaller and reduce RAM
> usage, the only suggestion I have is to use a precisionStep of
> Integer.MAX_VALUE for all NumericFields that are solely used for sorting. The
> additional terms are only needed for NumericRangeQuery.
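The precisionStep effect can be sanity-checked with a little arithmetic: under Lucene's trie encoding, each long value is indexed as roughly ceil(64/precisionStep) prefix terms (a simplification of what NumericUtils actually emits). A minimal sketch, assuming that model:

```java
// Rough per-value term count for a 64-bit NumericField under Lucene's
// trie encoding: one term per precisionStep-sized prefix of the value.
public class TrieTermCount {
    // ceil(64 / precisionStep), clamped to the range [1, 64]
    static int termsPerLongValue(int precisionStep) {
        long steps = (64 + (long) precisionStep - 1) / precisionStep;
        return (int) Math.max(1, Math.min(64, steps));
    }

    public static void main(String[] args) {
        System.out.println("step 4         : " + termsPerLongValue(4) + " terms/value");
        System.out.println("step MAX_VALUE : " + termsPerLongValue(Integer.MAX_VALUE) + " terms/value");
    }
}
```

So a sort-only field at Integer.MAX_VALUE carries one term per value instead of sixteen at a precisionStep of 4, which is what shrinks the terms index.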
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Tuesday, August 16, 2011 8:14 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: What kind of System Resources are required to index 625
>> million row table...???
>>
>> Using a new field with coarser granularity will work fine; this is a
>> common thing to do for this kind of issue.
>>
>> Lucene is trying to load 625M longs into memory, in addition to any other
>> stuff. Ouch!
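That "ouch" is easy to quantify: the FieldCache entry for a long sort field is one 8-byte long per document, before any index structures or query-time allocations. A back-of-the-envelope sketch:

```java
// Back-of-the-envelope FieldCache cost of sorting N docs on a long field:
// one 8-byte long per document, before any other heap usage.
public class FieldCacheCost {
    static long bytesForLongField(long docCount) {
        return docCount * 8L;  // FieldCache holds roughly a long[maxDoc]
    }

    public static void main(String[] args) {
        long bytes = bytesForLongField(625_000_000L);
        System.out.printf("%.1f GB%n", bytes / 1e9);  // 5.0 GB -- more than the 4 GB heap
    }
}
```

Which already exceeds the 4 GB heap on its own, consistent with the OOM reported below.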
>>
>> If you want to get really clever, you can index several fields, say year,
>> month, and day for each date. The total number of unique values that need
>> to be sorted then is (the number of years in your corpus + 12 + 31). Very
>> few unique values in all. And you can extend this to hours, minutes,
>> seconds and milliseconds, which adds a piddling 1,144 unique terms. Of
>> course all your sorts have to be re-written to take all these fields into
>> account, but it's do-able.
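The decomposition described above can be sketched with plain java.util.Calendar; each component would then be stored in its own numeric field, and the query-time sort built from one SortField per component (field layout here is illustrative, not the poster's actual schema):

```java
import java.util.Calendar;
import java.util.TimeZone;

// Split an epoch-microsecond timestamp into year/month/day components.
// Each component has very few unique values, so sorting on it is cheap;
// a multi-field sort (year, then month, then day) reproduces date order.
public class DateComponents {
    static int[] yearMonthDay(long epochMicros) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(epochMicros / 1000L);  // micros -> millis
        return new int[] {
            cal.get(Calendar.YEAR),
            cal.get(Calendar.MONTH) + 1,           // Calendar months are 0-based
            cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        int[] ymd = yearMonthDay(0L);              // the epoch: 1970-01-01
        System.out.println(ymd[0] + "-" + ymd[1] + "-" + ymd[2]);
    }
}
```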
>>
>> Warning: This has some gotchas, but.... There is one other thing you can
>> try, that's sorting by INDEXORDER. This would only work for you if you
>> index the records in date order in the first place, so the first document
>> you indexed was the oldest, the second the next-oldest, etc. This won't
>> work if you update existing documents, since updates are really delete/add
>> and would mess this ordering up. But if the docs don't change, this might
>> do.
>>
>> Best
>> Erick
>>
>>
>> On Tue, Aug 16, 2011 at 10:11 AM, Bennett, Tony
>> <Bennett.Tony@con-way.com> wrote:
>> > Thank you for your response.
>> >
>> > You are correct, we are sorting on timestamp.
>> > Timestamp has microsecond granularity, and we are storing it as
>> > "NumericField".
>> >
>> > We are sorting on timestamp so that we can give our users the most
>> > "current" matches, since we are limiting the number of responses to
>> > about 1000.  We are concerned that limiting the number of responses
>> > without sorting may give the user the "oldest" matches, which is not
>> > what they want.
>> >
>> > Your suggestion about reducing the granularity of the sort is
>> > interesting.  We must "retain" the granularity of the "original"
>> > timestamp for Index maintenance purposes, but we could add another
>> > field, with a granularity of "date" instead of "date+time", which
>> > would be used for sorting only.
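That derived sort-only field can be sketched as a simple truncation, assuming the timestamp is held as epoch microseconds (the field name below is made up for illustration):

```java
// Derive a day-granularity sort key from a full-precision epoch-microsecond
// timestamp. The original field keeps microsecond granularity for index
// maintenance; this coarser value goes into a second, sort-only field
// (e.g. "tmst_day", a hypothetical name). Even 20 years of data yields
// only ~7,300 unique sort terms instead of up to 625M.
public class DayKey {
    static final long MICROS_PER_DAY = 24L * 60L * 60L * 1_000_000L;

    static long daysSinceEpoch(long epochMicros) {
        return epochMicros / MICROS_PER_DAY;
    }

    public static void main(String[] args) {
        System.out.println(daysSinceEpoch(MICROS_PER_DAY));  // prints 1
    }
}
```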
>> >
>> > -tony
>> >
>> > -----Original Message-----
>> > From: Erick Erickson [mailto:erickerickson@gmail.com]
>> > Sent: Tuesday, August 16, 2011 5:54 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: What kind of System Resources are required to index 625
>> > million row table...???
>> >
>> > About your OOM. Grant asked a question that's pretty important: how
>> > many unique terms are in the field(s) you sorted on? At a guess, you
>> > tried sorting on your timestamp, and your timestamp has millisecond or
>> > less granularity, so there are 625M of them.
>> >
>> > Memory requirements for sorting grow as the number of *unique* terms.
>> > So you might be able to reduce the sorting requirements dramatically
>> > if you can use a coarser time granularity.
>> >
>> > And if you're storing your timestamp as a string type, that's even
>> > worse, there are 60 or so bytes of overhead for each string.... see
>> > NumericField....
>> >
>> > And if you can't reduce the granularity of the timestamp, there are
>> > some interesting techniques for reducing the memory requirements of
>> > timestamps that you sort on that we can discuss....
>> >
>> > Luke can answer these questions if you point it at your index, but it
>> > may take a while to examine your index, so be patient.
>> >
>> > Best
>> > Erick
>> >
>> >> On Mon, Aug 15, 2011 at 5:55 PM, Bennett, Tony
>> >> <Bennett.Tony@con-way.com> wrote:
>> >> Thanks for the quick response.
>> >>
>> >> As to your questions:
>> >>
>> >>  Can you talk a bit more about what the search part of this is?
>> >>  What are you hoping to get that you don't already have by adding in
>> >> search?  Choices for fields can have impact on
>> >>  performance, memory, etc.
>> >>
>> >> We currently have an "exact match" search facility, which uses SQL.
>> >> We would like to add "text search" capabilities...
>> >> ...initially, having the ability to search the 229-character field
>> >> for a given word or phrase, instead of an exact match.
>> >> A future enhancement would be to add a synonym list.
>> >> As to "field choice", yes, it is possible that all fields would be
>> >> involved in the "search"...
>> >> ...in the interest of full disclosure, the fields are:
>> >>   - corp  - corporation that owns the document
>> >>   - type  - document type
>> >>   - tmst  - creation timestamp
>> >>   - xmlid - xml namespace ID
>> >>   - tag   - meta data qualifier
>> >>   - data  - actual metadata  (example:  carton of red 3 ring binders)
>> >>
>> >>
>> >>
>> >>  Was this single threaded or multi-threaded?  How big was the resulting
>> >>  index?
>> >>
>> >> The search would be a threaded application.
>> >>
>> >>  How big was the resulting index?
>> >>
>> >> The index that was built was 70 GB in size.
>> >>
>> >>  Have you tried increasing the heap size?
>> >>
>> >> We have increased the heap up to 4 GB... on an 8 GB machine...
>> >> That's why we'd like a methodology for calculating memory
>> >> requirements to see if this application is even feasible.
>> >>
>> >> Thanks,
>> >> -tony
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> >> Sent: Monday, August 15, 2011 2:33 PM
>> >> To: java-user@lucene.apache.org
>> >> Subject: Re: What kind of System Resources are required to index 625
>> >> million row table...???
>> >>
>> >>
>> >> On Aug 15, 2011, at 2:39 PM, Bennett, Tony wrote:
>> >>
>> >>> We are examining the possibility of using Lucene to provide Text
>> >>> Search capabilities for a 625 million row DB2 table.
>> >>>
>> >>> The table has 6 fields, all of which must be stored in the Lucene Index.
>> >>> The largest column is 229 characters; the others are 8, 12, 30, and 1...
>> >>> ...with an additional column that is an 8 byte integer (i.e. a 'C'
>> >>> long long).
>> >>
>> >> Can you talk a bit more about what the search part of this is?  What
>> >> are you hoping to get that you don't already have by adding in search?
>> >> Choices for fields can have impact on performance, memory, etc.
>> >>
>> >>>
>> >>> We have written a test app on a development system (AIX 6.1), and
>> >>> have successfully Indexed 625 million rows...
>> >>> ...which took about 22 hours.
>> >>
>> >> Was this single threaded or multi-threaded?  How big was the resulting
>> >> index?
>> >>
>> >>
>> >>>
>> >>> When writing the "search" application... we find a simple version
>> >>> works, however, if we add a Filter or a "sort" to it... we get an "out
>> >>> of memory" exception.
>> >>>
>> >>
>> >> How many terms do you have in your index and in the field you are
>> >> sorting/filtering on?  Have you tried increasing the heap size?
>> >>
>> >>
>> >>> Before continuing our research, we'd like to find a way to determine
>> >>> what system resources are required to run this kind of application...???
>> >>
>> >> I don't know that there is a straightforward answer here with the
>> >> information you've presented.  It can depend on how you intend to
>> >> search/sort/filter/facet, etc.  A general rule of thumb is that when you
>> >> get over 100M documents, you need to shard, but you also have pretty
>> >> small documents, so your mileage may vary.  I've seen indexes in your
>> >> range on a single machine (for small docs) with low search volumes, but
>> >> that isn't to say it will work for you without more insight into your
>> >> documents, etc.
>> >>
>> >>> In other words, how do we calculate the memory needs...???
>> >>>
>> >>> Have others created a similar sized Index to run on a single "shared"
>> server...???
>> >>>
>> >>
>> >> Off the cuff, I think you are pushing the capabilities of doing this on
>> >> a single machine, especially the one you have spec'd out below.
>> >>
>> >>>
>> >>> Current Environment:
>> >>>
>> >>>       Lucene Version: 3.2
>> >>>       Java Version:   J2RE 6.0 IBM J9 2.4 AIX ppc64-64 build
>> >>> jvmap6460-20090215_29883
>> >>>                       (i.e. 64 bit Java 6)
>> >>>       OS:             AIX 6.1
>> >>>       Platform:       PPC  (IBM P520)
>> >>>       cores:          2
>> >>>       Memory:         8 GB
>> >>>       jvm memory:     -Xms4072m -Xmx4072m
>> >>>
>> >>> Any guidance would be greatly appreciated.
>> >>>
>> >>> -tony
>> >>
>> >> --------------------------------------------
>> >> Grant Ingersoll
>> >> Lucid Imagination
>> >> http://www.lucidimagination.com
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
>


