lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Right memory for search application
Date Wed, 28 Apr 2010 14:57:17 GMT
Quick reply to question (4). Not quite right, you're not gaining anything,
you still have a yymmddHHMMSS field that will consume all the memory
when you sort by it.

Remember that the controlling part here is the number of unique values. So
think about two fields, yymmdd and HHMMSS. Use the HHMMSS as the
secondary sort and yymmdd as the primary. This will sort correctly since
any time HHMMSS is used, the yymmdd is guaranteed be equal.

And you can extend this ad nauseum. For instance, you could use 6
fields, yy, mm, dd, HH, MM, SS and have a very small number of
unique values in each using really tiny amounts of memory to sort down
to the second in this case.

Best
Erick

On Wed, Apr 28, 2010 at 2:24 AM, Samarendra Pratap <samarzone@gmail.com>wrote:

> I have got a lot of valuable information in this thread so far.
> Thanks to all.
>
> In my last mail I mentioned only two fields because others' usage was
> negligible and I thought they are not important. But now after *Toke
> *explained
> the formulae, I think sorting on those fields would also be consuming a
> huge
> part of memory.There are 2 other sorting fields; one of which is used in
> both ascending/descending sorting.
>
> Within next couple of days (or may be a week) I'll be
>
> 1. profiling my application,
> 2. analyzing and tuning GC options
>
>
>
> However, I have a few more curiosities -
>
> 1. Tom wrote:
>
> *Have you checked that your machine is correctly identified as a server*
> *and has optimized GC settings?*
>
> *I did not understand the meaning of "correctly identified as a server" Can
> you please help me understand?*
> *
> *
> 2. *Should I change the type of fields?*
> ** As I said in my first mail that I have 56 fields in my index, most of
> them contain a numeric value or one of system defined values (e.g. gender
> field can contain only "male", "female", or "unknown"). There are only 7
> fields which are indexed with user defined values.
> All the fields are created with *Field*
> (String name, String value, Field.Store store, Field.Index index)
> It would be creating all the fields as normal string fields. Is it
> *always*a good idea to use specific classes (NumericField, DateTime
> etc.). We do not
> have space problem if that matters.
>
> 3. *Is there any advice on number of fields?*
> *Somewhere on the net I read that instead of keeping different type of
> values in different fields, (e.g. field1:value1, field2:value2,...) one
> should practice keeping different values in single field (e.g.
> field:field1_value1,
> field:field2_value2,...). But I could not confirm it from anywhere else.
> Any
> comments?*
>
> 4. Ian wrote:
>
> *Sorting by score down to the second will use a lot of memory.  Can you*
> *make it less granular?*
>
> Is it less painful sorting on two fields; first on yymmdd and then on
> yymmddHHMMSS than sorting just on latter? (Naturally it should use second
> field, only where required but technically ...?)
>
>
> Thanks again for the invaluable support I am getting from here.
>
> - Samar
>
> On Wed, Apr 28, 2010 at 9:12 AM, Lance Norskog <goksron@gmail.com> wrote:
>
> > Solr's timestamp representation (TrieDateField) is tuned for space and
> > speed. It has a compressed representation, and sorts with far less
> > space than Strings.
> >
> > Also you get something called a date facet, which lets you bucketize
> > facet searches by time block.
> >
> > On Tue, Apr 27, 2010 at 1:02 PM, Toke Eskildsen <te@statsbiblioteket.dk>
> > wrote:
> > > Samarendra Pratap [samarzone@gmail.com] wrote:
> > >> 1. Our default option is sort by score, however almost 8% of searches
> > use
> > >> sorting on a field (yyyymmddHHMMSS). This field is indexed as string
> > (not as
> > >> NumericField or DateField).
> > >
> > > Guessing that the timestamp is practically unique for each document,
> > sorting by String takes up a bit more than
> > > 18M * (40 bytes + 2 * "yyyymmddHHMMSS".length() bytes) ~= 1.2 GB of RAM
> > as the Strings are cached. Coupled with the normal overhead of just
> opening
> > an index of your size (500MB by your measurements?), I would have guessed
> > that 3600MB would definitely be enough to open the index and do sorted
> > searches.
> > >
> > > I realize that fiddling with production servers is dangerous, but
> > connecting with JConsole and forcing a garbage collection might be
> > acceptable? That should enable you to determine whether you're leaking
> > memory or if it's just the JVM being greedy. I'd guess you leaking
> though,
> > as HotSpot does not normally allocate up to the limit if it does not need
> > to.
> > >
> > > Anyway, changing to one of the optimized fields for sorting dates
> should
> > shave 1 GB off the memory requirement, so I'll recommend doing that no
> > matter what the main cause of your memory problems is.
> > >
> > > Regards,
> > > Toke Eskildsen
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> --
> Regards,
> Samar
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message