lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Samarendra Pratap <samarz...@gmail.com>
Subject Re: Right memory for search application
Date Thu, 29 Apr 2010 05:35:38 GMT
Great explanation Erick.
Thanks. I'll try that.

On Wed, Apr 28, 2010 at 8:27 PM, Erick Erickson <erickerickson@gmail.com>wrote:

> Quick reply to question (4). Not quite right, you're not gaining anything,
> you still have a yymmddHHMMSS field that will consume all the memory
> when you sort by it.
>
> Remember that the controlling part here is the number of unique values. So
> think about two fields, yymmdd and HHMMSS. Use the HHMMSS as the
> secondary sort and yymmdd as the primary. This will sort correctly since
> any time HHMMSS is used, the yymmdd is guaranteed be equal.
>
> And you can extend this ad nauseum. For instance, you could use 6
> fields, yy, mm, dd, HH, MM, SS and have a very small number of
> unique values in each using really tiny amounts of memory to sort down
> to the second in this case.
>
> Best
> Erick
>
> On Wed, Apr 28, 2010 at 2:24 AM, Samarendra Pratap <samarzone@gmail.com
> >wrote:
>
> > I have got a lot of valuable information in this thread so far.
> > Thanks to all.
> >
> > In my last mail I mentioned only two fields because others' usage was
> > negligible and I thought they are not important. But now after *Toke
> > *explained
> > the formulae, I think sorting on those fields would also be consuming a
> > huge
> > part of memory.There are 2 other sorting fields; one of which is used in
> > both ascending/descending sorting.
> >
> > Within next couple of days (or may be a week) I'll be
> >
> > 1. profiling my application,
> > 2. analyzing and tuning GC options
> >
> >
> >
> > However, I have a few more curiosities -
> >
> > 1. Tom wrote:
> >
> > *Have you checked that your machine is correctly identified as a server*
> > *and has optimized GC settings?*
> >
> > *I did not understand the meaning of "correctly identified as a server"
> Can
> > you please help me understand?*
> > *
> > *
> > 2. *Should I change the type of fields?*
> > ** As I said in my first mail that I have 56 fields in my index, most of
> > them contain a numeric value or one of system defined values (e.g. gender
> > field can contain only "male", "female", or "unknown"). There are only 7
> > fields which are indexed with user defined values.
> > All the fields are created with *Field*
> > (String name, String value, Field.Store store, Field.Index index)
> > It would be creating all the fields as normal string fields. Is it
> > *always*a good idea to use specific classes (NumericField, DateTime
> > etc.). We do not
> > have space problem if that matters.
> >
> > 3. *Is there any advice on number of fields?*
> > *Somewhere on the net I read that instead of keeping different type of
> > values in different fields, (e.g. field1:value1, field2:value2,...) one
> > should practice keeping different values in single field (e.g.
> > field:field1_value1,
> > field:field2_value2,...). But I could not confirm it from anywhere else.
> > Any
> > comments?*
> >
> > 4. Ian wrote:
> >
> > *Sorting by score down to the second will use a lot of memory.  Can you*
> > *make it less granular?*
> >
> > Is it less painful sorting on two fields; first on yymmdd and then on
> > yymmddHHMMSS than sorting just on latter? (Naturally it should use second
> > field, only where required but technically ...?)
> >
> >
> > Thanks again for the invaluable support I am getting from here.
> >
> > - Samar
> >
> > On Wed, Apr 28, 2010 at 9:12 AM, Lance Norskog <goksron@gmail.com>
> wrote:
> >
> > > Solr's timestamp representation (TrieDateField) is tuned for space and
> > > speed. It has a compressed representation, and sorts with far less
> > > space than Strings.
> > >
> > > Also you get something called a date facet, which lets you bucketize
> > > facet searches by time block.
> > >
> > > On Tue, Apr 27, 2010 at 1:02 PM, Toke Eskildsen <
> te@statsbiblioteket.dk>
> > > wrote:
> > > > Samarendra Pratap [samarzone@gmail.com] wrote:
> > > >> 1. Our default option is sort by score, however almost 8% of
> searches
> > > use
> > > >> sorting on a field (yyyymmddHHMMSS). This field is indexed as string
> > > (not as
> > > >> NumericField or DateField).
> > > >
> > > > Guessing that the timestamp is practically unique for each document,
> > > sorting by String takes up a bit more than
> > > > 18M * (40 bytes + 2 * "yyyymmddHHMMSS".length() bytes) ~= 1.2 GB of
> RAM
> > > as the Strings are cached. Coupled with the normal overhead of just
> > opening
> > > an index of your size (500MB by your measurements?), I would have
> guessed
> > > that 3600MB would definitely be enough to open the index and do sorted
> > > searches.
> > > >
> > > > I realize that fiddling with production servers is dangerous, but
> > > connecting with JConsole and forcing a garbage collection might be
> > > acceptable? That should enable you to determine whether you're leaking
> > > memory or if it's just the JVM being greedy. I'd guess you leaking
> > though,
> > > as HotSpot does not normally allocate up to the limit if it does not
> need
> > > to.
> > > >
> > > > Anyway, changing to one of the optimized fields for sorting dates
> > should
> > > shave 1 GB off the memory requirement, so I'll recommend doing that no
> > > matter what the main cause of your memory problems is.
> > > >
> > > > Regards,
> > > > Toke Eskildsen
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > goksron@gmail.com
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Regards,
> > Samar
> >
>



-- 
Regards,
Samar

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message