lucene-java-user mailing list archives

From Danil ŢORIN <torin...@gmail.com>
Subject Re: Use multiple lucene indices
Date Tue, 06 Dec 2011 10:05:51 GMT
How many documents are there in the system?
Approximate it as: 20000 files * avg(docs/file)

From my understanding, your queries will just be lookups by document ID.
(Q: are those IDs unique across files, or do you need to filter by filename?)
If that will be the only use case, then maybe you should consider some other
lookup system; an Ehcache with disk overflow and disk persistence might work
just as well.
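
A rough sketch of what such an ID lookup looks like outside Lucene. A plain
in-memory HashMap stands in for the cache here; this is not the Ehcache API,
and the class and method names are made up for illustration. It keys by file
name first, on the assumption that IDs may repeat across files.

```java
import java.util.HashMap;
import java.util.Map;

final class PositionStore {
    // file name -> (content ID -> {start, stop}); hypothetical layout
    private final Map<String, Map<String, long[]>> byFile = new HashMap<>();

    // Record the binary start/stop positions of one piece of content.
    void put(String file, String id, long start, long stop) {
        byFile.computeIfAbsent(file, f -> new HashMap<>())
              .put(id, new long[] {start, stop});
    }

    // Returns {start, stop}, or null if the file or ID is unknown.
    long[] get(String file, String id) {
        Map<String, long[]> ids = byFile.get(file);
        return ids == null ? null : ids.get(id);
    }
}
```

If the IDs turn out to be globally unique, the outer map can go away and the
lookup becomes a single key-value get.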

If you are anywhere below 200 million documents, I'd say you should go with a
single index that contains all the data on a decent box (2-4 CPUs, 4-8 GB RAM).
On a slightly beefier host with Lucene 4 (try various codecs for speed/memory
usage), I think you could go to 1 billion documents.
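
To make the arithmetic concrete, here is a small sketch that multiplies out
the estimate and compares it with the figures above. The thresholds are the
rules of thumb from this message, not hard Lucene limits, and the
5,000 docs/file average is a made-up number; measure it on real files.

```java
final class IndexSizing {
    // Rules of thumb quoted in the message above, not hard limits.
    static final long SINGLE_INDEX_LIMIT = 200_000_000L;   // decent box
    static final long BEEFIER_HOST_LIMIT = 1_000_000_000L; // beefier host, Lucene 4

    // files * avg(docs/file); throws ArithmeticException on overflow.
    static long estimateDocs(long files, long avgDocsPerFile) {
        return Math.multiplyExact(files, avgDocsPerFile);
    }

    static String recommend(long totalDocs) {
        if (totalDocs < SINGLE_INDEX_LIMIT) return "single index";
        if (totalDocs <= BEEFIER_HOST_LIMIT) return "single index on a beefier host";
        return "partitioned index";
    }

    public static void main(String[] args) {
        long files = 20_000;         // from the estimate above
        long avgDocsPerFile = 5_000; // hypothetical average
        long total = estimateDocs(files, avgDocsPerFile);
        System.out.println(total + " docs -> " + recommend(total));
    }
}
```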

If you plan on more complex queries, such as identifying the document that
contains a given position in a file, then the number of documents should
be reconsidered.
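
For illustration only, this is what a "which document contains this position"
lookup amounts to for a single file, sketched with a sorted map rather than
with Lucene. It assumes the ranges do not overlap and that the stop position
is exclusive; neither is stated in the thread, and all names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class PositionLookup {
    private static final class Range {
        final long stop;     // exclusive (assumption)
        final String docId;
        Range(long stop, String docId) { this.stop = stop; this.docId = docId; }
    }

    // start position -> range; sorted so we can find the nearest start <= position
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    void add(String docId, long start, long stop) {
        byStart.put(start, new Range(stop, docId));
    }

    // Returns the ID of the document covering the position, or null if none does.
    String find(long position) {
        Map.Entry<Long, Range> e = byStart.floorEntry(position);
        if (e == null || position >= e.getValue().stop) return null;
        return e.getValue().docId;
    }
}
```

This is a range lookup rather than an exact-match lookup, which is why it
changes the sizing picture compared with plain ID queries.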

In the worst-case scenario I would go with a partitioned index (5-10
partitions, but not thousands).
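
One simple way to route a document to a fixed, small number of partitions is
to hash its ID. This is only a sketch: the class name, the "index-N" directory
naming, and the choice of hashing (rather than, say, grouping by file) are
assumptions, not something the thread prescribes.

```java
final class PartitionRouter {
    private final int partitions;

    PartitionRouter(int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        this.partitions = partitions;
    }

    // floorMod keeps the result in [0, partitions) even for negative hash codes.
    int partitionFor(String docId) {
        return Math.floorMod(docId.hashCode(), partitions);
    }

    // Hypothetical directory naming scheme, one index directory per partition.
    String indexDirFor(String docId) {
        return "index-" + partitionFor(docId);
    }
}
```

With this scheme, only one searcher per partition needs to stay open. Note
that changing the partition count later remaps existing IDs, so it would
require reindexing.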


On Tue, Dec 6, 2011 at 11:03, Rui Wang <rwang@ebi.ac.uk> wrote:

> Hi Guys,
>
> Thank you very much for your answers.
>
> I will do some profiling on memory usage, but is there any documentation
> on how Lucene uses/allocates the memory?
>
> Best wishes,
> Rui Wang
>
>
> On 6 Dec 2011, at 06:11, KARTHIK SHIVAKUMAR wrote:
>
> > hi
> >
> >>> would the memory usage go through the roof?
> >
> > Yup ....
> >
> > My past experience got me in a pickle there...
> >
> >
> >
> > with regards
> > karthik
> >
> > On Mon, Dec 5, 2011 at 11:28 PM, Rui Wang <rwang@ebi.ac.uk> wrote:
> >
> >> Hi All,
> >>
> >> We are planning to use Lucene in our project, but are not entirely sure
> >> about some of the design decisions we have made. Below are the details;
> >> any comments/suggestions are more than welcome.
> >>
> >> The requirements of the project are below:
> >>
> >> 1. We have tens of thousands of files, ranging in size from 500M to a
> >> few terabytes, and the majority of the contents of these files will not
> >> be accessed frequently.
> >>
> >> 2. We are planning to keep less frequently accessed contents outside of
> >> our database and store them on the file system.
> >>
> >> 3. We also have code to get the binary positions of these contents in the
> >> files. Using these binary positions, we can quickly retrieve the contents
> >> and convert them into our domain objects.
> >>
> >> We think Lucene provides a scalable solution for storing and indexing
> >> these binary positions, so the idea is that each piece of content in the
> >> files will be a document. Each document will have at least an ID field to
> >> identify the content and a binary position field that contains the start
> >> and stop positions of the content. Having done some performance testing,
> >> it seems to us that Lucene is well capable of doing this.
> >>
> >> At the moment, we are planning to create one Lucene index per file, so
> >> if we have new files to add to the system, we can simply generate a new
> >> index. The problem is to do with searching: this approach means that we
> >> need to create a new IndexSearcher every time a file is accessed through
> >> our web service. We know that it is rather expensive to open a new
> >> IndexSearcher, and are thinking of using some kind of pooling mechanism.
> >> Our questions are:
> >>
> >> 1. Is this one-index-per-file approach a viable solution? What do you
> >> think about pooling IndexSearchers?
> >>
> >> 2. If we have many IndexSearchers open at the same time, would the
> >> memory usage go through the roof? I couldn't find any documentation on
> >> how Lucene uses and allocates memory.
> >>
> >> Thank you very much for your help.
> >>
> >> Many thanks,
> >> Rui Wang
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
> > --
> > *N.S.KARTHIK
> > R.M.S.COLONY
> > BEHIND BANK OF INDIA
> > R.M.V 2ND STAGE
> > BANGALORE
> > 560094*
>
>
