lucene-java-user mailing list archives

From "howard chen" <howac...@gmail.com>
Subject Re: [Interesting Question] How to implement Indexes Grouping?
Date Sat, 16 Dec 2006 16:27:52 GMT
On 12/16/06, Erick Erickson <erickerickson@gmail.com> wrote:
> You can't tell until you get some numbers. So try it. I'm indexing 4,600
> books in about 45 minutes on a laptop as part of my current project. So it
> shouldn't be much of a problem to index, say, 10,000 books as a starter set.
> This will give you some idea of the size of your index(es), and some idea of
> the performance. You're almost required to do this since nobody can answer
> performance questions in the abstract. It depends.... how much are you
> indexing? What is your index structure? etc. etc. etc.
>
> Be aware that Lucene indexes the first 10,000 tokens by default. You can
> make this as large as you want, but you have to do this consciously.
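[Editor's note: the default cap Erick mentions is `IndexWriter`'s maxFieldLength. A minimal sketch of raising it, assuming the Lucene 2.x-era API current at the time of this thread (`directory` is a previously opened `Directory`):]

```java
// Sketch, assuming Lucene 2.x-era API: raise the default 10,000-token
// cap per field so long books are indexed in full.
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
writer.setMaxFieldLength(Integer.MAX_VALUE);  // or a deliberate, measured limit
```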
>
> It should take you less than a day to create a test harness that fires off N
> threads at your searcher to measure load. I can't emphasize enough how
> valuable this will be as you design your system.
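[Editor's note: the thread-based test harness Erick describes can be sketched in plain Java. `doSearch()` is a placeholder, not from the thread; a real harness would call `IndexSearcher.search()` with sampled queries.]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal load-test sketch: fire N threads at a search routine and
// count completed queries.
public class LoadTest {
    static void doSearch(String query) {
        // placeholder for searcher.search(queryParser.parse(query))
    }

    static long run(int threads, int queriesPerThread) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < queriesPerThread; i++) {
                    doSearch("sample query");
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        long n = run(8, 1000);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " queries in " + ms + " ms");
    }
}
```

Timing `run()` at several thread counts gives the throughput-vs-concurrency curve for the presumed load.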
>
> Changing from a single index to a distributed one isn't difficult, see
> MultiSearcher. Partitioning the index is something you'll have to do anyway
> and I'd build it on multiple instances of a simple indexer, so starting with
> the simple, single index case doesn't waste any time.
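[Editor's note: the single-to-distributed switch Erick refers to looks roughly like this, assuming the Lucene 2.x-era `MultiSearcher` API; the partition paths are illustrative.]

```java
// Sketch, assuming Lucene 2.x-era API: search several partition
// indexes through one MultiSearcher.
Searchable[] partitions = new Searchable[] {
    new IndexSearcher("/indexes/part1"),   // illustrative paths
    new IndexSearcher("/indexes/part2"),
};
MultiSearcher searcher = new MultiSearcher(partitions);
Hits hits = searcher.search(query);  // same call as on a single IndexSearcher
```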
>
> You need to answer some questions for yourself... stored or unstored text?
> How many other fields do you want to store and/or index? What is acceptable
> performance?
>
> My point is that you can get quite a ways with a very simple design, without
> doing much in the way of throw-away work. And the answers you get from the
> simple case will give you actual data to make further decisions. Otherwise,
> you risk making a complex solution that you don't need. Do you have any
> basis at all for estimating that 20 subgroups is sufficient and necessary?
>
> Your goal here is to get the answer for your final design as quickly as
> possible. At the same time, you want to waste as little time writing code
> that you'll discard later. So try the simple case on a test data set. This
> will get your index design into a firmer state and you can load-test it with
> your presumed load and get actual data for your system. Until you do this,
> any answer you have is just a guess.
>
> Best
> Erick
>
> On 12/16/06, howard chen <howachen@gmail.com> wrote:
> >
> > On 12/16/06, Erick Erickson <erickerickson@gmail.com> wrote:
> > > I'd start with just one big index and test <G>. My point is that you can't
> > > speculate. The first question you have to answer is "is searching the whole
> > > index fast enough given my architecture?" and we can't answer that. Nor can
> > > you until you try.......
> > >
> > > We especially can't speculate since you've provided no clue how many users
> > > you're talking about. 10? 1,000,000? How many books do you expect them to
> > > own? 10? 100,000? I can't imagine separate indexes for 1M users each owning
> > > all 1000 books. I can imagine it for 10 users owning 100 books.....
> > >
> > > Assuming that you get decent performance in a single index, I'd create a
> > > filter at query time for each user. The filter has the bits turned on for
> > > the books the user owns; include it as part of a BooleanQuery when you
> > > search the text. The filters could even be stored permanently rather than
> > > created each time, but I'd save that refinement for later.....
> > >
> > > Note that if you do store a filter, they are quite small: 1 bit per book
> > > (+ very small overhead)....
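[Editor's note: the "1 bit per book" filter core can be shown with a plain `java.util.BitSet`; in Lucene 2.x-era code this bit set would be returned from a custom `Filter`'s `bits(IndexReader)` method. The class and method names below are illustrative, not from the thread.]

```java
import java.util.BitSet;

// Sketch of the per-user filter idea: one bit per book in the library,
// set when the user owns the book stored in that Lucene document id.
public class BookshelfFilter {
    static BitSet ownedBooks(int totalBooks, int[] ownedDocIds) {
        BitSet bits = new BitSet(totalBooks);
        for (int docId : ownedDocIds) {
            bits.set(docId);  // allow this document through the filter
        }
        return bits;
    }
}
```

At 100K books this is roughly 12.5 KB per user, which matches Erick's point that stored filters stay small.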
> > >
> > > Best
> > > Erick
> > >
> > > On 12/16/06, howard chen <howachen@gmail.com> wrote:
> > > >
> > > > Consider the following interesting situation,
> > > >
> > > > A library has around 100K books and wants them indexed by Lucene. This
> > > > seems straightforward, but....
> > > >
> > > > The target is:
> > > >
> > > > 0. You can search all books in the whole library [easy, just index it]
> > > >
> > > > 1. Users in this system can own a number of books in their personal
> > > > bookshelf, and they might want to search ONLY the books in their
> > > > bookshelf.
> > > >
> > > > 2. If each user owns a copy of the index of their personal bookshelf,
> > > > this seems to be a waste of storage space, as books are shared by many
> > > > users.
> > > >
> > > > 3. If the whole index is searched regardless of which books a user
> > > > owns, this seems to be a waste of computation power for a user who
> > > > owns only a few books.
> > > >
> > > >
> > > > In this situation, how would you design an indexing + search system?
> > > >
> > > > Any ideas to share?
> > > >
> > > > :)
> > > >
> > > > Thanks.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > >
> >
> > I agree that a filter is one way to implement it. My concern is that with
> > such a big index, say 100K books with full text indexed, the index will
> > become the bottleneck, and it will be difficult to distribute the indexing
> > and searching.
> >
> > My initial thinking is to group the index by call number, say by dividing
> > the 100K books into 20 subgroups; when a user searches, the system creates
> > 20 threads to search for books on different servers.
> >
> >
> >
>
>

Thanks for your help, really useful!


