lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Lucene applicability
Date Tue, 31 Aug 2010 12:05:24 GMT
See below

On Tue, Aug 31, 2010 at 5:17 AM, Schreiner Wolfgang <
Wolfgang.Schreiner@itsv.at> wrote:

> Hi,
>
> Thank you all for your time to answer my questions!
> However there are a few more issues which are not quite clear yet and hope
> to get advice on those too:
>
> 1.) How is the index maintained? In another product where we use an indexer
> different from Lucene, we got one central index and a few JBoss servers all
> accessing the same index. So how does Lucene handle synchronization between
> multiple threads (different JVMs)? How does it maintain the index after
> update/delete operations on the database?
>

It doesn't; you have to do that yourself. You'll have to write an app that
periodically queries your DB and, when it detects changes, updates the Lucene
index. Only one JVM should have an open writer at a time, but you can have
as many readers from as many JVMs as you want. Note that the single JVM that
has the writer can use multiple write threads. In other words, writers are
thread-safe.
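
A minimal sketch of that polling pass, in plain Java with no Lucene calls. The
map only stands in for the index, and IndexSync, Row, and the lastModified and
deleted columns are hypothetical names chosen for illustration. It assumes the
DB can report rows changed since the last pass (here the whole table is passed
in and filtered, standing in for such a query).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IndexSync {
    /** Hypothetical DB row: primary key, searchable text, change timestamp, soft-delete flag. */
    static final class Row {
        final long id;
        final String text;
        final long lastModified;
        final boolean deleted;

        Row(long id, String text, long lastModified, boolean deleted) {
            this.id = id;
            this.text = text;
            this.lastModified = lastModified;
            this.deleted = deleted;
        }
    }

    // Stand-in for the index; a real app would hold its single shared writer here.
    private final Map<Long, String> index = new ConcurrentHashMap<>();
    private long lastSync = Long.MIN_VALUE;

    /** Applies every row changed since the previous pass; returns how many were applied. */
    synchronized int syncOnce(List<Row> table) {
        long newest = lastSync;
        int applied = 0;
        for (Row r : table) {
            if (r.lastModified <= lastSync) {
                continue;                      // unchanged since the last pass
            }
            if (r.deleted) {
                index.remove(r.id);            // DB delete becomes an index delete
            } else {
                index.put(r.id, r.text);       // add or replace, keyed by primary key
            }
            newest = Math.max(newest, r.lastModified);
            applied++;
        }
        lastSync = newest;
        return applied;
    }

    String get(long id) {
        return index.get(id);
    }

    int size() {
        return index.size();
    }
}
```

Calling syncOnce from a scheduled task in the one JVM that owns the writer
gives the "periodically query and update" behavior; how often to run it is a
freshness-versus-load decision for the application.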


> 2.) Is the index always up-to-date? In the FAQs it says we have to re-open
> the IndexReader periodically ... how expensive (in computational terms) is
> it to do that on every request for instance?
>

The index is always up to date; how could it be otherwise <G>? When you open
a reader, Lucene essentially takes a snapshot of the index at that moment.
Any later updates are not visible to that reader, hence the comment about
reopening the reader.
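
The snapshot behavior can be illustrated with a toy class in plain Java. This
is not Lucene's API or implementation: SnapshotIndex, Reader, addAndCommit,
and reopen are hypothetical names, and the sketch only assumes that each
commit publishes a new immutable version while open readers keep the version
they started with.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SnapshotIndex {
    // Current committed state; replaced wholesale on each commit, never mutated in place.
    private volatile List<String> committed = Collections.emptyList();

    /** Writer side: publish a new immutable version containing one more document. */
    synchronized void addAndCommit(String doc) {
        List<String> next = new ArrayList<>(committed);
        next.add(doc);
        committed = Collections.unmodifiableList(next);
    }

    /** Reader side: a point-in-time view that never changes after it is opened. */
    static final class Reader {
        private final List<String> view;

        private Reader(List<String> view) {
            this.view = view;
        }

        int numDocs() {
            return view.size();
        }
    }

    Reader openReader() {
        return new Reader(committed);
    }

    /** Returns the same reader if nothing was committed since it was opened, else a fresh one. */
    Reader reopen(Reader old) {
        List<String> now = committed;
        return now == old.view ? old : new Reader(now);
    }
}
```

In this toy, a reopen with no intervening commit is just a reference
comparison, so checking on every request costs almost nothing; the real cost
in any engine comes when there are changes to load, which is one more thing to
measure rather than assume.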


> 3.) I'm still not sure about performance. According to the FAQs we need to
> build our own MultiPhraseQuery parser to support multiple terms and
> wildcards. For example consider 50.000.000 documents, 50.000 of them match
> term T1 in category A, 50.000 match term T2 in category B and 1.000.000
> match term T3 in category C. 50 Match T1 in A and T2 in B and T3 in C. How
> fast is the algorithm in this case? Who guarantees that it doesn't start at
> the 1 million side?
>
>
That can't be answered in the abstract because there are too many variables;
you have to measure. But there are Lucene installations that have many more
documents than that, so you can probably get there. Consider SOLR if you need
to replicate or shard your indexes because they get too big.
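
On the "1 million side" worry, here is a sketch of the general technique for
AND-ing sorted doc-id lists: drive the loop from the shortest list and only
probe the longer ones. It is plain Java and makes no claim about what Lucene
does internally; Conjunction and intersect are hypothetical names, and the
inputs are assumed to be sorted and duplicate-free.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class Conjunction {
    /** Intersects sorted, duplicate-free doc-id lists, driving from the shortest one. */
    static int[] intersect(int[]... postings) {
        if (postings.length == 0) {
            return new int[0];
        }
        int[][] lists = postings.clone();
        // Put the rarest term first so it supplies the candidates.
        Arrays.sort(lists, Comparator.comparingInt((int[] a) -> a.length));

        List<Integer> hits = new ArrayList<>();
        for (int candidate : lists[0]) {
            boolean inAll = true;
            // Probe each longer list with a binary search; stop at the first miss.
            for (int i = 1; i < lists.length && inAll; i++) {
                inAll = Arrays.binarySearch(lists[i], candidate) >= 0;
            }
            if (inAll) {
                hits.add(candidate);
            }
        }

        int[] out = new int[hits.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = hits.get(i);
        }
        return out;
    }
}
```

With lists of 50,000, 50,000, and 1,000,000 ids, this does at most 50,000
candidate checks, each a binary search of roughly 20 steps into the big list,
rather than walking all million entries. Whether a given engine behaves this
way for your queries is still something to confirm by measuring.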

I think you're probably thinking of Lucene in terms of an application, not
an engine. Lucene is what you build your application around; it provides the
guts of the searching. The rest is up to you.

HTH
Erick

> Cheers,
>
> w
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:goksron@gmail.com]
> Sent: Donnerstag, 26. August 2010 05:25
> To: java-user@lucene.apache.org
> Subject: Re: Lucene applicability
>
> A stepping stone to the above is that, in DB terms, a Lucene index is
> only one table. It has a suite of indexing features that are very
> different from database search. The features are oriented to searching
> large bodies of text for "ideas" rather than concrete words. It
> searches a lot faster than a DB. It also spends more time creating its
> various indexes than a DB. Other points- you can't add or drop fields
> or indexes.
>
> On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
> <erickerickson@gmail.com> wrote:
> > The SOLR wiki has lots of good information, start there:
> > http://wiki.apache.org/solr/
> >
> > Otherwise, see below...
> >
> > On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
> > Wolfgang.Schreiner@itsv.at> wrote:
> >
> >> Hi all,
> >>
> >> We are currently evaluating potential search frameworks (such as
> >> Hibernate Search) which might be suitable to use in our project (using
> >> Spring, JPA with Hibernate) ...
> >> I am sending this E-Mail in hope you can advise me on a few issues that
> >> would help us in our decision making process.
> >>
> >>
> >> 1.)    Is Lucene suitable for full text database searches? I read Lucene
> >> was designed to index and search documents but how does it behave
> >> querying relational data sets in general?
> >>
> >
> > Let's start by talking about the phrase "full text database searches".
> > One thing virtually all db-centric people trip over is trying to use
> > SOLR as if it were a database. You just can't think about tables. The
> > first time you think about using SOLR to do something join-like, stop,
> > take a deep breath, and think about documents instead. The general
> > approach is to flatten your data so that each "document" contains all
> > the relevant info. Yes, this leads to de-normalization. Yes,
> > denormalized data makes a good DBA cringe. But that's the difference
> > between searching and using an RDBMS.
> >
> > "Document" is somewhat misleading. A document in SOLR terms is just a
> > collection of fields. And, BTW, there's no requirement that each
> > document have the same fields (very unlike a DB).
> >
> >
> >>
> >> 2.)    Can we make assumptions on query performance considering combined
> >> searches, range queries or structured data and wildcard searches? If we
> >> consider a data structure consisting of say 3 tables and each table
> >> contains a few million entries (e.g. first name, last name and address
> >> fields) and we search for common values (such as 'John', 'Smith' and
> >> 'New York') where
> >>
> >> a.       each value for itself and each combination would result in
> >> millions of hits
> >>
> >
> > Sure, but what those assumptions are is totally dependent on how you've
> > set things up. SOLR has been successfully used on several billion
> > document indexes. There are tools for making all that work (i.e.
> > replication, sharding, etc) built into SOLR. So I suspect you can make
> > things work. Several million documents is not that large a data set.
> >
> > As always, there are tradeoffs between speed and complexity. But from
> > what you've described I see no show stoppers.
> >
> >
> >>
> >> b.      a person can have multiple first names and we want to make sure
> >> to receive any combination of the last name with any first name
> >>
> >>
> > This just sounds like an OR. But the queries can be pretty complex.
> > Some examples of what you expect would help.
> > See multi-valued fields. So, a "document" can have multiple "firstname"
> > entries. Again, not like a DB (your reflexes will trip you up on this
> > point <G>).
> >
> >
> >> c.       we search for a last name and a range of birth dates
> >>
> >>
> > Sure, range queries work just fine. Note that dates can trip you up, look
> > at triedate if you experiment.
> >
> >
> >> 3.)    Transaction safety: How does Lucene handle indexes? If we update
> >> data model and index, what happens to the index if anything goes wrong
> >> as soon as the data model has been persisted?
> >>
> >
> > A lot of work has been done to make SOLR quite robust if "anything goes
> > wrong". That said, how are you backing up your data?
> > That is, what is the source of the data you're going to index? If you're
> > relying on your SOLR index to be your backup, you simply must back it up
> > somewhere "often enough" to get by if your building burns down. I'd also
> > think about storing your original input...
> >
> > This is no different than a DB. You have to guard against the disk
> > crashing, someone walking by with a powerful magnet, earthquake, flood,
> > fires <G>.....
> >
> > Do note that if you modify your index schema, no existing documents
> > reflect the new schema; you have to reindex them.
> >
> >
> >>
> >> I hope I made the issues clear to you, just some general thoughts about
> >> how Lucene would behave in a real world application scenario ... Any
> >> support or pointers to helpful documents or Web links are highly
> >> appreciated!
> >> Cheers for now,
> >>
> >> w
> >>
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
