lucene-java-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Lucene applicability
Date Thu, 26 Aug 2010 03:24:47 GMT
A stepping stone to the above is that, in DB terms, a Lucene index is
only one table. It has a suite of indexing features that are very
different from database search: they are oriented toward searching
large bodies of text for "ideas" rather than concrete words. It
searches a lot faster than a DB, but it also spends more time building
its various indexes than a DB does. One other point: you can't add or
drop fields or indexes on the fly.
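
A minimal sketch of that "one table" idea in plain Lucene (this uses a
recent Lucene API; the path and field names are made up, and older
releases spell these constructors differently):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class OneTableSketch {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get("/tmp/person-index")), cfg)) {
          // The whole index is the "table"; each document is a "row".
          Document doc = new Document();
          doc.add(new StringField("id", "1", Field.Store.YES));  // exact match
          doc.add(new TextField("bio", "Engineer in New York", Field.Store.YES));
          writer.addDocument(doc);
          // No ALTER TABLE: new fields simply appear on new documents, and
          // old documents keep whatever fields they were indexed with.
        }
      }
    }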

On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
<erickerickson@gmail.com> wrote:
> The SOLR wiki has lots of good information, start there:
> http://wiki.apache.org/solr/
>
> Otherwise, see below...
>
> On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
> Wolfgang.Schreiner@itsv.at> wrote:
>
>> Hi all,
>>
>> We are currently evaluating potential search frameworks (such as Hibernate
>> Search) that might be suitable for our project (using Spring and JPA
>> with Hibernate) ...
>> I am sending this e-mail in the hope that you can advise me on a few issues
>> that would help us in our decision-making process.
>>
>>
>> 1.) Is Lucene suitable for full-text database searches? I read that Lucene
>> was designed to index and search documents, but how does it behave when
>> querying relational data sets in general?
>>
>
> Let's start by talking about the phrase "full text database searches". One
> thing virtually all db-centric people trip over is trying to use SOLR as if
> it were a database. You just can't think in terms of tables. The first time
> you think about using SOLR to do something join-like, stop, take a deep
> breath, and think about documents instead. The general approach is to
> flatten your data so that each "document" contains all the relevant info.
> Yes, this leads to de-normalization. Yes, denormalized data makes a good
> DBA cringe. But that's the difference between searching and using an RDBMS.
>
> "Document" is somewhat misleading. A document in SOLR terms is just a
> collection of fields. And, BTW,
> there's no requirement that each document have the same fields (very unlike
> a DB).
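>
> To make that concrete, here's a rough SolrJ sketch of one flattened
> "document" (the core name and fields are invented, and this uses the newer
> SolrJ client API, so treat it as a sketch rather than gospel):
>
>     import org.apache.solr.client.solrj.impl.HttpSolrClient;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class FlattenSketch {
>       public static void main(String[] args) throws Exception {
>         try (HttpSolrClient solr = new HttpSolrClient.Builder(
>             "http://localhost:8983/solr/people").build()) {
>           // Fields that would live in three DB tables, in one document.
>           SolrInputDocument doc = new SolrInputDocument();
>           doc.addField("id", "42");
>           doc.addField("firstname", "John");     // person table
>           doc.addField("lastname", "Smith");
>           doc.addField("street", "5th Avenue");  // address table
>           doc.addField("city", "New York");
>           solr.add(doc);
>           solr.commit();
>         }
>       }
>     }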
>
>
>>
>> 2.) Can we make assumptions on query performance considering combined
>> searches, range queries on structured data, and wildcard searches? If we
>> consider a data structure consisting of, say, 3 tables, and each table
>> contains a few million entries (e.g. first name, last name and address
>> fields), and we search for common values (such as 'John', 'Smith' and
>> 'New York') where
>>
>> a. each value for itself and each combination would result in millions
>> of hits
>>
>
> Sure, but what those assumptions are depends entirely on how you've set
> things up. SOLR has been used successfully on indexes of several billion
> documents, and the tools for making that work (e.g. replication, sharding)
> are built into SOLR. So I suspect you can make things work; several million
> documents is not that large a data set.
>
> As always, there are tradeoffs between speed and complexity, but from what
> you've described I see no show stoppers.
>
>
>>
>> b. a person can have multiple first names and we want to make sure to
>> receive any combination of the last name with any first name
>>
>>
> This just sounds like an OR, though the queries can get pretty complex; some
> examples of what you expect would help. Look at multi-valued fields: a
> "document" can have multiple "firstname" entries. Again, not like a DB (your
> reflexes will trip you up on this point <G>).
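>
> A tiny SolrJ sketch of that, reusing the solr client from the sketch above
> (field names are invented, and the schema is assumed to declare firstname
> as multiValued="true"):
>
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", "7");
>     doc.addField("firstname", "John");
>     doc.addField("firstname", "Johann");  // second value, same field
>     doc.addField("lastname", "Smith");
>     solr.add(doc);
>     solr.commit();
>
>     // Matches regardless of which first-name slot holds the value:
>     SolrQuery q = new SolrQuery("lastname:Smith AND firstname:John");
>     QueryResponse rsp = solr.query(q);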
>
>
>> c. we search for a last name and a range of birth dates
>>
>>
> Sure, range queries work just fine. Note that dates can trip you up; look at
> TrieDateField if you experiment.
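>
> For instance, everyone named Smith born in the 1950s might look like this,
> again reusing the client from the earlier sketch (field names assumed; Solr
> dates must be full ISO 8601 timestamps in UTC):
>
>     SolrQuery q = new SolrQuery(
>         "lastname:Smith AND "
>         + "birthdate:[1950-01-01T00:00:00Z TO 1959-12-31T23:59:59Z]");
>     QueryResponse rsp = solr.query(q);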
>
>
>> 3.) Transaction safety: How does Lucene handle indexes? If we update the
>> data model and the index, what happens to the index if anything goes wrong
>> once the data model has been persisted?
>>
>
> A lot of work has been done to make SOLR quite robust if "anything goes
> wrong". That said, how are you backing up your data? That is, what is the
> source of the data you're going to index? If you're relying on your SOLR
> index to be your backup, you simply must back it up somewhere "often enough"
> to get by if your building burns down. I'd also think about storing your
> original input...
>
> This is no different from a DB: you have to guard against the disk crashing,
> someone walking by with a powerful magnet, earthquakes, floods, fires
> <G>.....
>
> Do note that if you modify your index schema, no existing documents reflect
> the new schema; you have to reindex them.
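>
> A rough sketch of what a full reindex can look like, once more reusing the
> SolrJ client from above (deleteByQuery is real SolrJ; the Person type and
> loadAllPersonsFromDatabase are hypothetical stand-ins for your system of
> record):
>
>     // Wipe the index, then re-feed every row from the system of record.
>     solr.deleteByQuery("*:*");
>     for (Person p : loadAllPersonsFromDatabase()) {  // hypothetical loader
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", p.getId());
>       doc.addField("lastname", p.getLastName());
>       solr.add(doc);
>     }
>     solr.commit();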
>
>
>>
>> I hope I made the issues clear to you; just some general thoughts about how
>> Lucene would behave in a real-world application scenario ... Any support or
>> pointers to helpful documents or Web links are highly appreciated!
>> Cheers for now,
>>
>> w
>>
>>
>



-- 
Lance Norskog
goksron@gmail.com
