lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Using lucene as a database... good idea or bad idea?
Date Tue, 29 Jul 2008 14:25:31 GMT
Don't connect "database" (i.e. SQL, transactions, etc.) and Lucene.   
Connect data storage with simple, fast lookup and Lucene.

One field is the key (i.e. the filename) the other field is a binary,  
stored Field containing the contents of the file.  Of course, there  
are other ways of slicing and dicing, such that one can search (in the  
fuzzy sense) the content and the key by adding tokenization, etc.   
This is the more traditional model for Lucene

Also, have a look at Apache Jackrabbit.  It is a content repository  
that is implemented with Lucene.

-Grant

On Jul 29, 2008, at 10:02 AM, ನಾಗೇಶ್  
ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S) wrote:

> Hi Ian,
> Yes, I see that we are discussing an "option" here.
>
> But, as I said before (the three parts to search-based solution), I  
> do not
> know (but, would like to know) how Lucene (java only - not Nutch,  
> Solr,
> etc.) can be used as a datastore.
>
> Basically, I am not able to connect "database" and Lucene java. :)
>
> Nagesh
>
>
> On Tue, Jul 29, 2008 at 6:51 PM, Ian Lea <ian.lea@gmail.com> wrote:
>
>> I don't think that anyone in this thread has said "should", just
>> "could" - it is a valid option (IMHO).  Personally, I use it as a
>> store for lucene related data because I know and like and trust it,  
>> it
>> is already there for this project so no need to introduce another
>> software dependency, and because it is blindingly fast.
>>
>>
>> --
>> Ian.
>>
>>
>> On Tue, Jul 29, 2008 at 1:43 PM, ನಾಗೇಶ್  
>> ಸುಬ್ರಹ್ಮಣ್ಯ (Nagesh S)
>> <nageshblore@gmail.com> wrote:
>>> The way I see it, search solutions (on whatever scale) have three
>> components
>>> - data aggregation, indexing/searching and presentation of  
>>> results. I
>>> thought, Lucene did the second part only.
>>>
>>> So, I do not quite follow, why should Lucene be used for datastore ?
>>>
>>> Nagesh
>>>
>>> On Tue, Jul 29, 2008 at 6:01 PM, Grant Ingersoll  
>>> <gsingers@apache.org
>>> wrote:
>>>
>>>> I think the answer is it can be done and probably quite well.  I  
>>>> also
>> think
>>>> it's informative that Nutch does not use Lucene for this  
>>>> function, as I
>>>> understand it, but that shouldn't stop you either.  You might  
>>>> also have
>> a
>>>> look at Apache Jackrabbit, which uses Lucene underneath as a  
>>>> content
>>>> repository.
>>>>
>>>> -Grant
>>>>
>>>>
>>>> On Jul 29, 2008, at 5:34 AM, Ganesh - yahoo wrote:
>>>>
>>>> Hello all,
>>>>>
>>>>> I am also interested in this. I want to archive the content of the
>>>>> document using Lucene.
>>>>>
>>>>> Is it a good idea to use Lucene as storage engine?
>>>>>
>>>>> Regards
>>>>> Ganesh
>>>>>
>>>>> ----- Original Message ----- From: "Ian Lea" <ian.lea@gmail.com>
>>>>> To: <java-user@lucene.apache.org>
>>>>> Sent: Tuesday, July 29, 2008 2:18 PM
>>>>> Subject: Re: Using lucene as a database... good idea or bad idea?
>>>>>
>>>>>
>>>>> John
>>>>>>
>>>>>>
>>>>>> I think it's a great idea, and do exactly this to store 5  
>>>>>> million+
>>>>>> documents with info that it takes way too long to get out of our
>>>>>> Oracle database (think days).  Not as many docs as you are  
>>>>>> talking
>>>>>> about, and less data for each doc, but I wouldn't have any  
>>>>>> concerns
>>>>>> about scaling.  There are certainly lucene indexes out there  
>>>>>> bigger
>>>>>> than what you propose.  You can compress the stored data to  
>>>>>> save some
>>>>>> space.  Run times for optimization might get interesting but see
>>>>>> recent threads for suggestions on that.  And since you are not  
>>>>>> too
>>>>>> concerned about performance you may not need to optimize much,  
>>>>>> or even
>>>>>> at all.
>>>>>>
>>>>>> Of course you need to remember that this is not a DBMS solution 

>>>>>> in the
>>>>>> sense of transactions, recovery, etc. but I'm sure you are  
>>>>>> already
>>>>>> aware of that.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ian.
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 29, 2008 at 2:53 AM, John Evans <john@jpevans.com>
 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have successfully used Lucene in the "tradtiional" way to 

>>>>>>> provide
>>>>>>> full-text search for various websites.  Now I am tasked with
>> developing
>>>>>>> a
>>>>>>> data-store to back a web crawler.  The crawler can be  
>>>>>>> configured to
>>>>>>> retrieve
>>>>>>> arbitrary fields from arbitrary pages, so the result is that
 
>>>>>>> each
>>>>>>> document
>>>>>>> may have a random assortment of fields.  It seems like Lucene
 
>>>>>>> may be
>> a
>>>>>>> natural fit for this scenario since you can obviously add  
>>>>>>> arbitrary
>>>>>>> fields
>>>>>>> to each document and you can store the actually data in the 

>>>>>>> database.
>>>>>>> I've
>>>>>>> done some research to make sure that it would meet all of our
>> individual
>>>>>>> requirements (that we can iterate over documents, update
>>>>>>> (delete/replace)
>>>>>>> documents, etc.) and everything looks good.  I've also seen a
 
>>>>>>> couple
>> of
>>>>>>> references around the net to other people trying similar  
>>>>>>> things...
>>>>>>> however,
>>>>>>> I know it's not meant to be used this way, so I thought I  
>>>>>>> would post
>>>>>>> here
>>>>>>> and ask for guidance?  Has anyone done something similar?  Is
 
>>>>>>> there
>> any
>>>>>>> specific reason to think this is a bad idea?
>>>>>>>
>>>>>>> The one thing that I am least certain about his how well it will
>> scale.
>>>>>>> We
>>>>>>> may reach the point where we have tens of millions of  
>>>>>>> documents and a
>>>>>>> high
>>>>>>> percentage of those documents may be relatively large (10k-50k
 
>>>>>>> each).
>>>>>>> We
>>>>>>> actually would NOT be expecting/needing Lucene's normal  
>>>>>>> extreme fast
>>>>>>> text
>>>>>>> search times for this, but we would need reasonable times for
 
>>>>>>> adding
>> new
>>>>>>> documents to the index, retrieving documents by ID (for  
>>>>>>> iterating
>> over
>>>>>>> all
>>>>>>> documents), optimizing the index after a series of changes, etc.
>>>>>>>
>>>>>>> Any advice/input/theories anyone can contribute would be greatly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>
>>>>> Send instant messages to your online friends
>>>>> http://in.messenger.yahoo.com
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com
>>>>
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message