lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marc Hadfield <m...@animarc.com>
Subject Re: use Lucene to index sentences
Date Mon, 06 Feb 2006 23:09:24 GMT

Hi AJ -
Performance would depend on the kind of queries you are going to perform 
against sentences.  If you are going to be querying for phrases 
(multi-token), want to make use of stemming, or any kind of term 
expansion (wildcare, synonyms, etc), I imagine lucene would be much 
superior, but I don't have experience with mysql full text search.
---marc

AJ Chen wrote:

>Hi Marc,
>Thanks for your suggestions. Marking sentences in documents and using span
>query is a good approach.  How do you compare its performance to a database
>approach? For example, sentences can be stored in mysql, one sentence per
>row, and they can be searched by mysql's full text search feature. Using
>database, it will be also easy to tell which document the matched sentence
>belongs to.
>
>AJ
>
>On 2/6/06, Marc Hadfield <marc@animarc.com> wrote:
>  
>
>>Hi AJ -
>>
>>Depending on your need, you could create a lucene document for each
>>sentence (in which case searching and returning sentences is trivial),
>>or create a lucene document for each of your documents, with embedded
>>sentence start/stop markers (as a special symbol).  or, instead of a
>>special symbol, you can increase the token count after each
>>end-of-sentence so that there is a large gap inbetween sentences -- this
>>will give higher scores to intra-sentence matches.
>>
>>if you insert special sentence marker symbols, then you could use a span
>>search to guarantee that a phrase happens inside a sentence.  when a
>>match occurs, you can use the document's termpositionvector object to
>>re-create the original sentence, or alternatively, use the embedded
>>sentence number in lucene (perhaps symbols like "__sentence_start" and
>>"__sentence_num_20") to grab the original sentence from a file
>>containing the full text with sentence markers (perhaps xml tags:
>>"<sentence num=20>").
>>
>>I use the techniques such as the above for a very large lucene index of
>>documents with embedded sentence markers.  There are various trade-offs
>>in terms of index size (how much info to keep in index), expected query
>>performance, and so on.
>>
>>---marc hadfield
>>
>>
>>
>>AJ Chen wrote:
>>
>>    
>>
>>>I'll appreciate any advice on whether Lucene is appropriate for
>>>      
>>>
>>index/search
>>    
>>
>>>sentences.  I have millions of documents broken down into millions of
>>>sentences. Each sentence does not exist as a document.  All these
>>>      
>>>
>>sentences
>>    
>>
>>>are in a small number of big files. How can I use Lucene to index/search
>>>      
>>>
>>the
>>    
>>
>>>sentences? Search will return which sentence matches the query.  If
>>>      
>>>
>>Lucene
>>    
>>
>>>does not do it, any better approach besides using mysql database?
>>>
>>>Thanks,
>>>AJ
>>>
>>>
>>>
>>>      
>>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>    
>>
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message