lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Questions Lucene
Date Tue, 11 Sep 2007 00:44:42 GMT

On Sep 10, 2007, at 7:56 PM, Durga.Tirunagari@Sun.COM wrote:
> 1) What are the various languages supported by Lucene? It looks like
> it is able to handle only English. We are trying to see if it works
> with Japanese, Chinese, and other characters. Can someone answer?

Lucene internally uses UTF-8 (the Java modified version), so you won't
have any encoding issues.  Everything is just text inside the index, so
there is no problem with Chinese, Japanese, or any other language I've
encountered.  There are certainly language-specific considerations,
though: stemming, stop word removal, and, for languages such as Chinese
that do not separate words with whitespace, whether to do anything
special to tokenize on "words", to use n-grams, or just to tokenize
character by character.
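
As a rough illustration of the n-gram option, here is a minimal sketch
in plain Java (not a Lucene API) that splits a run of text into
overlapping two-character terms.  The class and method names are
hypothetical, and the sample string is written with Unicode escapes
that spell a four-character Japanese word.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    // Splits text into overlapping two-character terms. Works on code
    // points so characters outside the Basic Multilingual Plane are
    // never cut in half.
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> terms = new ArrayList<>();
        if (cps.length == 1) {
            // A single character cannot form a pair; keep it as its own term.
            terms.add(new String(cps, 0, 1));
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            terms.add(new String(cps, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four characters yield three overlapping bigrams.
        System.out.println(bigrams("\u6771\u4EAC\u90FD\u5E81"));
    }
}
```

A query string would be split the same way, so that a two-character
query matches any document containing that pair of characters.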

> 2) After Lucene indexes a given data set, how does Lucene handle
> incremental / dynamic change in the data? In other words, our data
> keeps changing; how does Lucene handle this changing data? Does it
> re-index every new file entering this data set, or does it index
> the data in increments?

Documents can be added to an existing index incrementally, but there is
really no such thing as an in-place "update" operation, so the
application is responsible for effecting that with a delete and re-add
on a per-document basis.
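
To make the delete-and-re-add idea concrete, here is a minimal sketch
using a toy in-memory inverted index in plain Java.  It is not Lucene's
implementation or API; ToyIndex and its methods are hypothetical names,
and each document is assumed to carry a unique id chosen by the
application.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyIndex {
    // term -> ids of documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();
    // id -> terms of that document, kept so the document can be deleted
    private final Map<String, Set<String>> termsByDoc = new HashMap<>();

    // Assumes the id is not currently in the index.
    void add(String id, String text) {
        Set<String> terms = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        terms.remove("");
        termsByDoc.put(id, terms);
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(id);
        }
    }

    void delete(String id) {
        Set<String> terms = termsByDoc.remove(id);
        if (terms == null) {
            return; // nothing indexed under this id
        }
        for (String t : terms) {
            Set<String> ids = postings.get(t);
            ids.remove(id);
            if (ids.isEmpty()) {
                postings.remove(t);
            }
        }
    }

    // An "update" is just a delete followed by an add of the new text.
    void update(String id, String text) {
        delete(id);
        add(id, text);
    }

    Set<String> search(String term) {
        return postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc1", "red apple");
        index.update("doc1", "green pear");
        System.out.println(index.search("apple")); // no longer matches doc1
        System.out.println(index.search("pear"));
    }
}
```

The unique id is what makes the delete possible, so the application
needs to index one with every document it may later want to replace.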

> 3) How does Lucene handle deleted files from a particular data set?
> What we are concerned about is whether Lucene automatically figures
> out that a particular file has been deleted from the data set and
> immediately removes the index entry for that file.
>
> 4) Please consider the following scenario. Lucene is given the
> following files to index:
>
>    a) Files under /xyz/abc (say x.txt, y.txt, a.txt, b.txt, c.txt, etc.)
>    b) Files under /def/ghi (say none.txt, dude.txt, hello.txt, etc.)
>
> After Lucene has finished indexing the files under these two
> directories, a subsequent search is made for, say, a "key word" in
> hello.txt. What does Lucene return? Does it return the fully
> qualified location of this file, i.e. /def/ghi/hello.txt?

Lucene is about text, not files per se.  It is your application that
will map that kind of logic on top of Lucene.  Lucene itself knows
nothing of the files you want to index, delete, or search; you will
build that mapping yourself.  Your application will be responsible
for keeping the data and the index in sync.
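
One way an application might keep the two in sync is to remember, for
every indexed path, the last-modified time it saw, and to compare that
against a fresh directory scan.  The sketch below is plain Java with
hypothetical names (SyncPlan, diff) and only computes what needs to be
added, re-indexed, or deleted; feeding those paths to the index is left
to the application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class SyncPlan {
    final List<String> toAdd = new ArrayList<>();
    final List<String> toUpdate = new ArrayList<>();
    final List<String> toDelete = new ArrayList<>();

    // indexed: path -> last-modified time recorded when it was indexed
    // onDisk:  path -> last-modified time seen in the current scan
    static SyncPlan diff(Map<String, Long> indexed, Map<String, Long> onDisk) {
        SyncPlan plan = new SyncPlan();
        // Sorted copies keep the resulting lists in a stable order.
        for (Map.Entry<String, Long> e : new TreeMap<>(onDisk).entrySet()) {
            Long recorded = indexed.get(e.getKey());
            if (recorded == null) {
                plan.toAdd.add(e.getKey());      // new file
            } else if (!recorded.equals(e.getValue())) {
                plan.toUpdate.add(e.getKey());   // changed: delete and re-add
            }
        }
        for (String path : new TreeSet<>(indexed.keySet())) {
            if (!onDisk.containsKey(path)) {
                plan.toDelete.add(path);         // file is gone
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> indexed = new TreeMap<>();
        indexed.put("/def/ghi/hello.txt", 1L);
        indexed.put("/xyz/abc/x.txt", 1L);
        Map<String, Long> onDisk = new TreeMap<>();
        onDisk.put("/def/ghi/hello.txt", 2L);
        onDisk.put("/xyz/abc/y.txt", 1L);
        SyncPlan plan = diff(indexed, onDisk);
        System.out.println(plan.toAdd + " " + plan.toUpdate + " " + plan.toDelete);
    }
}
```

If the application also indexes the path as a field of each document,
a search hit can report something like /def/ghi/hello.txt, and the same
path can serve as the key for the delete.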

> 5) How does Lucene index a particular set of files, i.e. based on
> key words? Based on sentences? Based on what criterion?

Again, it doesn't deal with "files"... your application deals with
that; Lucene is handed text.  As for how it makes words in text
searchable, read up on Lucene Analyzers.  They break the text into
searchable terms.
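
To give a feel for what "breaking text into terms" means, here is a toy
analysis step in plain Java: lowercase the text, split on anything that
is not a letter or digit, and drop a few stop words.  It is only an
illustration; the class name and the stop word list are made up, and
Lucene's own analyzers are more capable than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzer {
    // A tiny, arbitrary stop word list used only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "is"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick brown fox, and the lazy dog."));
    }
}
```

The same analysis normally has to be applied to the query as well,
otherwise a search for "Quick" would not match the indexed term
"quick".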

> 6) Is Lucene multi-threaded? For example, if Lucene is indexing a
> set of files in a given data set and there is a huge file (a 2 GB
> file), does Lucene index this file in parts, in parallel (i.e. in
> multi-threaded fashion), or does it index this file sequentially?

Lucene itself isn't multi-threaded, but most operations are thread-safe
so you can parallelize your application to index multiple documents  
simultaneously, for example.  You may be able to parallelize the  
parsing of those huge files but you'd need to bring that together  
into a single Document instance to hand to Lucene's IndexWriter.
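
The shape of such an application might look like the sketch below,
where several worker threads feed documents to one shared writer.  This
is plain Java: SharedWriter is a hypothetical stand-in for a
thread-safe index writer, not Lucene's IndexWriter, and the thread
count and timeout are arbitrary.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ParallelIndexingSketch {
    // Hypothetical stand-in for a shared, thread-safe index writer.
    static final class SharedWriter {
        private final Map<String, String> docs = new ConcurrentHashMap<>();
        void addDocument(String id, String text) { docs.put(id, text); }
        int size() { return docs.size(); }
    }

    // Indexes every (id, text) pair on a fixed-size thread pool and
    // returns the number of documents the writer ended up holding.
    static int indexAll(Map<String, String> sources, int threads) {
        SharedWriter writer = new SharedWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Map.Entry<String, String> e : sources.entrySet()) {
            // One task per document; all tasks share the same writer.
            pool.submit(() -> writer.addDocument(e.getKey(), e.getValue()));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", ex);
        }
        return writer.size();
    }

    public static void main(String[] args) {
        Map<String, String> sources = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            sources.put("doc" + i, "text of document " + i);
        }
        System.out.println(indexAll(sources, 4)); // prints 100
    }
}
```

Each task here handles one whole document, which matches the point
above: the parallelism is across documents, decided by the application.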

> 7) Also, if a data set has multiple files, does Lucene process each
> file separately in a different thread, or does it do it
> sequentially?

Again, this is up to your application entirely.

> 8) Does Lucene index only text files? We have a few databases; is it
> possible for us to index the data in these databases?

See above :)   All Lucene cares about is text.  How you get text to  
it matters not to Lucene.

> 9) Are there any performance benchmarks for Lucene?

There is a benchmarker framework built into the trunk codebase
suitable for making your own.  There's some stuff here:
http://lucene.apache.org/java/docs/benchmarks.html
and some good stuff linked from
http://wiki.apache.org/lucene-java/BasicsOfPerformance
that should get you started.

	Erik



