lucene-java-user mailing list archives

From Peter Carlson <carl...@bookandhammer.com>
Subject Re: Partial word search with unicode contents
Date Tue, 04 Jun 2002 14:31:13 GMT
At its core, Lucene looks for exact term matches. However, between the query
string and the actual matching there is an analyzer that may manipulate the
query. You may have to create a Devanagari (Hindi) analyzer that correctly
tokenizes the terms.
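
As a rough illustration of why tokenization matters here, the sketch below
uses only the Java standard library, not Lucene. It assumes a tokenizer that
keeps only characters Java classifies as letters or digits; whether any
particular Lucene analyzer behaves this way is an assumption I have not
verified. Devanagari vowel signs such as U+093E are combining marks, not
letters, so such a rule would cut one word into fragments. The class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class DevanagariTokenSketch {

    // Decide whether a code point belongs inside a token.
    // With keepMarks == false only letters and digits count (the naive rule);
    // with keepMarks == true combining marks (vowel signs, virama) count too.
    static boolean isWordChar(int cp, boolean keepMarks) {
        if (Character.isLetterOrDigit(cp)) {
            return true;
        }
        if (!keepMarks) {
            return false;
        }
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Split text into maximal runs of word characters.
    static List<String> tokenize(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isWordChar(cp, keepMarks)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One Hindi word written with escapes so the source file encoding does not matter.
        String word = "\u092D\u093E\u0930\u0924";
        System.out.println("letters only:      " + tokenize(word, false).size() + " token(s)");
        System.out.println("letters and marks: " + tokenize(word, true).size() + " token(s)");
    }
}
```

If the indexed terms turn out to be fragments like these, a query could match
pieces of words, which would look like partial-word matching.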

Note that Lucene stores all terms in Unicode and will compare them as
correctly as Java compares them.
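
To make "as Java compares them" concrete: Java's String.equals compares
UTF-16 code units, so two strings that render identically but use different
code point sequences are not equal. The sketch below shows this with the
Devanagari letter QA, which can be written as U+0958 or as U+0915 followed by
the nukta U+093C. Normalizing before comparison is my own suggestion for
illustration, not something the original message prescribes, and the class
and method names are hypothetical.

```java
import java.text.Normalizer;

public class CompareSketch {

    // Plain Java comparison: code unit by code unit.
    static boolean rawEquals(String a, String b) {
        return a.equals(b);
    }

    // Comparison after bringing both strings to the same normalization form (NFC).
    static boolean normalizedEquals(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u0958";        // DEVANAGARI LETTER QA
        String decomposed = "\u0915\u093C";   // KA followed by NUKTA
        System.out.println("raw equals:        " + rawEquals(precomposed, decomposed));
        System.out.println("normalized equals: " + normalizedEquals(precomposed, decomposed));
    }
}
```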

One other problem I have had, and have seen others have, is not importing the
data into a Java String correctly, so the analyzer and indexer never see the
correct terms.
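
A minimal sketch of that failure mode, using only the Java standard library:
the same UTF-8 bytes are decoded once with the right charset and once with
ISO-8859-1, which stands in here for a wrong platform default (that choice is
an assumption for illustration). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DecodeSketch {

    // Decode bytes that are known to be UTF-8 with an explicit charset.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode the same bytes with the wrong charset; every byte becomes one char.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A four-character Hindi word, written with escapes to avoid source encoding issues.
        String original = "\u092D\u093E\u0930\u0924";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String good = decodeUtf8(raw);
        String bad = decodeLatin1(raw);

        System.out.println("bytes on disk:    " + raw.length);
        System.out.println("correct decoding: " + good.length() + " chars, equal to original: " + good.equals(original));
        System.out.println("wrong decoding:   " + bad.length() + " chars, equal to original: " + bad.equals(original));
    }
}
```

When reading files, the same idea applies: pass the charset explicitly (for
example to InputStreamReader) instead of relying on the platform default.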

One way I have debugged this kind of problem is to look at the terms that
the analyzer created and that were added to the index. I do this with a
hacked-up JSP page (see below).


I hope this helps.


--Peter



<%@page contentType="text/html"%>
<%@page import="org.apache.lucene.index.IndexReader"%>
<%@page import="org.apache.lucene.index.TermEnum"%>
<%@page import="org.apache.lucene.index.Term"%>
<html>
<head><title>View terms</title></head>
<body>
View Terms
<%
    // Location of the index on disk, relative to the web application root.
    String indexPath = application.getRealPath("/") + "data/XMLIndex.idx";
    IndexReader ir = IndexReader.open(indexPath);
    out.println("Total docs = " + ir.numDocs());
    out.println("<TABLE><TR><TH>term</TH><TH>freq</TH></TR>");
    // Walk every term in the index and print those from the fields of interest.
    TermEnum te = ir.terms();
    while (te.next()) {
        Term term = te.term();
        int docFreq = te.docFreq();
        if (term.field().equals("text") || term.field().equals("title")) {
            out.println("<TR><TD>" + term.field() + ":" + term.text()
                    + "</TD><TD>" + docFreq + "</TD></TR>");
        }
    }
    out.println("</TABLE>");
    te.close();
    ir.close();
%>
</body>
</html>
 


On 6/4/02 1:48 AM, "Harpreet S Walia" <harpreet@sansuisoftware.com> wrote:

> Hi,
> 
> We are using Lucene to index and search Unicode (UTF-8) contents in the
> Devanagari (Hindi) language.
> 
> What we have observed is that our query fetches results which have partial
> word matches, i.e. if it were English then a query "india" would return
> words like indian, southindia and so on.
> 
> Is there a way by which we can instruct Lucene to only search complete words
> and not word parts?
> 
> TIA
> 
> Regards
> harpreet
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
> 
> 

