lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Hit highlighting for non-english unicode index/queries not working?
Date Tue, 26 May 2009 15:08:38 GMT
LowerCaseFilter is part of Lucene, as are any number of other filters. The
basic idea is just that *after* tokenization, there may be further
transformations you want to apply to each token, such as lower-casing
it, stemming it, skipping it, <whatever else you might like to do>....

But watch out a bit: there are token Filters and search Filters, and they
have nothing to do with each other <G>.

Erick
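The tokenize-then-filter idea can be sketched in plain Java, without any
Lucene classes (the names below are purely illustrative, not Lucene API):
split the input on whitespace, then run each token through a lower-casing
stage and a stop-word stage, in that order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineSketch {
    // Tokenize on whitespace, then pass each token through the "filter"
    // stages: lower-casing, then stop-word removal.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {   // tokenizer stage
            String token = raw.toLowerCase();     // lower-case filter stage
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);                // stop-word filter stage
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "of"));
        System.out.println(analyze("The Quick Fox", stops)); // [quick, fox]
    }
}
```

In Lucene itself the same chaining is done by wrapping one TokenStream in
another, as in the MyAnalyzer example quoted below.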

On Tue, May 26, 2009 at 9:51 AM, KK <dioxide.software@gmail.com> wrote:

> Thank you Erick.
> As of now I'm using WhitespaceAnalyzer, with no stemming and no stop-word
> removal. Now I feel writing a simple analyzer won't be that difficult after
> going through your mail. I'll give it a try. I don't have any idea about
> filters, but I'm pretty sure it must be simple, and I'll definitely go
> through the examples in LIA 2nd Edn. Thank you.
>
> --KK
>
> On Tue, May 26, 2009 at 6:55 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > It's fairly easy to construct your own analyzer by stringing together
> > some filters and tokenizers. LIA (1st ed)
> > had a SynonymAnalyzer. You probably want something like
> > (WARNING, example only, I'm not even sure it compiles!! Ripped
> > off from the wiki)
> >
> > public class MyAnalyzer extends Analyzer
> > {
> >     public TokenStream tokenStream(String fieldName, final Reader reader) {
> >         return new LowerCaseFilter(new WhitespaceTokenizer(reader));
> >     }
> > }
> >
> > There are a number of Filters you can string together if you want to,
> > say, remove stop words, etc.
> >
> > HTH
> > Erick
> >
> > On Tue, May 26, 2009 at 6:38 AM, KK <dioxide.software@gmail.com> wrote:
> >
> > > Thank you @Muir.
> > > I was earlier using SimpleAnalyzer for all purposes, but as you
> > > recommended the whitespace one, I tried that analyzer instead, and the
> > > good thing is that I'm able to index/search non-English text as well as
> > > support hit highlighting for these non-English texts. Thank you very
> > > much.
> > > But now there is one silly problem. As WhitespaceAnalyzer does not do
> > > anything other than separate tokens on whitespace, for English pages
> > > case-folding is getting missed. Unless I provide the exact words,
> > > including the right case, it does not give me results, which is quite
> > > obvious. As I went through the LIA 2nd Edn book, I found that it
> > > mentions we can use analyzers at the document level and also at the
> > > field level. I was quite amazed at the granularity of analysis
> > > supported by Lucene; it's there, we just have to make use of it. So I'm
> > > thinking of giving it a try, which will help me support both English
> > > and non-English indexing/searching/highlighting. Thank you all. Any
> > > ideas on the same are always welcome.
> > >
> > > Thanks,
> > > KK.
> > >
> > >
> > > On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > > As mentioned previously, I don't think your text is being analyzed
> > > > the way you want.
> > > >
> > > > SimpleAnalyzer will break your word
> > > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE (பரிணாம) into 3 tokens:
> > > >
> > > > \u0BAA\u0BB0
> > > > \u0BA3
> > > > \u0BAE
> > > >
> > > > Not only does it incorrectly split your word into three words, but it
> > > > completely drops the dependent vowels (\u0BBF and \u0BBE).
> > > >
> > > > This is why I would recommend trying WhitespaceAnalyzer instead.
> > > > Also take a look at the Luke index tool; it's a very quick way to see
> > > > how your words are being analyzed by various analyzers.
> > > >
> > > >
> > > > On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.software@gmail.com>
> > wrote:
> > > >
> > > > > Hi,
> > > > > I'm trying to index some non-English texts. Indexing and searching
> > > > > are working fine. From the command line I'm able to provide the
> > > > > UTF-8 Unicode text as input, like this:
> > > > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > > > and am able to get the search results.
> > > > > Then I tried to add hit highlighting for the same. So I started
> > > > > with simple English texts and used phrase queries for providing the
> > > > > input queries. My code looks like this:
> > > > >
> > > > >
> > > > > import java.io.FileReader;
> > > > > import java.io.IOException;
> > > > > import java.io.InputStreamReader;
> > > > > import java.util.Date;
> > > > > import java.io.*;
> > > > > import java.nio.charset.Charset;
> > > > >
> > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > import org.apache.lucene.document.Document;
> > > > > import org.apache.lucene.index.FilterIndexReader;
> > > > > import org.apache.lucene.index.IndexReader;
> > > > > import org.apache.lucene.index.Term;
> > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > import org.apache.lucene.search.HitCollector;
> > > > > import org.apache.lucene.search.Hits;
> > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > import org.apache.lucene.search.Query;
> > > > > import org.apache.lucene.search.PhraseQuery;
> > > > > import org.apache.lucene.search.ScoreDoc;
> > > > > import org.apache.lucene.search.Searcher;
> > > > > import org.apache.lucene.search.TopDocCollector;
> > > > > import org.apache.lucene.search.highlight.Highlighter;
> > > > > import org.apache.lucene.search.highlight.QueryScorer;
> > > > > import org.apache.lucene.search.Scorer;
> > > > > import org.apache.lucene.analysis.TokenStream;
> > > > > import org.apache.lucene.analysis.SimpleAnalyzer;
> > > > >
> > > > >
> > > > > /** Simple command-line based search demo. */
> > > > > public class LuceneSearcher {
> > > > >     // core36 refers to the exact index directory for Tamil pages
> > > > >     private static final String indexPath = "/opt/lucene/index" + "/core36";
> > > > >
> > > > >     private void searchIndex(String terms) throws Exception {
> > > > >         String queryString = "";
> > > > >         PhraseQuery phrase = new PhraseQuery();
> > > > >         String[] termArray = terms.split(" ");
> > > > >         for (int i = 0; i < termArray.length; i++) {
> > > > >             System.out.println("adding " + termArray[i]);
> > > > >             //phrase.add(new Term("content", termArray[i]));
> > > > >             //queryString += termArray[i];
> > > > >         }
> > > > >
> > > > >         //phrase.add(new Term("content", "ubuntu"));
> > > > >         String tamilQuery = new String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
> > > > >         //tamilQuery = new String("ubuntu");
> > > > >         phrase.add(new Term("content", tamilQuery));
> > > > >         phrase.setSlop(1);
> > > > >         System.out.println("phrase query " + phrase.toString());
> > > > >
> > > > >         IndexSearcher searcher = new IndexSearcher(indexPath);
> > > > >         QueryParser queryParser = null;
> > > > >         try {
> > > > >             queryParser = new QueryParser("content", new SimpleAnalyzer());
> > > > >         } catch (Exception ex) {
> > > > >             ex.printStackTrace();
> > > > >         }
> > > > >
> > > > >         //Query query = queryParser.parse(queryString);
> > > > >
> > > > >         Hits hits = null;
> > > > >         try {
> > > > >             hits = searcher.search(phrase);
> > > > >         } catch (Exception ex) {
> > > > >             ex.printStackTrace();
> > > > >         }
> > > > >
> > > > >         // highlighter section
> > > > >         QueryScorer scorer = new QueryScorer(phrase);
> > > > >         Highlighter highlighter = new Highlighter(scorer);
> > > > >
> > > > >         for (int i = 0; i < hits.length(); i++) {
> > > > >             String content = hits.doc(i).get("content");
> > > > >             TokenStream stream = new SimpleAnalyzer().tokenStream("content",
> > > > >                     new StringReader(content));
> > > > >             String fragment = highlighter.getBestFragments(stream, content,
> > > > >                     5, "...");
> > > > >             System.out.println(fragment);
> > > > >         }
> > > > >
> > > > >         int hitCount = hits.length();
> > > > >         System.out.println("Results found :" + hitCount);
> > > > >
> > > > >         /*
> > > > >         for (int ix = 0; ix < hitCount; ix++) {
> > > > >             Document doc = hits.doc(ix);
> > > > >             System.out.println(doc.get("content"));
> > > > >         }
> > > > >         */
> > > > >     }
> > > > >
> > > > >     public static void main(String args[]) throws Exception {
> > > > >         LuceneSearcher searcher = new LuceneSearcher();
> > > > >         String termString = args[0];
> > > > >         System.out.println("searching for " + args[0]);
> > > > >         searcher.searchIndex(termString);
> > > > >     }
> > > > > }
> > > > > ----------------------code ends here---------------------------------
> > > > > NB: Please ignore basic coding conventions [indentation, comments,
> > > > > etc.]. You might find some unnecessary code intermixed with the
> > > > > highlighting code; ignore it.
> > > > >
> > > > > Now when I searched for some English docs I got the results with
> > > > > <B></B> tags surrounding the hits, like this:
> > > > >
> > > > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> > > > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security
> > > > > notices that affect the current supported releases of <B>Ubuntu</B>.
> > > > > These notices are also posted
> > > > >
> > > > > Now I thought of testing the same for Tamil texts. Before this, I
> > > > > would like to add one more piece of information: prior to adding
> > > > > the code for highlighting, I was able to search a Lucene index from
> > > > > the command line using the raw Unicode text, like this:
> > > > > [kk@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
> > > > >
> > > > > and it gives me the page that matches the above query. Now I tried
> > > > > to do the same along with highlighting. So in the code I posted
> > > > > above you can see that I commented out the English terms and added
> > > > > one Tamil Unicode query, and tried to see if it gives me the same
> > > > > result that I was getting prior to highlighting, and found that I'm
> > > > > not getting any results. This might be because the query I'm
> > > > > forming using these Unicode texts is wrong, or maybe something
> > > > > else. I'm not able to figure out what exactly is going wrong. Some
> > > > > silly mistake, I guess, but still I'm not able to find it. Can
> > > > > someone take the pain to go through the above code and find out
> > > > > what's wrong? Thank you very much.
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcmuir@gmail.com
> > > >
> > >
> >
>
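Robert's three-token split can be reproduced in plain Java: SimpleAnalyzer's
underlying LetterTokenizer keeps only characters for which
Character.isLetter() is true, and Tamil dependent vowel signs such as \u0BBF
and \u0BBE are combining marks (Unicode category Mn), for which it returns
false. The sketch below uses no Lucene classes; its letter-run split only
imitates LetterTokenizer's behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitDemo {
    // Imitate LetterTokenizer: emit maximal runs of isLetter() characters.
    static List<String> letterRuns(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {  // non-letter ends the run
                runs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) runs.add(current.toString());
        return runs;
    }

    public static void main(String[] args) {
        // Dependent vowel sign \u0BBF is category Mn, not a letter.
        System.out.println(Character.isLetter('\u0BBF')); // false
        // The word from the thread splits into the 3 fragments Robert listed.
        System.out.println(letterRuns("\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE"));
    }
}
```

A whitespace tokenizer avoids this because it splits only on whitespace,
leaving the combining marks attached to their base letters.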
