lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean" <spaceh...@foxmail.com>
Subject Re: Analyzer
Date Thu, 02 Dec 2010 10:11:03 GMT
By the way, is there an analyzer which splites each letter of a word?
e.g.
hello world => h/e/l/l/o/w/o/r/l/d

Regards,
Sean
 
 
------------------ Original ------------------
From:  "Erick Erickson"<erickerickson@gmail.com>;
Date:  Tue, Nov 30, 2010 09:07 PM
To:  "java-user"<java-user@lucene.apache.org>; 

Subject:  Re: Analyzer

 
WhitespaceAnalyzer does just that, splits the incoming stream on
white space.

From the javadocs for StandardAnalyzer:

A grammar-based tokenizer constructed with JFlex

This should be a good tokenizer for most European-language documents:

   - Splits words at punctuation characters, removing punctuation. However,
   a dot that's not followed by whitespace is considered part of a token.
   - Splits words at hyphens, unless there's a number in the token, in which
   case the whole token is interpreted as a product number and is not split.
   - Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not
suit your application, please consider copying this source code directory to
your project and maintaining your own grammar-based tokenizer.


Best

Erick

On Tue, Nov 30, 2010 at 12:06 AM, manjula wijewickrema
<manjula53@gmail.com>wrote:

> Hi Steve,
>
> Thanx a lot for your reply. Yes there are only two classes and it's corrcet
> that the way you have realized the problem. As you have instructed, I
> checked WhitespaceAnalyzer for querying (instead of StandardAnalyzer) and
> it
> seems to me that it gives better results rather than StandardAnalyzer. So
> could you please let me know what are the differences between
> StandardAnalyzer and WhitespaceAnalyzer. I highly appriciate your response.
> Thanx.
>
> Manjula.
>
>
> On Mon, Nov 29, 2010 at 7:32 PM, Steven A Rowe <sarowe@syr.edu> wrote:
>
> > Hi Manjula,
> >
> > It's not terribly clear what you're doing here - I got lost in your
> > description of your (two? or maybe four?) classes.  Sometimes things are
> > easier to understand if you provide more concrete detail.
> >
> > I suspect that you could benefit from reading the book Lucene in Action,
> > 2nd edition:
> >
> >   http://www.manning.com/hatcher3/
> >
> > You would also likely benefit from using Luke, the Lucene index browser,
> to
> > better understand your indexes' contents and debug how queries match
> > documents:
> >
> >   http://code.google.com/p/luke/
> >
> > I think your question is whether you're using Analyzers correctly.  It
> > sounds like you are creating two separate indexes (one for each of your
> > classes), and you're using SnowballAnalyzer on the indexing side for both
> > indexes, and StandardAnalyzer on the query side.
> >
> > The usual advice is to use the same Analyzer on both the query and the
> > index side.  But it appears to be the case that you are taking stemmed
> index
> > terms from your index #1 and then querying index #2 using these stemmed
> > terms.  If this is true, then you want the query-time analyzer in your
> > second index not to change the query terms.  You'll likely get better
> > results using WhitespaceAnalyzer, which tokenizes on whitespace and does
> no
> > further analysis, rather than StandardAnalyzer.
> >
> > Steve
> >
> > > -----Original Message-----
> > > From: manjula wijewickrema [mailto:manjula53@gmail.com]
> > > Sent: Monday, November 29, 2010 4:32 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Analyzer
> > >
> > > Hi,
> > >
> > > In my work, I am using Lucene and two java classes. In the first one, I
> > > index a document and in the second one, I try to search the most
> relevant
> > > document for the indexed document in the first one. In the first java
> > > class,
> > > I use the SnowballAnalyzer in the createIndex method and
> StandardAnalyzer
> > > in
> > > the searchIndex method and pass the highest frequency terms into the
> > > second
> > > Java class. In the second class, I use SnowballAnalyzer in the
> > createIndex
> > > method (this index is for the collection of documents to be searched,
> or
> > > it
> > > is my database) and StandardAnalyser in the searchIndex method (I pass
> > the
> > > highest frequently occuring term of the first class as the search term
> > > parameter to the searchIndex method of the second class). Using
> Analyzers
> > > in
> > > this manner, what I am willing is to do the stemming, stop-words in
> both
> > > indexes (in both classes) and to search those a few high frequency
> words
> > > (of
> > > the first index) in the second index. So, if my intention is clear to
> > you,
> > > could you please let me know whether it is correct or not the way I
> have
> > > used Analyzers? I highly appreciate any comment.
> > >
> > > Thanx.
> > > Manjula.
> >
>
Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message