lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shakti_Sareen" <Shakti_Sar...@satyam.com>
Subject help required
Date Fri, 23 Nov 2007 11:39:27 GMT
Hi 

Can anyone help me with a code having a class which extends the
StandardAnalyzer and that analyzer should not tokenize the word across
hyphen

 
SHAKTI SAREEN

-----Original Message-----
From: Shai Erera [mailto:serera@gmail.com] 
Sent: Thursday, November 22, 2007 9:53 PM
To: java-user@lucene.apache.org
Subject: Re: help required urgent!!!!!!!!!!!

The thing is - StandardAnalyzer breaks on hyphen. You'll need to work
around
this by either extend StandardAnalyzer

>From StandardTokenizer's documentation (which is used by
StandardAnalyzer):
*   <li> *Splits words at hyphens, unless there's a number in the token,
in
which case
 *     the whole token is interpreted as a product number and is not
split.*

I've investigated StandardAnalyzer's tokenization and it doesn't look
simple
to disable that behavior. What you can do is extend StandardAnalyzer and
override its tokenStream method to create a TokenStream of your own. If
you
know your text is space separated, you can use StringTokenizer to split
the
text on spaces. If a token contains '-', don't break it, otherwise pass
it
forward the the TokenStream returned by StandardAnalyzer.

Maybe someone else has a better answer, but if you insist on using
StandardAnalyzer, I have a feeling it will be problematic.

On Nov 22, 2007 6:02 PM, Shakti_Sareen < Shakti_Sareen@satyam.com>
wrote:

> Hi
>
> But the file I am indexing is very big and I don't know which word
will
> contain the hyphen. The thing you suggest can be implemented only if
> there are some specific words in the file.
>
> Apart from StandardAnalyzer I have got no option.
>
> Thanks a lot for your reply.
>
> Please suggest me how can I go ahead.
>
>
> SHAKTI SAREEN
> GE-GDC
> STC HYDERABAD
> 9948777794
>
> -----Original Message-----
> From: Shai Erera [mailto:serera@gmail.com]
> Sent: Thursday, November 22, 2007 9:25 PM
> To: java-user@lucene.apache.org
> Subject: Re: help required urgent!!!!!!!!!!!
>
> Hi
>
> You can simply create a PrefixQuery. However, if you're using
> StandardAnalyzer, and the word is added as Index.TOKENIZED,
> sotf-wa<something> will be broken to 'soft' and 'wa<something>'.
> Therefore
> you'll need to add the word as Index.UN_TOKENIZED, or use a different
> Analyzer when you index the data (for this field at least).
>
> Here's a sample code:
>
>        // Indexing.
>        Document doc = new Document();
>        doc.add(new Field("field", "soft-wash", Store.NO,
> Index.UN_TOKENIZED
> ));
>
>        // Search
>        Query q = new PrefixQuery(new Term("field", "soft-wa"));
>
> Does that help?
>
> On Nov 22, 2007 5:46 PM, Shakti_Sareen < Shakti_Sareen@satyam.com>
wrote:
>
> > Hi
> > I am using StandardAnalyser() to index the data.
> > But I want to do a like search on a word containing Hyphen
> > For example it want to search a word "soft-wa*"
> >
> > I am getting no hits for that. It is said that if the hyphen is
there
> in
> > the word, then we should include that word in the double quotes (").
> But
> > enclosing the word in a double quotes (") means the exact word
search.
> >
> > How can I perform the like search on a word containing hyphen???????
> >
> > Please help.
> >
> > Regards,
> > Shakti Sareen
> >
> >
> >
> >
> >
> > DISCLAIMER:
> > This email (including any attachments) is intended for the sole use
of
> the
> > intended recipient/s and may contain material that is CONFIDENTIAL
AND
> > PRIVATE COMPANY INFORMATION. Any review or reliance by others or
> copying or
> > distribution or forwarding of any or all of the contents in this
> message is
> > STRICTLY PROHIBITED. If you are not the intended recipient, please
> contact
> > the sender by email and delete all copies; your cooperation in this
> regard
> > is appreciated.
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> --
> Regards,
>
> Shai Erera
>
>
> DISCLAIMER:
> This email (including any attachments) is intended for the sole use of
the
> intended recipient/s and may contain material that is CONFIDENTIAL AND
> PRIVATE COMPANY INFORMATION. Any review or reliance by others or
copying or
> distribution or forwarding of any or all of the contents in this
message is
> STRICTLY PROHIBITED. If you are not the intended recipient, please
contact
> the sender by email and delete all copies; your cooperation in this
regard
> is appreciated.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



-- 
Regards,

Shai Erera


DISCLAIMER:
This email (including any attachments) is intended for the sole use of the intended recipient/s
and may contain material that is CONFIDENTIAL AND PRIVATE COMPANY INFORMATION. Any review
or reliance by others or copying or distribution or forwarding of any or all of the contents
in this message is STRICTLY PROHIBITED. If you are not the intended recipient, please contact
the sender by email and delete all copies; your cooperation in this regard is appreciated.


Mime
View raw message