lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: help required
Date Fri, 23 Nov 2007 13:28:27 GMT
Copy the current Standard Analyzer and add '-' to the definition of a 
LETTER. You might work with the StandardAnazlyer on the trunk which uses 
JFlex rather than the current JavaCC flavor...the new one is something 
like 6-10 times faster.

- Mark

Shakti_Sareen wrote:
> Hi 
>
> Can anyone help me with a code having a class which extends the
> StandardAnalyzer and that analyzer should not tokenize the word across
> hyphen
>
>  
> SHAKTI SAREEN
>
> -----Original Message-----
> From: Shai Erera [mailto:serera@gmail.com] 
> Sent: Thursday, November 22, 2007 9:53 PM
> To: java-user@lucene.apache.org
> Subject: Re: help required urgent!!!!!!!!!!!
>
> The thing is - StandardAnalyzer breaks on hyphen. You'll need to work
> around
> this by either extend StandardAnalyzer
>
> >From StandardTokenizer's documentation (which is used by
> StandardAnalyzer):
> *   <li> *Splits words at hyphens, unless there's a number in the token,
> in
> which case
>  *     the whole token is interpreted as a product number and is not
> split.*
>
> I've investigated StandardAnalyzer's tokenization and it doesn't look
> simple
> to disable that behavior. What you can do is extend StandardAnalyzer and
> override its tokenStream method to create a TokenStream of your own. If
> you
> know your text is space separated, you can use StringTokenizer to split
> the
> text on spaces. If a token contains '-', don't break it, otherwise pass
> it
> forward the the TokenStream returned by StandardAnalyzer.
>
> Maybe someone else has a better answer, but if you insist on using
> StandardAnalyzer, I have a feeling it will be problematic.
>
> On Nov 22, 2007 6:02 PM, Shakti_Sareen < Shakti_Sareen@satyam.com>
> wrote:
>
>   
>> Hi
>>
>> But the file I am indexing is very big and I don't know which word
>>     
> will
>   
>> contain the hyphen. The thing you suggest can be implemented only if
>> there are some specific words in the file.
>>
>> Apart from StandardAnalyzer I have got no option.
>>
>> Thanks a lot for your reply.
>>
>> Please suggest me how can I go ahead.
>>
>>
>> SHAKTI SAREEN
>> GE-GDC
>> STC HYDERABAD
>> 9948777794
>>
>> -----Original Message-----
>> From: Shai Erera [mailto:serera@gmail.com]
>> Sent: Thursday, November 22, 2007 9:25 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: help required urgent!!!!!!!!!!!
>>
>> Hi
>>
>> You can simply create a PrefixQuery. However, if you're using
>> StandardAnalyzer, and the word is added as Index.TOKENIZED,
>> sotf-wa<something> will be broken to 'soft' and 'wa<something>'.
>> Therefore
>> you'll need to add the word as Index.UN_TOKENIZED, or use a different
>> Analyzer when you index the data (for this field at least).
>>
>> Here's a sample code:
>>
>>        // Indexing.
>>        Document doc = new Document();
>>        doc.add(new Field("field", "soft-wash", Store.NO,
>> Index.UN_TOKENIZED
>> ));
>>
>>        // Search
>>        Query q = new PrefixQuery(new Term("field", "soft-wa"));
>>
>> Does that help?
>>
>> On Nov 22, 2007 5:46 PM, Shakti_Sareen < Shakti_Sareen@satyam.com>
>>     
> wrote:
>   
>>> Hi
>>> I am using StandardAnalyser() to index the data.
>>> But I want to do a like search on a word containing Hyphen
>>> For example it want to search a word "soft-wa*"
>>>
>>> I am getting no hits for that. It is said that if the hyphen is
>>>       
> there
>   
>> in
>>     
>>> the word, then we should include that word in the double quotes (").
>>>       
>> But
>>     
>>> enclosing the word in a double quotes (") means the exact word
>>>       
> search.
>   
>>> How can I perform the like search on a word containing hyphen???????
>>>
>>> Please help.
>>>
>>> Regards,
>>> Shakti Sareen
>>>
>>>
>>>
>>>
>>>
>>> DISCLAIMER:
>>> This email (including any attachments) is intended for the sole use
>>>       
> of
>   
>> the
>>     
>>> intended recipient/s and may contain material that is CONFIDENTIAL
>>>       
> AND
>   
>>> PRIVATE COMPANY INFORMATION. Any review or reliance by others or
>>>       
>> copying or
>>     
>>> distribution or forwarding of any or all of the contents in this
>>>       
>> message is
>>     
>>> STRICTLY PROHIBITED. If you are not the intended recipient, please
>>>       
>> contact
>>     
>>> the sender by email and delete all copies; your cooperation in this
>>>       
>> regard
>>     
>>> is appreciated.
>>>
>>>
>>>       
> ---------------------------------------------------------------------
>   
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>       
>> --
>> Regards,
>>
>> Shai Erera
>>
>>
>> DISCLAIMER:
>> This email (including any attachments) is intended for the sole use of
>>     
> the
>   
>> intended recipient/s and may contain material that is CONFIDENTIAL AND
>> PRIVATE COMPANY INFORMATION. Any review or reliance by others or
>>     
> copying or
>   
>> distribution or forwarding of any or all of the contents in this
>>     
> message is
>   
>> STRICTLY PROHIBITED. If you are not the intended recipient, please
>>     
> contact
>   
>> the sender by email and delete all copies; your cooperation in this
>>     
> regard
>   
>> is appreciated.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>     
>
>
>
>   
> ------------------------------------------------------------------------
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message