lucene-dev mailing list archives

From "Toru Matsuzawa (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-902) Check on PositionIncrement with StopFilter.
Date Fri, 08 Jun 2007 09:16:26 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502731 ]

Toru Matsuzawa edited comment on LUCENE-902 at 6/8/07 2:14 AM:
---------------------------------------------------------------

Hi Hoss,
Thank you for your comments.

> 1) in future patches, could you please use 2 spaces instead of tabs?

Understood. I will use 2 spaces in future patches.

> 2) am i understanding correctly that the primary use case you are trying to address is
>  stop word removal when the stop word has synonyms with a position increment of 0 
> (the expectation being that the synonyms also be removed) ?

Your understanding is correct.
However, a synonym itself might be a stop word. 
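
To illustrate this case, here is a rough sketch only (it is not the code in the attached patches). The names StopGroupSketch, Tok and filter are hypothetical, and no Lucene classes are used. The sketch assumes that tokens with a position increment of 0 share the position of the preceding token, and that the whole group should be dropped when any member of the group is a stop word. It leaves the increments of the surviving tokens untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopGroupSketch {

    /** Hypothetical stand-in for a Lucene Token: term text plus position increment. */
    public static final class Tok {
        public final String text;
        public final int posInc;

        public Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return text + "(+" + posInc + ")";
        }
    }

    /**
     * Groups tokens by position (a token with posInc 0 joins the current group)
     * and drops a whole group if any of its members is a stop word.
     */
    public static List<Tok> filter(List<Tok> in, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        List<Tok> group = new ArrayList<>();
        for (Tok t : in) {
            // A positive increment starts a new position, so the buffered group is complete.
            if (t.posInc > 0 && !group.isEmpty()) {
                flush(group, stopWords, out);
            }
            group.add(t);
        }
        flush(group, stopWords, out);
        return out;
    }

    /** Emits the buffered group unless it contains a stop word, then clears the buffer. */
    private static void flush(List<Tok> group, Set<String> stopWords, List<Tok> out) {
        boolean hasStopWord = false;
        for (Tok t : group) {
            if (stopWords.contains(t.text)) {
                hasStopWord = true;
            }
        }
        if (!hasStopWord) {
            out.addAll(group);
        }
        group.clear();
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the"));
        List<Tok> in = Arrays.asList(
            new Tok("the", 1),   // stop word
            new Tok("that", 0),  // synonym stacked on the same position
            new Tok("fox", 1));
        System.out.println(filter(in, stops)); // prints [fox(+1)]
    }
}
```

Because the group is buffered until the next positive increment, the same logic also covers the case where the stop word is the stacked synonym and not the first token of the group.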

>  ... if so, wouldn't the most efficient thing be to do stop word removal before doing
> synonym expansion? (it means having a bigger stop word list - with all the synonyms -
> but that still seems better to me) ... are there other use cases i'm not understanding?
...
>  i freely admit i don't understand the "Japanese morphological analysis" comment.

It is not realistic to maintain a stop word list that contains all the synonyms,
because building that list would require understanding every dictionary the morphological engine uses.
(The engine relies on those dictionaries to analyze text.)

> 3) i only glanced over the specifics of removeStopwordCollocatesNext() .. 
> but would promoting BufferedTokenStream from Solr simplify the code
>  (it seems to all be about buffering tokens) ...
http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/BufferedTokenStream.java?view=markup

I think the code would become more concise if BufferedTokenStream could be used.

> 4) it would be useful if the test case could clarify not only the expected tokens text
> concatenated together, but also what the expected positions of position increments are
> for the tokens... i was certainly confused by the title of this issue.

I agree with you. It would be better for the test to compare the output against the expected tokens.
I'm sorry for confusing you with my poor English.
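
For point 4, a test could render each token together with its position and compare the result with an expected list. The following is only a sketch under my own assumptions: PositionAssertSketch and describe are hypothetical names, no Lucene classes are used, and the running position starts at -1 so that a first token with an increment of 1 lands on position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionAssertSketch {

    /**
     * Renders each token as "text@position(+increment)", where position is the
     * running sum of the position increments seen so far.
     */
    public static List<String> describe(String[] texts, int[] posIncs) {
        List<String> out = new ArrayList<>();
        int pos = -1; // assumed convention: first increment of 1 gives position 0
        for (int i = 0; i < texts.length; i++) {
            pos += posIncs[i];
            out.add(texts[i] + "@" + pos + "(+" + posIncs[i] + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // "fast" is stacked on "quick" with an increment of 0.
        List<String> actual = describe(
            new String[] {"quick", "fast", "fox"},
            new int[] {1, 0, 1});
        List<String> expected = Arrays.asList("quick@0(+1)", "fast@0(+0)", "fox@1(+1)");
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
        System.out.println(actual);
    }
}
```

Comparing these strings makes a failure show both the token text and the position at which it was found, which the concatenated text alone does not.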





> Check on PositionIncrement  with StopFilter.
> --------------------------------------------
>
>                 Key: LUCENE-902
>                 URL: https://issues.apache.org/jira/browse/LUCENE-902
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Toru Matsuzawa
>         Attachments: stopfilter.patch, stopfilter20070604.patch, stopfilter20070605.patch, stopfilter20070608.patch
>
>
> StopFilter does not take into account the PositionIncrement set by the Tokenizer.
> When a stop word Token has a PositionIncrement of 1, StopFilter deletes it. However, a Token that follows it with a PositionIncrement of 0 is not deleted.
> I think it should be deleted as well, because a Token with a PositionIncrement of 0 is regarded as the same Token as the one before it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



