lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: Should Token be immutable?
Date Tue, 07 Jan 2003 03:53:49 GMT
Otis Gospodnetic wrote:

>Ah, sorry about bringing up performance, I mixed that up with another
>thread.
>Anyhow, I still think that setPositionIncrement offers a nice feature
>that some people may want to use.  It was on the to-do list for a
>while, and it was there because people requested it, so even though
>Lucene doesn't use it internally, maybe Lucene-based apps out there do.
>
Most likely it would be analyzers for additional languages that would
make use of this. One case where I have considered using this feature
was a special-purpose analyzer that places multiple forms of a token
into the same position. For example, the word "10cm" can be split into
two tokens, "10" and "cm". This allows a document to be found when the
query includes either "10 cm" or "10cm". I ended up doing just this, but
I do not currently bother with positions, only because I do not run
phrase queries. If phrase queries were needed, I think I would
probably want to place the forms at the same position.
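
To make the idea concrete, here is a minimal sketch of how the increments
could be assigned. It does not use Lucene's own classes: Tok, expand() and
positions() are hypothetical stand-ins I made up for illustration, and only
the meaning of the increment (0 = same position as the previous token,
1 = next position) is taken from the CHANGES entry quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SamePositionSketch {

    /** Hypothetical stand-in for a token; NOT Lucene's Token class. */
    static final class Tok {
        final String text;
        final int positionIncrement; // 0 = same position as previous token

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // A number immediately followed by a unit, e.g. "10cm".
    private static final Pattern NUMBER_UNIT =
            Pattern.compile("(\\d+)([A-Za-z]+)");

    /** Emits each word, plus its parts at the same position when it splits. */
    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<Tok>();
        for (String word : words) {
            out.add(new Tok(word, 1)); // original form advances the position
            Matcher m = NUMBER_UNIT.matcher(word);
            if (m.matches()) {
                out.add(new Tok(m.group(1), 0)); // "10" stacked on "10cm"
                out.add(new Tok(m.group(2), 0)); // "cm" stacked on "10cm"
            }
        }
        return out;
    }

    /** Turns increments into absolute positions, starting at 0. */
    static List<Integer> positions(List<Tok> toks) {
        List<Integer> result = new ArrayList<Integer>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> toks = expand(Arrays.asList("pipe", "10cm", "long"));
        List<Integer> pos = positions(toks);
        // Prints pipe@0, 10cm@1, 10@1, cm@1, long@2 (one per line).
        for (int i = 0; i < toks.size(); i++) {
            System.out.println(toks.get(i).text + "@" + pos.get(i));
        }
    }
}
```

Stacking all three forms keeps the words after "10cm" at the positions they
would have had anyway. Whether "cm" should instead get an increment of 1, so
that it sits one position after "10", depends on how phrase queries are
expected to behave, and I have not tried either layout.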

Another example where this could be useful would be with languages where 
a single word can be composed of many component words - such as German. 
Perhaps it can also be useful in oriental languages?

Dmitry.

>
>Otis
>
>
>--- stephane vaucher <vaucher@LUB.UMontreal.CA> wrote:
>  
>
>>I'm not sure I understand your question. I'm not trying to optimise
>>anything. This thread was spawned because the usage of Token was
>>unclear and inconsistent (I don't see the purpose of the
>>package-scoped members). As a result, a few of us thought that an
>>immutable Token might be clearer.
>>
>>The simplest change (I personally believe it's an essential change)
>>is to make the members private.
>>The second change needed for the object to be immutable would be to
>>remove the positionIncrement, but since I'm no Lucene guru, I can't
>>tell which is better (hence the email).
>>
>>I'll test the simple changes tonight to see if there is a sizable
>>performance hit, and I'll wait to see if a guru speaks out about the
>>controversial second change (which is also trivial).
>>
>>Stephane
>>
>>Otis Gospodnetic wrote:
>>
>>    
>>
>>>It sounds to me that having the ability to do what point 13 in
>>>CHANGES describes is more important than trying to only slightly
>>>decrease the number of temporary objects instantiated.
>>>
>>>By the way, have you observed or measured the difference in
>>>performance, memory consumption or anything else, before and after
>>>your local changes?
>>>Without such measurements, making Token immutable for performance
>>>reasons would be wrong.
>>>
>>>Thanks,
>>>Otis
>>>
>>>
>>>--- stephane vaucher <vaucher@LUB.UMontreal.CA> wrote:
>>>
>>>      
>>>
>>>>I've noticed that there is a method public void
>>>>setPositionIncrement(int positionIncrement) that would probably
>>>>have to disappear for Token to be immutable. The CHANGES.txt doc
>>>>seems to mention some good reasons why it was added, but there is
>>>>no code in CVS that seems to depend on it.
>>>>
>>>>From CHANGES:
>>>>13. Added new method Token.setPositionIncrement().
>>>>
>>>>    This permits, for the purpose of phrase searching, placing
>>>>    multiple terms in a single position.  This is useful with
>>>>    stemmers that produce multiple possible stems for a word.
>>>>
>>>>    This also permits the introduction of gaps between terms, so
>>>>    that terms which are adjacent in a token stream will not be
>>>>    matched by an exact phrase query.  This makes it possible,
>>>>    e.g., to build an analyzer where phrases are not matched over
>>>>    stop words which have been removed.
>>>>
>>>>    Finally, repeating a token with an increment of zero can also
>>>>    be used to boost scores of matches on that token.  (cutting)
>>>>
>>>>Any comments? With an immutable Token, does the positionIncrement
>>>>still have a reason for being there? If not, then I'll remove
>>>>getPositionIncrement as well.
>>>>
>>>>Stephane
>>>>
>>>>Doug Cutting wrote:
>>>>
>>>>        
>>>>
>>>>>stephane vaucher wrote:
>>>>>
>>>>>>1) Does anyone mind? Will it break anything?
>>>>>
>>>>>It shouldn't break anything.
>>>>>
>>>>>>2) Are there unit tests for this? (particularly
>>>>>>PorterStemFilter). The changes are obviously not spectacular,
>>>>>>but I prefer not to screw everyone up...
>>>>>
>>>>>I don't know of any unit tests specifically for this.  Mostly this
>>>>>change will affect compilation.  In general though, if you don't
>>>>>see unit tests for things that you think you might break, then it
>>>>>never hurts to write more unit tests.
>>>>>
>>>>>>3) I've checked out the latest version of lucene, is there
>>>>>>anything special I need to do if I get the go ahead to check my
>>>>>>stuff in (like a dev list review)?
>>>>>
>>>>>If you're not a regular committer then please send diffs to
>>>>>lucene-dev before committing and give folks a few days to consider
>>>>>the changes.
>>>>>
>>>>>Doug
>>>>>
>>>>>
>>>>>-- 
>>>>>To unsubscribe, e-mail:   
>>>>><mailto:lucene-dev-unsubscribe@jakarta.apache.org>
>>>>>For additional commands, e-mail: 
>>>>><mailto:lucene-dev-help@jakarta.apache.org>
>>>>>
>>>>>
>>>>>          
>>>>>
>>>__________________________________________________
>>>Do you Yahoo!?
>>>Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
>>>http://mailplus.yahoo.com
>>>
>
>




