lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Should Token be immutable?
Date Tue, 07 Jan 2003 04:51:32 GMT
The person who submitted the Finnish Analyzer, a long time ago, had
mentioned this same thing: storing multiple variations of a word in
the same position.

Otis
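
To make the idea of several word forms sharing one position concrete, here is a minimal Java sketch. It does not use Lucene: PositionSketch, Tok, and positions are hypothetical names invented for illustration, and the convention that the first token lands at position 0 is an assumption of this sketch. It only shows how per-token position increments of 0, 1, and 2 turn into absolute positions, in the way item 13 of CHANGES (quoted further down) describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a token stream; this is NOT Lucene's Token class.
final class PositionSketch {

    static final class Tok {
        final String term;
        final int positionIncrement; // distance from the previous token's position

        Tok(String term, int positionIncrement) {
            if (positionIncrement < 0) {
                throw new IllegalArgumentException("positionIncrement must be >= 0");
            }
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Converts relative increments into absolute positions.
    // Assumption of this sketch: counting starts at -1, so a first token
    // with increment 1 lands at position 0.
    static List<Integer> positions(List<Tok> stream) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        for (Tok t : stream) {
            position += t.positionIncrement;
            out.add(position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> stream = Arrays.asList(
                new Tok("pipe", 1),
                new Tok("10cm", 1),
                new Tok("10", 0),     // increment 0: shares the position of "10cm"
                new Tok("cm", 1),
                new Tok("steel", 2)); // increment 2: leaves one empty position (a gap)
        List<Integer> pos = positions(stream);
        for (int i = 0; i < stream.size(); i++) {
            System.out.println(stream.get(i).term + " @ " + pos.get(i));
        }
    }
}
```

Here "10cm" and "10" end up at the same position, while "steel" sits two positions after "cm", so an exact phrase query for "cm steel" would not treat them as adjacent.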

--- Dmitry Serebrennikov <dmitrys@earthlink.net> wrote:
> Otis Gospodnetic wrote:
> 
> >Ah, sorry about bringing up performance, I mixed that with another
> >thread.
> >Anyhow, I still think that setPositionIncrement offers a nice feature
> >that some people may want to use.  It was on a to-do list for a while,
> >and it was there because people requested it, so even though Lucene
> >doesn't use setPositionIncrement internally, maybe Lucene-based apps
> >out there do.
> >
> Most likely it would be analyzers for additional languages that would
> make use of this. One example where I have considered using this
> feature was in a special-purpose analyzer that placed multiple forms
> of a token into the same position. For example, a given word "10cm"
> can be parsed into two: "10", "cm". This would allow a document to be
> found when the query includes "10 cm" or "10cm". I ended up doing just
> this, but I do not currently bother with positions, only because I do
> not run phrase queries. However, if phrase queries were needed, I think
> I would probably want to place them at the same position.
> 
> Another example where this could be useful would be with languages
> where a single word can be composed of many component words, such as
> German. Perhaps it can also be useful in oriental languages?
> 
> Dmitry.
> 
> >
> >Otis
> >
> >
> >--- stephane vaucher <vaucher@LUB.UMontreal.CA> wrote:
> >  
> >
> >>I'm not sure if I understand your question. I'm not trying to
> >>optimise anything. This thread was spawned because the usage of Token
> >>was unclear and inconsistent (I don't see the purpose of the
> >>package-scoped members). The result of this is that a few of us
> >>thought that an immutable Token might be clearer.
> >>
> >>The simplest change (I personally believe it's an essential change)
> >>is to make the members private.
> >>The second change for the object to be immutable would be to remove
> >>the positionIncrement, but since I'm no Lucene guru, I can't tell what
> >>is better (hence the email).
> >>
> >>I'll test the simple changes tonight to see if there is a sizable
> >>performance hit, and I'll wait to see if a guru speaks out about the
> >>controversial second change (which is also trivial).
> >>
> >>Stephane
> >>
> >>Otis Gospodnetic wrote:
> >>
> >>>It sounds to me that having the ability to do what point 13 in
> >>>CHANGES describes is more important than trying to only slightly
> >>>decrease the number of temporary objects instantiated.
> >>>
> >>>By the way, have you observed or measured the difference in
> >>>performance, memory consumption or anything else, before and after
> >>>your local changes?
> >>>Not having those and making Token immutable for performance reasons
> >>>would be wrong.
> >>>
> >>>Thanks,
> >>>Otis
> >>>
> >>>
> >>>--- stephane vaucher <vaucher@LUB.UMontreal.CA> wrote:
> >>>
> >>>>I've noticed that there is a method public void
> >>>>setPositionIncrement(int positionIncrement) that would probably have
> >>>>to disappear for Token to be immutable. The CHANGES.txt doc seems to
> >>>>mention some good reasons why it was added, but there is no code in
> >>>>CVS that seems to depend on it.
> >>>>
> >>>>From CHANGES:
> >>>>13. Added new method Token.setPositionIncrement().
> >>>>
> >>>>    This permits, for the purpose of phrase searching, placing
> >>>>    multiple terms in a single position.  This is useful with
> >>>>    stemmers that produce multiple possible stems for a word.
> >>>>
> >>>>    This also permits the introduction of gaps between terms, so that
> >>>>    terms which are adjacent in a token stream will not be matched by
> >>>>    an exact phrase query.  This makes it possible, e.g., to build
> >>>>    an analyzer where phrases are not matched over stop words which
> >>>>    have been removed.
> >>>>
> >>>>    Finally, repeating a token with an increment of zero can also be
> >>>>    used to boost scores of matches on that token.  (cutting)
> >>>>
> >>>>Any comments? With an immutable Token, does the positionIncrement
> >>>>still have a reason for being there? If not, then I'll remove
> >>>>getPositionIncrement as well.
> >>>>
> >>>>Stephane
> >>>>
> >>>>Doug Cutting wrote:
> >>>>
> >>>>>stephane vaucher wrote:
> >>>>>
> >>>>>>1) Does anyone mind? Will it break anything?
> >>>>>
> >>>>>It shouldn't break anything.
> >>>>>
> >>>>>>2) Are there unit tests for this? (particularly PorterStemFilter).
> >>>>>>The changes are obviously not spectacular, but I prefer not to
> >>>>>>screw everyone up...
> >>>>>
> >>>>>I don't know of any unit tests specifically for this.  Mostly this
> >>>>>change will affect compilation.  In general though, if you don't see
> >>>>>unit tests for things that you think you might break, then it never
> >>>>>hurts to write more unit tests.
> >>>>>
> >>>>>>3) I've checked out the latest version of Lucene, is there anything
> >>>>>>special I need to do if I get the go-ahead to check my stuff in
> >>>>>>(like a dev list review)?
> >>>>>
> >>>>>If you're not a regular committer then please send diffs to
> >>>>>lucene-dev before committing and give folks a few days to consider
> >>>>>the changes.
> >>>>>
> >>>>>Doug
> >>>>>
> >>>>>
> >>>>>-- 
> >>>>>To unsubscribe, e-mail:   
> >>>>><mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> >>>>>For additional commands, e-mail: 
> >>>>><mailto:lucene-dev-help@jakarta.apache.org>
> >>>>>
> >>>>>
> >>>>>          
> >>>>>
> >>>__________________________________________________
> >>>Do you Yahoo!?
> >>>Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> >>>http://mailplus.yahoo.com
> >>>
> >>
> >
> >
> 
> 
> 
> 
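
One way to reconcile the two views in this thread, sketched under assumptions: a token can be immutable and still carry a position increment if the value is fixed at construction and "changing" it returns a new object. The class below is hypothetical. ImmutableTokenSketch and withPositionIncrement are names invented here, not Lucene's, and nothing in the thread says this design was adopted.

```java
// Hypothetical immutable token; an illustration only, not Lucene's Token.
final class ImmutableTokenSketch {
    private final String termText;
    private final int startOffset;
    private final int endOffset;
    private final int positionIncrement;

    // Default increment of 1: the token follows the previous one directly.
    ImmutableTokenSketch(String termText, int startOffset, int endOffset) {
        this(termText, startOffset, endOffset, 1);
    }

    ImmutableTokenSketch(String termText, int startOffset, int endOffset,
                         int positionIncrement) {
        if (positionIncrement < 0) {
            throw new IllegalArgumentException("positionIncrement must be >= 0");
        }
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }

    String termText() { return termText; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
    int getPositionIncrement() { return positionIncrement; }

    // Replaces a setter: returns a copy with a different increment and
    // leaves this instance untouched.
    ImmutableTokenSketch withPositionIncrement(int newIncrement) {
        return new ImmutableTokenSketch(termText, startOffset, endOffset, newIncrement);
    }

    public static void main(String[] args) {
        ImmutableTokenSketch original = new ImmutableTokenSketch("10cm", 0, 4);
        ImmutableTokenSketch stacked = original.withPositionIncrement(0);
        System.out.println(original.termText() + " increment=" + original.getPositionIncrement());
        System.out.println(stacked.termText() + " increment=" + stacked.getPositionIncrement());
    }
}
```

The trade-off is one extra allocation whenever a filter wants a non-default increment, which is the kind of cost that would need measuring before deciding either way.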



