lucene-dev mailing list archives

From Wolfgang Hoschek <wolfgang.hosc...@mac.com>
Subject Re: [jira] Field constructor, avoiding String.intern()
Date Fri, 23 Feb 2007 23:04:51 GMT

On Feb 23, 2007, at 10:28 AM, James Kennedy wrote:

>
> True. However, in the case where you are processing Documents one at a
> time and discarding them (e.g. we use a HitCollector to process all
> documents from a search), or where memory is not an issue, it would be
> nice to have the ability to disable the interning for performance's sake.

I don't know how much it would increase overall throughput across a
variety of use cases, but one approach could be to add a "copy like
this" factory method such as Field.createField(Reader) to Field.java,
analogous to the method Term.createTerm(String text) that was added to
Term.java some time ago for a similar reason.

This would guarantee that the name continues to be interned, yet it
avoids the interning overhead in use cases where a field with the same
parametrization (but a different content String/Reader) is constructed
many times, which is probably the most common case where intern()
overhead might matter.

For example, something like

Field f1 = ...
Field f2 = f1.createSimilarField(reader);

modeled on the existing method in Term.java:

   /**
    * Optimized construction of new Terms by reusing same field as this Term
    * - avoids field.intern() overhead
    * @param text The text of the new term (field is implicitly same as this
    *        Term instance)
    * @return A new Term
    */
   public Term createTerm(String text)
   {
       return new Term(field,text,false);
   }
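
Transferred to Field, the same idea might look roughly like the sketch
below. SimpleField is a hypothetical stand-in, not the real
o.a.l.d.Field; its private constructor, the accessor names, and the
createSimilarField name are assumptions made only for illustration.

```java
import java.io.Reader;

/** Hypothetical stand-in for Field; NOT the real Lucene class. */
final class SimpleField {
    private final String name;   // invariant: always an interned String
    private final Reader reader; // the field content

    /** Public constructor: always interns the name. */
    public SimpleField(String name, Reader reader) {
        this(name, reader, true);
    }

    /** Private constructor: only trusted callers may skip intern(). */
    private SimpleField(String name, Reader reader, boolean intern) {
        this.name = intern ? name.intern() : name;
        this.reader = reader;
    }

    /**
     * Creates a field with the same (already interned) name as this
     * field but different content, without calling intern() again.
     */
    public SimpleField createSimilarField(Reader newReader) {
        return new SimpleField(name, newReader, false);
    }

    public String name() { return name; }

    public Reader reader() { return reader; }
}
```

Because the non-interning constructor is private, outside code cannot
create a field whose name is not interned, so the invariant holds.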

Wolfgang.

>
>
>
>
> Robert Engels wrote:
>>
>> I don't think it is just the performance gain of equals() where
>> intern() matters.
>>
>> It also reduces memory consumption dramatically when working with
>> large collections of documents in memory - although this could also
>> be done with constants, there is nothing in Java to enforce it (thus
>> the use of intern()).
>>
>>
>> On Feb 23, 2007, at 12:02 PM, James Kennedy wrote:
>>
>>>
>>> In our case, we're trying to optimize document() retrieval, and we
>>> found that disabling the String interning in the Field constructor
>>> improved performance dramatically. I agree that interning should be an
>>> option on the constructor. For document retrieval, at least for a small
>>> number of fields, the performance gain of using equals() on interned
>>> strings is no match for the performance loss of interning the field
>>> name of each field.
>>>
>>>
>>>
>>> Wolfgang Hoschek-2 wrote:
>>>>
>>>> I noticed that, too, but in my case the difference was often much
>>>> more extreme: it was one of the primary bottlenecks on indexing. This
>>>> is the primary reason why MemoryIndex.addField(...) navigates around
>>>> the problem by taking a parameter of type "String fieldName" instead
>>>> of type "Field":
>>>>
>>>> 	public void addField(String fieldName, TokenStream stream) {
>>>> 		/*
>>>> 		 * Note that this method signature avoids having a user call new
>>>> 		 * o.a.l.d.Field(...) which would be much too expensive due to the
>>>> 		 * String.intern() usage of that class.
>>>> 		 */
>>>>
>>>> Wolfgang.
>>>>
>>>> On Feb 14, 2006, at 1:42 PM, Tatu Saloranta wrote:
>>>>
>>>>> After profiling in-memory indexing, I noticed that
>>>>> calls to String.intern() showed up surprisingly high,
>>>>> especially the one from the Field() constructor. This is
>>>>> understandable given the overhead String.intern() has
>>>>> (it is a native and synchronized method, and the overhead
>>>>> is incurred even if the String is already interned), and the
>>>>> fact that this essentially gets called once per
>>>>> document+field combination.
>>>>>
>>>>> Now, it would be quite easy to improve things a bit
>>>>> (in theory), such that most intern() calls could be
>>>>> avoided, transparently to the calling app; for example,
>>>>> for each IndexWriter one could use a simple
>>>>> HashMap for caching interned Strings. This approach
>>>>> is more than twice as fast as directly calling
>>>>> intern(). One could also use a per-thread cache, or a
>>>>> global one; all of which would probably be faster.
>>>>> However, the Field constructor hard-codes the call to
>>>>> intern(), so it would be necessary to add a new
>>>>> constructor that indicates that the field name is known to
>>>>> be interned.
>>>>> And there would also need to be a way to invoke the
>>>>> new optional functionality.
>>>>>
>>>>> Has anyone tried this approach to see if the speedup is
>>>>> worth the hassle (in my case it'd probably be
>>>>> something like 2 - 3%, assuming the profiler's 5% for
>>>>> intern() is accurate)?
>>>>>
>>>>> -+ Tatu +-
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>
>>>
>>> -- 
>>> View this message in context: http://www.nabble.com/Field-
>>> constructor%2C-avoiding-String.intern%28%29-tf1123597.html#a9123600
>>> Sent from the Lucene - Java Developer mailing list archive at
>>> Nabble.com.
>>>
>>>
>>
>
>
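
As a rough illustration of the HashMap cache that Tatu suggests in the
quoted message above, a minimal sketch could look like the following.
InternCache is a hypothetical helper, not part of Lucene, and it
assumes one instance per writer or per thread, since a plain HashMap is
not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; NOT part of Lucene. Not thread-safe. */
final class InternCache {
    // Maps each interned String to itself, so that a lookup with an
    // equal but distinct String returns the canonical instance.
    private final Map<String, String> cache = new HashMap<>();

    /**
     * Returns the interned instance equal to s, calling
     * String.intern() only on a cache miss.
     */
    public String intern(String s) {
        String interned = cache.get(s);
        if (interned == null) {
            interned = s.intern();
            cache.put(interned, interned);
        }
        return interned;
    }

    /** Number of distinct names cached so far. */
    public int size() {
        return cache.size();
    }
}
```

The cache grows with the number of distinct field names, which is
usually small. Whether it actually beats a direct intern() call depends
on the JVM and would need to be measured.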



