lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Taylor <>
Subject Re: Performance Results on changing the way fields are stored
Date Wed, 13 Jan 2010 19:30:32 GMT
Grant Ingersoll wrote:
> On Jan 5, 2010, at 7:44 AM, Paul Taylor wrote:
>> So currently in my index I index and store a number of small fields, I need both
so I can search on the fields, then I use the stored versions to generate the output document
(which is either an XML or JSON representation), because I read stored and index fields are
dealt with completely seperately I tried another tact only storing one field which was a serialized
version of the output documentation. This solves a couple of issues I was having but I was
disappointed that both the size of the index increased and the index build  time increased,
I thought that if all the stored data was held in one field that the resultant index would
be smaller, and I didn't expect index time to increase by as much as it did. I was also suprised
that Java serilaization was slower and used more space than both JSON and XML serialization.
>> Results as Follows
>> Type:                                                             Time : Index Size
>> Only indexed  no norms                                                          
         105   : 38 MB
>> Only indexed                                                                    
                111   : 43 MB
>> Same fields written as Indexed and Stored  (current Situation)           115   :
83 MB
>> Fields Indexed, One JAXB classed Stored using JSON Marshalling 140   : 115 MB
>> Fields Indexed, One JAXB classed Stored using XML Marshalling  189   : 198 MB
>> Fields Indexed, One JAXB classed Stored using Java Serialization   305   : 485 MB
> How much more verbose are these than the "raw" content?  Even as terse as JSON is, it
is still verbose compared to a binary format, and XML Marshalling and Java Serialization will
be even more.  Given that you are likely only displaying 10 or so at a time, I'd think it
would be much more efficient to only store the minimal amount needed to recreate the docs
in the current result set.  
Yes, in the end I came to the conclusion to just stick with current 
situation except for cases where i have sets of related fields that 
would otherwise nessecitate holding 'placeholder' fields, in which case 
I've used json
> I've also seen people have success simply storing a key in Lucene that is then used for
lookup in something like Memcachedb, Tokyo Cabinet or one of the many other key-value stores.
In my situation 90% of the fields stored are also required for 
searching, so they are held in the search index anyway so there is not 
much point moving the stored version into a memcahe

thanks Paul

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message