lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Performance Results on changing the way fields are stored
Date Wed, 06 Jan 2010 21:49:00 GMT

On Jan 5, 2010, at 7:44 AM, Paul Taylor wrote:

> So currently in my index I index and store a number of small fields, I need both so I
can search on the fields, then I use the stored versions to generate the output document (which
is either an XML or JSON representation), because I read stored and index fields are dealt
with completely seperately I tried another tact only storing one field which was a serialized
version of the output documentation. This solves a couple of issues I was having but I was
disappointed that both the size of the index increased and the index build  time increased,
I thought that if all the stored data was held in one field that the resultant index would
be smaller, and I didn't expect index time to increase by as much as it did. I was also suprised
that Java serilaization was slower and used more space than both JSON and XML serialization.
> Results as Follows
> Type:                                                             Time : Index Size
> Only indexed  no norms                                                              
     105   : 38 MB
> Only indexed                                                                        
            111   : 43 MB
> Same fields written as Indexed and Stored  (current Situation)           115   : 83 MB
> Fields Indexed, One JAXB classed Stored using JSON Marshalling 140   : 115 MB
> Fields Indexed, One JAXB classed Stored using XML Marshalling  189   : 198 MB
> Fields Indexed, One JAXB classed Stored using Java Serialization   305   : 485 MB

How much more verbose are these than the "raw" content?  Even as terse as JSON is, it is still
verbose compared to a binary format, and XML Marshalling and Java Serialization will be even
more.  Given that you are likely only displaying 10 or so at a time, I'd think it would be
much more efficient to only store the minimal amount needed to recreate the docs in the current
result set.  

I've also seen people have success simply storing a key in Lucene that is then used for lookup
in something like Memcachedb, Tokyo Cabinet or one of the many other key-value stores.


Grant Ingersoll

Search the Lucene ecosystem using Solr/Lucene:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message