lucene-java-user mailing list archives

From Simon McDuff <smcd...@hotmail.com>
Subject RE: Lucene 4.0 .FDT
Date Thu, 19 Jul 2012 15:36:37 GMT

Thank you for your answer.

In our case, after 983 seconds of processing, the sizes of these files are:
- *.fdt : 366 Megs
- *.fdx : 2898 Megs

It is wasteful to write more than 3 GB of data that is never used.
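
As a rough sanity check on these sizes, here is a minimal sketch that applies the per-document costs described in the quoted reply below (8 bytes in .fdx, at least 1 byte in .fdt) to the approximate insert rate mentioned earlier in the thread. The class and method names are hypothetical and not part of Lucene.

```java
// Back-of-the-envelope estimate of stored-fields overhead for documents
// that have no stored fields. The per-document costs are the ones given
// in the quoted reply; everything else here is illustrative.
class StoredFieldsOverhead {
    static final long FDX_BYTES_PER_DOC = 8;     // one long per document
    static final long FDT_MIN_BYTES_PER_DOC = 1; // one vint holding the field count

    static long fdxBytes(long docCount) {
        return docCount * FDX_BYTES_PER_DOC;
    }

    static long fdtBytes(long docCount) {
        return docCount * FDT_MIN_BYTES_PER_DOC;
    }

    public static void main(String[] args) {
        long docsPerSecond = 300_000L; // approximate rate from this thread
        long seconds = 983L;           // processing time from this message
        long docs = docsPerSecond * seconds;
        System.out.println("documents: " + docs);
        System.out.println(".fdx bytes: " + fdxBytes(docs));
        System.out.println(".fdt bytes: " + fdtBytes(docs));
    }
}
```

At 300 000 documents per second for 983 seconds this gives roughly 2.4 GB for .fdx and 0.3 GB for .fdt. The measured files are somewhat larger, which would be consistent with an actual rate above 300 000 documents per second, and their ratio is close to the expected 8:1.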

We have already modified Lucene40StoredFieldsWriter to work around the problem.

I hope we can do something about it.

We also modified some places where objects were created unnecessarily for each added
document, causing the garbage collector to kick in.
Instead, we reuse memory as much as we can (e.g. StoredFieldsConsumer.fieldInfos and
StoredFieldsConsumer.storedFields are re-created each time reset is called).
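
The reuse idea can be sketched as follows. This is only an illustration under assumptions: ReusableFieldBuffer is a hypothetical stand-in, not Lucene's actual class, and String stands in for whatever field type the real arrays hold. The point is that reset clears the existing array instead of allocating a new one for every document.

```java
import java.util.Arrays;

// Hypothetical stand-in for a per-document field buffer; not Lucene code.
class ReusableFieldBuffer {
    private String[] fields = new String[1];
    private int count;

    void add(String field) {
        if (count == fields.length) {
            // Grow only when the current array is full.
            fields = Arrays.copyOf(fields, count * 2);
        }
        fields[count++] = field;
    }

    // Keep the array across documents instead of allocating a new one.
    void reset() {
        // Null out the used slots so the old field objects can be collected.
        Arrays.fill(fields, 0, count, null);
        count = 0;
    }

    int size() {
        return count;
    }

    int capacity() {
        return fields.length;
    }

    public static void main(String[] args) {
        ReusableFieldBuffer buf = new ReusableFieldBuffer();
        for (int doc = 0; doc < 3; doc++) {
            buf.add("title");
            buf.add("body");
            buf.add("id");
            System.out.println("doc " + doc + ": size=" + buf.size()
                    + " capacity=" + buf.capacity());
            buf.reset();
        }
    }
}
```

After the first document the array has grown to its working size, so later documents cause no further allocation in this buffer.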

Simon


> Date: Thu, 19 Jul 2012 16:44:28 +0200
> From: ab@getopt.org
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 4.0 .FDT
> 
> On 19/07/2012 14:26, Simon McDuff wrote:
> >
> > I'm using Lucene 4.0.
> >
> > I'm inserting around 300 000 documents per second.
> >
> > We do not have any stored fields, but we noticed that .fdt gets populated anyway.
> >
> > .fdx contains useless information.
> > .fdt contains only zeros, which is useless.
> >
> > Is there a way to minimize the impact?
> 
> This happens because the Lucene40StoredFieldsWriter (part of the 
> Lucene40 Codec) uses a simplistic layout for the data - for every 
> document it writes a long to the .fdx file (8 bytes) to mark the 
> position of the fields' data, and a vint to the .fdt file (at least one 
> byte) to record the number of fields, and then the actual stored fields' 
> data.
> 
> We could modify this format to be less verbose for documents without 
> stored fields, e.g. use block-delta encoding of the .fdx file and avoid 
> writing anything to the .fdt file if there are no stored fields. The 
> question is whether the space savings would be worth the complication?
> 
> -- 
> Best regards,
> Andrzej Bialecki
> http://www.sigram.com, blog http://www.sigram.com/blog
>   ___.,___,___,___,_._. __________________<><____________________
> [___||.__|__/|__||\/|: Information Retrieval, System Integration
> ___|||__||..\|..||..|: Contact: info at sigram dot com
> 
> 
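
The delta idea suggested in the quoted reply can be illustrated with a simplified sketch. It is not the block-delta format itself and not Lucene code: DeltaOffsets is a hypothetical class that writes each offset as the difference from the previous one, using a 7-bit variable-length integer, and compares the result with fixed 8-byte entries. The offsets assume documents without stored fields, each taking 1 byte in .fdt, and ignore any file header.

```java
import java.io.ByteArrayOutputStream;

// Simplified delta + variable-length encoding of .fdx-style offsets.
// Hypothetical illustration only; not the Lucene file format.
class DeltaOffsets {
    // Write v using 7 bits per byte, low-order bits first; the high bit
    // of each byte signals that another byte follows.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Offsets must be non-decreasing so that every delta is >= 0.
    static byte[] encode(long[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long off : offsets) {
            writeVLong(out, off - prev);
            prev = off;
        }
        return out.toByteArray();
    }

    static long[] decode(byte[] data, int count) {
        long[] offsets = new long[count];
        int pos = 0;
        long prev = 0;
        for (int i = 0; i < count; i++) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            offsets[i] = prev;
        }
        return offsets;
    }

    public static void main(String[] args) {
        int docs = 1000;
        long[] offsets = new long[docs];
        for (int i = 0; i < docs; i++) {
            offsets[i] = i; // each empty document takes 1 byte in .fdt
        }
        byte[] encoded = encode(offsets);
        System.out.println("fixed longs: " + (8L * docs) + " bytes");
        System.out.println("delta vints: " + encoded.length + " bytes");
    }
}
```

A pure delta stream like this gives up the constant-time lookup that fixed 8-byte entries provide, since finding one offset means decoding everything before it. A block-based variant would presumably store a base offset per block to limit that cost, which is part of the complication the reply asks about.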