lucene-dev mailing list archives

From Chuck Williams <ch...@manawiz.com>
Subject Term pollution from binary data
Date Wed, 07 Nov 2007 00:56:40 GMT
Hi All,

We are experiencing OOMs when binary data contained in text files 
(e.g., a base64 section of a text file) is indexed.  We have extensive 
file-type recognition, but have encountered binary sections inside 
otherwise normal text files.

We are using the default value of 128 for termIndexInterval.  The 
problem arises because binary data generates a large set of random 
tokens, so roughly totalTerms/termIndexInterval index terms end up held 
in memory.  Increasing -Xmx is not viable, as it is already maxed.
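
For concreteness, the knob in question can only be applied at write 
time.  A minimal sketch against the Lucene 2.x IndexWriter API (the 
path and the value 1024 are illustrative, not recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // A larger interval means fewer index terms held in RAM per
    // reader, at the cost of slower term lookups.  It must be set
    // before documents are written, since it is recorded in the
    // segments themselves.
    IndexWriter writer = new IndexWriter("/path/to/index",
        new StandardAnalyzer(), true);
    writer.setTermIndexInterval(1024);   // default is 128
    // ... addDocument() calls ...
    writer.close();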

Does anybody know of a better solution to this problem than writing some 
kind of binary section recognizer/filter?
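
In case it sharpens the question, here is a rough sketch of what such 
a filter might look like against the Lucene 2.x analysis API.  The 
class name, the length cap, and the case/digit-transition heuristic 
are all made up and would need tuning against real data:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /**
     * Drops tokens that look like base64/binary noise rather than
     * natural-language terms.  Heuristics and thresholds are guesses.
     */
    public class BinaryNoiseFilter extends TokenFilter {
      private static final int MAX_TERM_LENGTH = 30;

      public BinaryNoiseFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
          if (!looksLikeNoise(t.termText())) {
            return t;           // pass ordinary terms through
          }
        }
        return null;            // stream exhausted
      }

      // Very long tokens, or tokens with many case/digit transitions
      // (typical of base64, rare in prose), are treated as noise.
      private boolean looksLikeNoise(String term) {
        if (term.length() > MAX_TERM_LENGTH) {
          return true;
        }
        int transitions = 0;
        for (int i = 1; i < term.length(); i++) {
          char prev = term.charAt(i - 1);
          char cur = term.charAt(i);
          if (Character.isDigit(cur) != Character.isDigit(prev)
              || Character.isUpperCase(cur) != Character.isUpperCase(prev)) {
            transitions++;
          }
        }
        return term.length() > 10 && transitions > term.length() / 2;
      }
    }

It would be chained into the analyzer after the tokenizer, wrapped 
around the tokenizer's output like any other TokenFilter.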

It appears that termIndexInterval is factored into the stored index and 
thus cannot be changed dynamically to work around the problem after an 
index has become polluted.  Other than identifying the documents 
containing binary data, deleting them, and then optimizing the whole 
index, has anybody found a better way to recover from this problem?
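
For reference, the recovery we have in mind would look roughly like 
this (Lucene 2.x API; the doc_id field and value stand in for whatever 
application-level key identifies the polluted documents):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Delete the polluted documents, then optimize so the merge
    // rewrites the segments (and their term indexes) without them.
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.deleteDocuments(new Term("doc_id", "polluted-doc"));  // hypothetical key
    reader.close();

    IndexWriter writer = new IndexWriter("/path/to/index",
        new StandardAnalyzer(), false);
    writer.optimize();
    writer.close();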

Thanks for any insights or suggestions,

Chuck

