lucene-dev mailing list archives

From robert engels <>
Subject Re: Term pollution from binary data
Date Wed, 07 Nov 2007 01:15:45 GMT
I think the binary section recognizer is probably your best bet.

If you write an analyzer that ignores terms consisting only of  
hexadecimal digits and containing at least one embedded decimal  
digit, you will probably reduce the pollution quite a bit. It is  
trivial to write and not too expensive to check.
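The check Robert describes could be sketched like this (the class and method names here are hypothetical illustrations, not part of Lucene's API; a real analyzer would apply this test inside a TokenFilter). Requiring an embedded digit keeps ordinary English words that happen to be all hex letters, such as "deed" or "cafe":

```java
// Sketch of the token test: a token is treated as likely binary noise
// if it consists only of hexadecimal digits AND contains at least one
// decimal digit. (Names are illustrative, not a Lucene API.)
public class HexTokenCheck {
    static boolean looksLikeHexNoise(String token) {
        if (token.isEmpty()) {
            return false;
        }
        boolean hasDigit = false;
        for (int i = 0; i < token.length(); i++) {
            char c = Character.toLowerCase(token.charAt(i));
            if (c >= '0' && c <= '9') {
                hasDigit = true;              // embedded digit found
            } else if (c < 'a' || c > 'f') {
                return false;                 // non-hex character: keep token
            }
        }
        return hasDigit;                      // all-hex with a digit: drop it
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHexNoise("3fa94c0d")); // true: hex noise
        System.out.println(looksLikeHexNoise("cafe"));     // false: no digit
        System.out.println(looksLikeHexNoise("lucene"));   // false: non-hex chars
    }
}
```

A TokenFilter wrapping this would simply skip any token for which the test returns true, passing everything else through unchanged.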

On Nov 6, 2007, at 6:56 PM, Chuck Williams wrote:

> Hi All,
> We are experiencing OOM's when binary data contained in text files  
> (e.g., a base64 section of a text file) is indexed.  We have  
> extensive recognition of file types but have encountered binary  
> sections inside of otherwise normal text files.
> We are using the default value of 128 for termIndexInterval.  The  
> problem arises because binary data generates a large set of random  
> tokens, leading to totalTerms/termIndexInterval terms stored in  
> memory.  Increasing the -Xmx is not viable as it is already maxed.
> Does anybody know of a better solution to this problem than writing  
> some kind of binary section recognizer/filter?
> It appears that termIndexInterval is factored into the stored index  
> and thus cannot be changed dynamically to work around the problem  
> after an index has become polluted.  Other than identifying the  
> documents containing binary data, deleting them, and then  
> optimizing the whole index, has anybody found a better way to  
> recover from this problem?
> Thanks for any insights or suggestions,
> Chuck
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

