From Peter Karman
Subject Re: Invalid UTF-8
Date Thu, 28 Jan 2010 04:43:22 GMT
Marvin Humphrey wrote on 1/27/10 6:41 PM:
> On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:
>> Yup, I've now duplicated the problem on my system using 60,000 docs.  
> Fixed by r5764.

cool. thanks for digging in.

I have tested it under RHEL (works great with ~90k docs, 2g of data) and OSX 
10.6 (where it fails, see below), both 64-bit arch.

The OSX behaviour was weird. First time it segfaulted. Ran it again under gdb 
and it completed ok. Ran it again without gdb and I got this:

[karpet@pekmac:~/tmp]$ perl swishdocs2/
Crawled 1000000 documents
Read past EOF of 
'/Volumes/users/karpet/tmp/test-ks-utf8/seg_2/ptemp-4284913-to-4383411' (offset: 
4284913 len: 98498), S_refill at ../core/KinoSearch/Store/InStream.c line 145
  at line 65

Using same test script as I posted before, with 1m docs instead of 33k.

>> I bet I can get that way down by fiddling with the flush threshold.
> Ultimately, I was able to isolate the trigger to a single document with two fields, by
> bringing the threshold at which PostingListWriter flushes all of its
> PostingPools way, way down:
> -#define DEFAULT_MEM_THRESH 0x1000000
> +/* #define DEFAULT_MEM_THRESH 0x1000000 */
> +#define DEFAULT_MEM_THRESH 0x10
> When that variable lived in Perl, the KinoSearch::Test module used to set it
> to a much smaller number at load time.  This had the effect of simulating
> large indexes as far as PostingListWriter was concerned, by forcing runs to be
> flushed many many times.  However, it turns out that we have been doing
> without that important simulation for a long time -- the entire KS test suite
> was not triggering a PostingPool flush even once.  I'm a little surprised that
> after all the refactoring I did on this code recently, there was only a single
> glitch that needed to be fixed.  
> Now even if I set the threshold to 0x100, the whole test suite passes.
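
To make the effect of that threshold concrete, here is a minimal sketch in C. It is 
not KinoSearch code: ToyWriter, toy_writer_add and toy_count_flushes are 
hypothetical names, and the threshold is held in a struct field (an assumption, so 
it can be changed per writer) rather than in a #define. It only counts how often a 
buffer would be flushed for a given threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a posting writer; NOT the KinoSearch API. */
typedef struct {
    size_t mem_thresh;   /* flush once buffered bytes reach this value */
    size_t buffered;     /* bytes currently held in RAM */
    size_t flush_count;  /* threshold-triggered flushes so far */
} ToyWriter;

void toy_writer_init(ToyWriter *w, size_t mem_thresh) {
    assert(w != NULL);
    w->mem_thresh  = mem_thresh;
    w->buffered    = 0;
    w->flush_count = 0;
}

/* Pretend to write the buffered run to disk and release the memory. */
void toy_writer_flush(ToyWriter *w) {
    w->flush_count++;
    w->buffered = 0;
}

/* Buffer one document's postings; flush when the threshold is reached. */
void toy_writer_add(ToyWriter *w, size_t doc_bytes) {
    w->buffered += doc_bytes;
    if (w->buffered >= w->mem_thresh) {
        toy_writer_flush(w);
    }
}

/* Feed n_docs documents of doc_bytes each through a fresh writer and
 * return how many threshold-triggered flushes occurred. */
size_t toy_count_flushes(size_t mem_thresh, size_t n_docs, size_t doc_bytes) {
    ToyWriter w;
    toy_writer_init(&w, mem_thresh);
    for (size_t i = 0; i < n_docs; i++) {
        toy_writer_add(&w, doc_bytes);
    }
    return w.flush_count;
}
```

With 100 documents of 64 bytes each, toy_count_flushes(0x1000000, 100, 64) never 
reaches the threshold, while toy_count_flushes(0x10, 100, 64) flushes after every 
document. That is the sense in which a tiny threshold simulates a large index.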

this is good and interesting to know. Is there a way, or any plan, to make 
DEFAULT_MEM_THRESH alterable at runtime? I'm assuming that in situations where 
available RAM is low, it would be helpful to trade off speed for memory by 
setting the threshold lower and flushing to disk more often. Is that a realistic 

Peter Karman

