lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Invalid UTF-8
Date Wed, 27 Jan 2010 02:03:19 GMT
On Tue, Jan 26, 2010 at 12:09:20AM -0600, Peter Karman wrote:
> Here's the test case.

Thanks for the hard work building this case.

>  perl docmaker.pl \
>     --utf_factor=0 \
>     --write_files \
>     --tmp_dir path/to/my/testdocs/ \
>     --max_files 33000 \
>     --max_words 3 \
>     --tmp_dir_segments 2

I wonder whether this produces the same corpus on my OS X 10.5.8 MBPro as on
your system.

> there appears to be something magical in the *total number* of terms parsed.

Might have something to do with when runs are flushed.

> Here are some things I notice.
> 
> 1) if I comment out the swishwordnum and swishdescription in parse_file() 
> it works.
> 
> 2) if I comment out the swishdescription alone, it fails.
> 
> 3) if I comment out the swishwordnum alone, it fails.

I tried out all four possible combinations of swishwordnum and
swishdescription, each either left in or commented out:

         swishdescription  => "",  # yes, empty
         swishwordnum      => 0,   # yes, zero

         #swishdescription  => "",  # yes, empty
         swishwordnum      => 0,   # yes, zero
    
         swishdescription  => "",  # yes, empty
         #swishwordnum      => 0,   # yes, zero

         #swishdescription  => "",  # yes, empty
         #swishwordnum      => 0,   # yes, zero

No matter what, I see the following output:

marvin@smokey:~/projects/ks/perl $ rm -rf test-ks-utf8/ ; perl -Mblib karpet_utf8_test.pl
testdocs/
Crawled 33000 documents
marvin@smokey:~/projects/ks/perl $ 
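
For anyone repeating the experiment, the four cases can be generated rather
than hand-edited. This is only a rough sketch: build_fields() is a
hypothetical stand-in for the field hash built in parse_file(), and it models
nothing but the two fields under discussion.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the field hash built in parse_file();
# only the two fields under discussion are modelled here.
sub build_fields {
    my (%include) = @_;
    my %fields;
    $fields{swishdescription} = "" if $include{swishdescription};
    $fields{swishwordnum}     = 0  if $include{swishwordnum};
    return \%fields;
}

# Enumerate every on/off combination of the two fields (2 x 2 = 4 cases).
my @cases;
for my $desc ( 1, 0 ) {
    for my $num ( 1, 0 ) {
        push @cases,
            build_fields(
                swishdescription => $desc,
                swishwordnum     => $num,
            );
    }
}

# Report which fields are present in each case.
for my $fields (@cases) {
    my @present = sort keys %$fields;
    my $label   = @present ? join( ", ", @present ) : "(neither field)";
    print "$label\n";
}
```

Each entry in @cases could then be merged into the document hash before
indexing, so that every run differs only in those two fields.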


Before we go further, what kind of system are you having trouble on?  Is it a
64-bit box?
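
If it helps, something like the following should report the relevant details.
It is a minimal sketch that reads the core Config module; platform_report() is
just an illustrative helper name, and the choice of keys is my guess at what
matters here.

```perl
use strict;
use warnings;
use Config;    # core module exposing the settings perl was built with

# Collect the build settings that bear on "is it a 64-bit box?".
sub platform_report {
    my %report;
    for my $key (qw(osname archname ivsize ptrsize)) {
        $report{$key} = defined $Config{$key} ? $Config{$key} : 'undef';
    }

    # use64bitint is 'define' when perl was built with 64-bit integers.
    $report{use64bitint} = $Config{use64bitint} || 'undef';
    return \%report;
}

my $report = platform_report();
printf "%-12s %s\n", $_, $report->{$_} for sort keys %$report;
```

An ivsize or ptrsize of 8 indicates 64-bit integers or pointers respectively;
4 indicates 32-bit.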

Marvin Humphrey

