lucene-dev mailing list archives

From Culley Angus <Culley.An...@SilentOne.com>
Subject Large stop word list
Date Tue, 13 Nov 2001 04:46:09 GMT
I am currently looking at building a brute-force analyzer that will
effectively be able to index the textual content of any binary file
(within reason).

This is being done by weeding out as much predictable 'binary junk' as
possible from the stream, then building a (large) stop word list to filter
out the rest.
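
The two-stage approach described above could be sketched roughly as
follows. This is only an illustration in plain Java, not Lucene code and
not the analyzer discussed in this message: the class name
BinaryTextSketch, the MIN_RUN threshold, the letters-only rule for what
counts as text, and the sample stop word "jfif" are all assumptions made
for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class BinaryTextSketch {
    // Hypothetical threshold: letter runs shorter than this count as binary junk.
    static final int MIN_RUN = 3;

    /** Stage 1: keep only runs of ASCII letters of at least MIN_RUN chars, lowercased. */
    static List<String> extractTokens(byte[] data) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            char c = (char) (b & 0xFF); // treat each byte as an unsigned value
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                run.append(c);
            } else {
                flush(run, tokens); // a non-letter byte ends the current run
            }
        }
        flush(run, tokens);
        return tokens;
    }

    /** Emits the current run as a token if it is long enough, then resets it. */
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() >= MIN_RUN) {
            tokens.add(run.toString().toLowerCase(Locale.ROOT));
        }
        run.setLength(0);
    }

    /** Stage 2: drop every token that appears in the stop word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample: a format marker, a short junk run, and two real words.
        byte[] data = "\u0000JFIF\u0000hi\u0002invoice\u00FFTotal"
                .getBytes(StandardCharsets.ISO_8859_1);

        Set<String> stopWords = new HashSet<>();
        stopWords.add("jfif"); // hypothetical entry for a recurring junk token

        System.out.println(removeStopWords(extractTokens(data), stopWords));
        // prints [invoice, total]
    }
}
```

In this sketch the stop words live in a HashSet, where a lookup takes
roughly constant time on average however many entries the set holds, so
for a very large list the memory needed to hold the words is more likely
to matter than the per-token lookup cost.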

To do this, I may need a rather large stop word list to filter out the
inevitable junk that will be encountered, and I am a little worried about
the possible size of this list.

Has anyone had problems with something like this, specifically with a
brute-force-style filter or with a particularly large stop word list?

Thanks,
Culley.

--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>