lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yonik Seeley <yo...@lucidimagination.com>
Subject Re: Undo hyphenation when indexing
Date Fri, 01 Apr 2011 16:23:48 GMT
Solr has a hyphenated word filter you could copy.
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html

On trunk, this has been folded into the analysis module.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

On Fri, Apr 1, 2011 at 11:50 AM, Wulf Berschin <berschin@dosco.de> wrote:
> Hi,
>
> for indexing PDF files we have to undo word hyphenation. The basic idea is
> simply to remove the hyphen when a new line and a small letter follows. Of
> course this approach isnt 100%-foolproofed but checking against a dictionary
> wouldnt be as well...
>
> Since we face this problem too when highlighting using HTMLCharStripper
> (yes, we do have hyphenation in our HTML docs...) it seems to me I have to
> adjust the JFlex generated StandardTokenizerImpl.
>
> Is this the right approach and hwo would I have to modify this script?
>
> Thanks
> Wulf
>
>
> PS: I see that there are changes made in the brand new 3.1.0 version we are
> using 3.0.3, but as far I understand no relevant changes in this respect.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message