lucene-java-user mailing list archives

From Peter Pimley <>
Subject Source code for an accent-removal filter
Date Tue, 01 Feb 2005 10:25:28 GMT


In December I made some posts concerning a filter that could work by 
getting the Unicode name of a character and trying to figure out the 
closest Latin equivalent.  For example, if it encountered character U+00C1 
(LATIN CAPITAL LETTER A WITH ACUTE), it would be clever enough to replace 
that with a regular 'A'.
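
As a rough illustration of the idea, here is a minimal Java sketch that looks up a character's Unicode name at run time and picks the base letter out of it. It is not the generated filter: the class name NameBasedFolding and the method fold are made up for this example, and it relies on Character.getName, which only exists in Java 7 and later. The name pattern is the same shape as the one used in the Perl script below.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the generated filter.
class NameBasedFolding {

    // Same shape as the generator's pattern:
    // LATIN <case> LETTER <one letter> <anything else>
    private static final Pattern LATIN =
        Pattern.compile("LATIN (.*) LETTER (.)( .*)$");

    // Returns the plain Latin letter suggested by the character's Unicode
    // name, or the character itself if the name does not fit the pattern.
    static char fold(char c) {
        String name = Character.getName(c); // null for unassigned code points
        if (name == null) return c;

        Matcher m = LATIN.matcher(name);
        if (!m.find()) return c;

        char letter = m.group(2).charAt(0);
        return m.group(1).equals("SMALL") ? Character.toLowerCase(letter) : letter;
    }

    public static void main(String[] args) {
        System.out.println(fold('\u00C1')); // A with acute
        System.out.println(fold('\u00E9')); // e with acute
        System.out.println(fold('x'));      // plain letter, name has no suffix
    }
}
```

A plain letter such as 'x' is left alone because its name (LATIN SMALL LETTER X) has nothing after the letter, so the pattern does not match.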

I got moved onto another project for a while so I've not looked at the 
problem much since then.  I'm back on it for a few days now though :)

The following Perl program generates some Java source for a filter that 
carries out the above task.

Get 'UnicodeData.txt' from, and then do the following:
    perl < UnicodeData.txt
to generate make/this/java/

This comes with no license and no warranty  ;)

Do not think this is the full solution to your unicode-mangling 
problems.  I'm using it as a last resort catch-all after some other 
filters that use the IBM ICU4J library to do all sorts of decomposition 
and character-category magic.  Once I get it all working I should be 
able to post some pointers and code snippets up here.
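
To give a feel for what a decomposition step does, and why a catch-all is still useful after it, here is a small sketch. It is an assumption-laden stand-in, not the ICU4J-based filters mentioned above: it uses the JDK's own java.text.Normalizer (Java 6 and later), and the class name DecomposeFolding and method stripMarks are invented for the example.

```java
import java.text.Normalizer;

// Hypothetical class for illustration; uses the JDK Normalizer, not ICU4J.
class DecomposeFolding {

    // Decompose to NFD, then drop combining marks (Unicode category M).
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // e with acute decomposes into 'e' plus a combining acute accent
        System.out.println(stripMarks("caf\u00E9"));
        // O with stroke has no canonical decomposition, so it survives
        System.out.println(stripMarks("\u00D8"));
    }
}
```

Characters such as U+00D8 (LATIN CAPITAL LETTER O WITH STROKE) have no decomposition, so they pass through this step unchanged; a name-based table like the one generated below is one way to catch them.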



# usage:  perl my.full.ClassName < UnicodeData.txt
# creates my/full/

use strict;
use warnings;

use File::Path;
use File::Basename;

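# decompose the classname that they gave us.
# A classname with no dots (i.e. one that is not in a package) is not
# supported, so fail loudly rather than emit a broken 'package' line.
my $full_class = shift;
die "usage: perl $0 my.full.ClassName < UnicodeData.txt\n"
    unless defined $full_class;
my ($package, $class) = $full_class =~ /^(.*)\.(.*)$/
    or die "Class name must be in a package: $full_class\n";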

# print to the correct place
my $path = $full_class;
$path =~ s/\./\//g;
$path = "$path.java";
mkpath dirname $path;
open STDOUT, '>', $path or die "Could not redirect stdout to $path: $!";

print <<END_JAVA;


package $package;

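import org.apache.lucene.analysis.*;
import java.io.IOException;
import java.util.*;

public class $class extends TokenFilter {

    public $class (TokenStream input) {
        super (input);
        // build the lookup table (only does work for the first instance)
        createHash ();
    }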

    // The replacement character, indexed by Unicode value.
    // (i.e. Character objects indexed by Integer objects)
    private static Hashtable values = null;

    // Creates a Hashtable from the array at the bottom of this file.
    private void createHash () {
        // only run this for the first object of this class
        if (values != null) return;
        values = new Hashtable ();

        int i = 0;
        while (true) {
            if (array[i] == null) break; // 'array' is null terminated.

            Object number = array[i++];
            Object replacement = array[i++];

            values.put (number, replacement);
        }

        // we're done with 'array', it can be garbage collected
        array = null;
    }

    public Token next () throws IOException {
        Token t = input.next ();
        if (t == null) return null; // eof

        String s = t.termText();
        s = substituteAZString (s);

        return new Token (s, t.startOffset(), t.endOffset());
    }

    private String substituteAZString (String s) {

        char [] current = s.toCharArray ();
        char [] AZ = new char [current.length];
        int AZi = 0;

        for (int i=0; i<current.length; i++) {
            AZ[AZi++] = substituteAZChar (current[i]);
        }

        s = new String (AZ);
        return s;
    }

    private char substituteAZChar (char c) {
        Integer key = new Integer ((int) c);
        if (values.containsKey(key)) {
            c = ((Character)values.get(key)).charValue();
        }
        return c;
    }

    private static Object [] array = {
END_JAVA

# we only care about characters whose names are of the form:
my $latin_pattern = 'LATIN (.*) LETTER (.)( .*)$';

while (<STDIN>) {
    my @parts = split ";";

    my $num  = shift @parts;
    my $name = shift @parts;

    my @matches;

    if (@matches = ($name =~ $latin_pattern)) {

        my $case = shift @matches;
        my $convert_to_lc = $case eq "SMALL";

        my $letter = shift @matches;
        $letter = lc $letter if $convert_to_lc;

        printf "    new Integer (0x%s), new Character ('%s'), // %s\n",
            $num, $letter, $name;
    }
}

print <<END_JAVA;
    null };
}
END_JAVA
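
For anyone who would rather skip the code-generation step, the same table could probably be built at run time. The following is only a sketch under that assumption: it has no Lucene dependency, the class name FoldingTable and its fold method are made up for the example, and the two sample records are in UnicodeData.txt format (semicolon-separated, code point first, name second).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not the generated filter itself.
class FoldingTable {

    // Same shape as the pattern in the Perl generator.
    private static final Pattern LATIN =
        Pattern.compile("LATIN (.*) LETTER (.)( .*)$");

    private final Map<Character, Character> table = new HashMap<>();

    // Each line is one UnicodeData.txt record: code point; name; other fields
    FoldingTable(Iterable<String> lines) {
        for (String line : lines) {
            String[] fields = line.split(";");
            if (fields.length < 2) continue;

            Matcher m = LATIN.matcher(fields[1]);
            if (!m.find()) continue;

            int codePoint = Integer.parseInt(fields[0], 16);
            if (codePoint > Character.MAX_VALUE) continue; // does not fit in a char

            char letter = m.group(2).charAt(0);
            if (m.group(1).equals("SMALL")) letter = Character.toLowerCase(letter);
            table.put((char) codePoint, letter);
        }
    }

    // Replaces every character that has a table entry; others pass through.
    String fold(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            Character replacement = table.get(out[i]);
            if (replacement != null) out[i] = replacement;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        FoldingTable t = new FoldingTable(Arrays.asList(
            "00C1;LATIN CAPITAL LETTER A WITH ACUTE;Lu;0;L;0041 0301;;;;N;LATIN CAPITAL LETTER A ACUTE;;;00E1;",
            "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9"));
        System.out.println(t.fold("\u00C1rb\u00E9"));
    }
}
```

The trade-off is that UnicodeData.txt then has to be shipped and parsed at start-up, whereas the generated class carries its table in the source.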
