lucene-dev mailing list archives

From "Peter Taylor (JIRA)" <>
Subject [jira] Commented: (LUCENE-826) Language detector
Date Thu, 08 Nov 2007 18:15:50 GMT


Peter Taylor commented on LUCENE-826:

Just out of curiosity, which version of Weka are you using?

I ask because with newer versions of Weka the source file has the following problem:

stringToWordVector.setDelimiters(";"); <-- setDelimiters method has disappeared
stringToWordVector.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL,
StringToWordVector.TAGS_FILTER)); <-- this works

With older versions of Weka the problem is reversed:

stringToWordVector.setDelimiters(";"); <-- this now works :-)
stringToWordVector.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL,
StringToWordVector.TAGS_FILTER)); <-- older versions of the API simply expect a boolean
value rather than a SelectedTag object as a parameter
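
One possible workaround for a method that exists in one library version but not another is to look it up reflectively and skip the call when it is absent. The following is only a sketch of that idea: `WekaCompatSketch`, `OldFilter`, `NewFilter` and `invokeIfPresent` are hypothetical names invented for illustration, the two filter classes are stand-ins rather than Weka classes, and nothing here is taken from the attached patch.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: none of these names come from Weka or from the patch.
class WekaCompatSketch {

    /** Stand-in for an older API that still has setDelimiters(String). */
    public static class OldFilter {
        public String delimiters;
        public void setDelimiters(String d) { this.delimiters = d; }
    }

    /** Stand-in for a newer API where the method has been removed. */
    public static class NewFilter { }

    /**
     * Calls target.name(arg) if a public one-argument method with that name
     * and parameter type exists. Returns false when the method is missing.
     */
    static boolean invokeIfPresent(Object target, String name,
                                   Class<?> paramType, Object arg) {
        try {
            Method m = target.getClass().getMethod(name, paramType);
            m.invoke(target, arg);
            return true;
        } catch (NoSuchMethodException e) {
            return false; // this version of the class has no such method
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Call to " + name + " failed", e);
        }
    }

    public static void main(String[] args) {
        boolean onOld = invokeIfPresent(new OldFilter(), "setDelimiters", String.class, ";");
        boolean onNew = invokeIfPresent(new NewFilter(), "setDelimiters", String.class, ";");
        System.out.println("old API call made: " + onOld);
        System.out.println("new API call made: " + onNew);
    }
}
```

Reflection only hides the compile error. If the delimiter setting matters for tokenization, the newer API would still need an equivalent configuration, which this sketch does not attempt.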

Please advise :-)



> Language detector
> -----------------
>                 Key: LUCENE-826
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: ld.tar.gz, ld.tar.gz
> A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid
false positive classifications. 
> Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification (logistic
support vector models), feature selection, and normalization of token frequencies. Optionally
uses Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
>     LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
>     root.addBranch("uralic");
>     root.addBranch("fino-ugric", "uralic");
>     root.addBranch("ugric", "uralic");
>     root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
>     root.addBranch("proto-indo european");
>     root.addBranch("germanic", "proto-indo european");
>     root.addBranch("northern germanic", "germanic");
>     root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
>     root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
>     root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
>     root.addBranch("west germanic", "germanic");
>     root.addLanguage("west germanic", "eng", "english", "en", "UK");
>     root.mkdirs();
>     LanguageClassifier classifier = new LanguageClassifier(root);
>     if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>       classifier.compileTrainingData(); // from wikipedia
>     }
>     classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the home country of each registered
language, written in the language to train. The above example passes this test:
> (testEquals is the same as assertEquals, just not required. Only one of them fails, see
> {code}
>     assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
>     testEquals("swe", classifier.classify(norway_in_swedish).getISO());
>     testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
>     testEquals("swe", classifier.classify(finland_in_swedish).getISO());
>     testEquals("swe", classifier.classify(uk_in_swedish).getISO());
>     testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
>     assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
>     testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
>     testEquals("fin", classifier.classify(norway_in_finnish).getISO());
>     testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
>     assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
>     testEquals("fin", classifier.classify(uk_in_finnish).getISO());
>     testEquals("dan", classifier.classify(sweden_in_danish).getISO());
>     // It is OK that this fails: dan and nor are very similar, and the document
>     // about Norway in Danish is very small.
>     testEquals("dan", classifier.classify(norway_in_danish).getISO()); 
>     assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
>     testEquals("dan", classifier.classify(finland_in_danish).getISO());
>     testEquals("dan", classifier.classify(uk_in_danish).getISO());
>     testEquals("eng", classifier.classify(sweden_in_english).getISO());
>     testEquals("eng", classifier.classify(norway_in_english).getISO());
>     testEquals("eng", classifier.classify(denmark_in_english).getISO());
>     testEquals("eng", classifier.classify(finland_in_english).getISO());
>     assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
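
The description says only that testEquals behaves like assertEquals but is not required to pass. One way such a helper could work is to record a mismatch instead of throwing. This is a guess at the idea, not the implementation from the attachment, and `SoftAssertSketch` and `failures` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of an optional equality check; not the patch's testEquals.
class SoftAssertSketch {

    /** Mismatches collected so far, reported without failing the test run. */
    static final List<String> failures = new ArrayList<>();

    /** Like assertEquals, but records a mismatch instead of throwing. */
    static boolean testEquals(Object expected, Object actual) {
        if (Objects.equals(expected, actual)) {
            return true;
        }
        failures.add("expected <" + expected + "> but was <" + actual + ">");
        return false;
    }

    public static void main(String[] args) {
        testEquals("swe", "swe"); // matches, nothing recorded
        testEquals("dan", "nor"); // mismatch, recorded but not fatal
        System.out.println(failures.size() + " optional check(s) failed: " + failures);
    }
}
```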
> I don't know how well it works on lots of languages, but this fits my needs for now.
I'll try to do more work on considering the language trees when classifying.
> It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
