lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-826) Language detector
Date Wed, 07 Mar 2007 08:40:24 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478712 ]

Karl Wettin commented on LUCENE-826:
------------------------------------

Ahhh, I could not let it go without some more tests. I added a bunch of languages, and it seems
to work quite splendidly. Again, 10-fold cross-validation output on paragraphs at least 160
characters long:

Time taken to build model: 45.51 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5566               98.8808 %
Incorrectly Classified Instances        63                1.1192 %
Kappa statistic                          0.9874
Mean absolute error                      0.139 
Root mean squared error                  0.2555
Relative absolute error                 93.6301 %
Root relative squared error             93.7791 %
Total Number of Instances             5629     

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
  0.996     0.003      0.988     0.996     0.992      0.997    eng
  0.988     0          0.998     0.988     0.993      0.995    swe
  0.984     0.002      0.982     0.984     0.983      0.996    spa
  0.988     0          0.995     0.988     0.992      0.997    fre
  0.979     0.001      0.982     0.979     0.981      0.992    nld
  0.97      0.002      0.97      0.97      0.97       0.993    nor
  1         0          1         1         1          1        afr
  0.914     0.001      0.946     0.914     0.93       0.992    dan
  0.986     0.001      0.981     0.986     0.984      0.999    pot
  0.998     0.001      0.993     0.998     0.995      0.999    fin
  0.99      0.001      0.993     0.99      0.992      0.999    ita
  0.998     0          0.998     0.998     0.998      0.999    ger

=== Confusion Matrix ===

    a    b    c    d    e    f    g    h    i    j    k    l   <-- classified as
 1044    1    1    0    0    0    0    0    1    1    0    0 |    a = eng
    2  425    0    0    2    0    0    0    0    0    1    0 |    b = swe
    0    0  434    1    1    0    0    0    5    0    0    0 |    c = spa
    2    0    0  418    0    0    0    0    0    1    0    2 |    d = fre
    4    0    2    0  333    0    0    0    0    0    1    0 |    e = nld
    1    0    0    0    0  322    0    7    1    0    1    0 |    f = nor
    0    0    0    0    0    0  230    0    0    0    0    0 |    g = afr
    1    0    0    0    2   10    0  139    0    0    0    0 |    h = dan
    0    0    5    0    0    0    0    0  362    0    0    0 |    i = pot
    0    0    0    0    0    0    0    1    0  440    0    0 |    j = fin
    2    0    0    0    1    0    0    0    0    1  417    0 |    k = ita
    1    0    0    1    0    0    0    0    0    0    0 1002 |    l = ger
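
For anyone who wants to check the per-class figures against the matrix, here is a minimal
sketch that recomputes accuracy, recall and precision from the numbers above. It is plain
Java and not part of the patch; the class and method names are made up for illustration,
and Weka computes these figures itself.

```java
import java.util.Locale;

/** Illustrative only: recomputes summary figures from the confusion matrix above. */
final class ConfusionMatrixMetrics {

    static final String[] LABELS = {
        "eng", "swe", "spa", "fre", "nld", "nor", "afr", "dan", "pot", "fin", "ita", "ger"
    };

    // Rows are the actual language, columns are the language the classifier chose.
    static final int[][] MATRIX = {
        {1044,   1,   1,   0,   0,   0,   0,   0,   1,   1,   0,    0},
        {   2, 425,   0,   0,   2,   0,   0,   0,   0,   0,   1,    0},
        {   0,   0, 434,   1,   1,   0,   0,   0,   5,   0,   0,    0},
        {   2,   0,   0, 418,   0,   0,   0,   0,   0,   1,   0,    2},
        {   4,   0,   2,   0, 333,   0,   0,   0,   0,   0,   1,    0},
        {   1,   0,   0,   0,   0, 322,   0,   7,   1,   0,   1,    0},
        {   0,   0,   0,   0,   0,   0, 230,   0,   0,   0,   0,    0},
        {   1,   0,   0,   0,   2,  10,   0, 139,   0,   0,   0,    0},
        {   0,   0,   5,   0,   0,   0,   0,   0, 362,   0,   0,    0},
        {   0,   0,   0,   0,   0,   0,   0,   1,   0, 440,   0,    0},
        {   2,   0,   0,   0,   1,   0,   0,   0,   0,   1, 417,    0},
        {   1,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0, 1002},
    };

    /** Total number of classified instances (sum of all cells). */
    static int total() {
        int sum = 0;
        for (int[] row : MATRIX) {
            for (int cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    /** Fraction of instances on the diagonal, i.e. correctly classified. */
    static double accuracy() {
        int correct = 0;
        for (int i = 0; i < MATRIX.length; i++) {
            correct += MATRIX[i][i];
        }
        return correct / (double) total();
    }

    /** Recall (TP rate) for class c: diagonal cell divided by its row sum. */
    static double recall(int c) {
        int rowSum = 0;
        for (int cell : MATRIX[c]) {
            rowSum += cell;
        }
        return MATRIX[c][c] / (double) rowSum;
    }

    /** Precision for class c: diagonal cell divided by its column sum. */
    static double precision(int c) {
        int colSum = 0;
        for (int[] row : MATRIX) {
            colSum += row[c];
        }
        return MATRIX[c][c] / (double) colSum;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "accuracy = %.4f%n", accuracy());
        for (int c = 0; c < LABELS.length; c++) {
            System.out.printf(Locale.ROOT, "%s recall = %.3f precision = %.3f%n",
                    LABELS[c], recall(c), precision(c));
        }
    }
}
```

The Danish row is the interesting one: 139 of 152 Danish paragraphs are recognised (recall
about 0.914), and 10 of the 13 misses go to Norwegian.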



    root.addBranch("uralic");
    root.addBranch("uralic", "fino-ugric");
    root.addBranch("uralic", "ugric");
    //root.addLanguage("hungarian", "ugric");
    root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
    //root.addLanguage("sami", "fino-ugric");
    //root.addLanguage("estonian", "fino-ugric");
    //root.addLanguage("livonian", "fino-ugric");

    root.addBranch("proto-indo european");

    root.addBranch("proto-indo european", "italic");
    root.addBranch("italic", "latino-faliscan");
    root.addBranch("latino-faliscan", "latin");
    root.addLanguage("latin", "ita", "italian", "it", "Italia");
    root.addLanguage("latin", "fre", "french", "fr", "France");
    root.addLanguage("latin", "pot", "portugese", "pt", "Portugal");
    root.addLanguage("latin", "spa", "spanish", "es", "Espa%C3%B1a");

    root.addBranch("proto-indo european", "germanic");
    root.addBranch("germanic", "northern germanic");
    root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
    root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
    root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");

    root.addBranch("germanic", "west germanic");
    root.addLanguage("west germanic", "eng", "english", "en", "UK");
    root.addLanguage("west germanic", "ger", "german", "de", "Deutschland");

    root.addBranch("west germanic", "middle dutch");
    root.addLanguage("middle dutch", "nld", "dutch", "nl", "Nederland");
    root.addLanguage("middle dutch", "afr", "afrikaans", "af", "Nederland");
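
To illustrate why the tree might be useful when classifying, here is a small in-memory
sketch of such a hierarchy. It is NOT the LanguageRoot class from the patch: the class
name, the two-argument addLanguage and the commonBranch helper are assumptions made up
for this example, and the argument order (parent first) simply follows the calls above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a language tree; not the LanguageRoot from the patch. */
final class LanguageTreeSketch {

    // Node name (branch name or ISO code) -> parent branch; top-level branches map to null.
    private final Map<String, String> parentOf = new HashMap<>();

    void addBranch(String name) {
        parentOf.put(name, null);
    }

    void addBranch(String parent, String child) {
        requireKnown(parent);
        parentOf.put(child, parent);
    }

    void addLanguage(String parent, String iso) {
        requireKnown(parent);
        parentOf.put(iso, parent);
    }

    private void requireKnown(String node) {
        if (!parentOf.containsKey(node)) {
            throw new IllegalArgumentException("Unknown node: " + node);
        }
    }

    /** Path from the top-level branch down to the given node, inclusive. */
    List<String> pathTo(String node) {
        requireKnown(node);
        List<String> path = new ArrayList<>();
        for (String n = node; n != null; n = parentOf.get(n)) {
            path.add(n);
        }
        Collections.reverse(path);
        return path;
    }

    /** Deepest node shared by both paths, or null if the nodes are in different trees. */
    String commonBranch(String a, String b) {
        List<String> pa = pathTo(a);
        List<String> pb = pathTo(b);
        String common = null;
        int limit = Math.min(pa.size(), pb.size());
        for (int i = 0; i < limit && pa.get(i).equals(pb.get(i)); i++) {
            common = pa.get(i);
        }
        return common;
    }

    /** Builds a subset of the Germanic part of the tree listed above. */
    static LanguageTreeSketch germanicExample() {
        LanguageTreeSketch tree = new LanguageTreeSketch();
        tree.addBranch("proto-indo european");
        tree.addBranch("proto-indo european", "germanic");
        tree.addBranch("germanic", "northern germanic");
        tree.addLanguage("northern germanic", "dan");
        tree.addLanguage("northern germanic", "nor");
        tree.addBranch("germanic", "west germanic");
        tree.addLanguage("west germanic", "eng");
        return tree;
    }

    public static void main(String[] args) {
        LanguageTreeSketch tree = germanicExample();
        System.out.println(tree.pathTo("dan"));
        System.out.println(tree.commonBranch("dan", "nor"));
        System.out.println(tree.commonBranch("dan", "eng"));
    }
}
```

With this layout dan and nor meet at "northern germanic", while dan and eng only meet at
"germanic", which lines up with dan/nor being the most confused pair in the matrix.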
  

> Language detector
> -----------------
>
>                 Key: LUCENE-826
>                 URL: https://issues.apache.org/jira/browse/LUCENE-826
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>         Assigned To: Karl Wettin
>         Attachments: ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid
> false positive classifications.
> Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification (logistic
> support vector models), feature selection, and normalization of token frequencies. Optionally
> uses Wikipedia and NekoHTML for harvesting training data.
> Initialized like this:
> {code}
>     LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
>     root.addBranch("uralic");
>     root.addBranch("fino-ugric", "uralic");
>     root.addBranch("ugric", "uralic");
>     root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
>     root.addBranch("proto-indo european");
>     root.addBranch("germanic", "proto-indo european");
>     root.addBranch("northern germanic", "germanic");
>     root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
>     root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
>     root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
>     root.addBranch("west germanic", "germanic");
>     root.addLanguage("west germanic", "eng", "english", "en", "UK");
>     root.mkdirs();
>     LanguageClassifier classifier = new LanguageClassifier(root);
>     if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>       classifier.compileTrainingData(); // from wikipedia
>     }
>     classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the home country of
> each registered language, written in the language being trained. The example above passes
> this test:
> (testEquals is the same as assertEquals, except that it is not required to pass. Only one
> of them fails; see the comment.)
> {code}
>     assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
>     testEquals("swe", classifier.classify(norway_in_swedish).getISO());
>     testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
>     testEquals("swe", classifier.classify(finland_in_swedish).getISO());
>     testEquals("swe", classifier.classify(uk_in_swedish).getISO());
>     testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
>     assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
>     testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
>     testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
>     testEquals("fin", classifier.classify(norway_in_finnish).getISO());
>     testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
>     assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
>     testEquals("fin", classifier.classify(uk_in_finnish).getISO());
>     testEquals("dan", classifier.classify(sweden_in_danish).getISO());
>     // It is OK that this fails: dan and nor are very similar, and the document about
>     // Norway in Danish is very small.
>     testEquals("dan", classifier.classify(norway_in_danish).getISO()); 
>     assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
>     testEquals("dan", classifier.classify(finland_in_danish).getISO());
>     testEquals("dan", classifier.classify(uk_in_danish).getISO());
>     testEquals("eng", classifier.classify(sweden_in_english).getISO());
>     testEquals("eng", classifier.classify(norway_in_english).getISO());
>     testEquals("eng", classifier.classify(denmark_in_english).getISO());
>     testEquals("eng", classifier.classify(finland_in_english).getISO());
>     assertEquals("eng", classifier.classify(uk_in_english).getISO());
> {code}
> I don't know how well it works on lots of languages, but this fits my needs for now.
> I'll try to do more work on considering the language trees when classifying.
> It takes a bit of time and RAM to build the training data, so the patch contains a
> pre-compiled ARFF file.
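
As a rough illustration of the n-gram idea in the issue description, the sketch below turns
a text into relative character n-gram frequencies. This is an assumption about the general
technique, not the code in ld.tar.gz: the patch uses contrib/analyzers/ngrams and Weka for
this, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a character n-gram frequency profile; not the patch's code. */
final class NgramProfileSketch {

    /**
     * Returns the relative frequency of every character n-gram in the text.
     * Works on UTF-16 code units, which is good enough for a sketch.
     */
    static Map<String, Double> profile(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String s = text.toLowerCase(Locale.ROOT);

        // Count each n-gram with a sliding window of width n.
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
            total++;
        }

        // Normalize counts so that texts of different lengths are comparable.
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Trigram profile of a short Swedish phrase.
        System.out.println(profile("Sverige är ett land", 3));
    }
}
```

Short inputs yield only a handful of n-grams, so their frequencies are noisy, which fits
the remark that a full paragraph is needed to avoid false positives.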

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

