uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: Using extensions
Date Wed, 28 Aug 2019 11:47:56 GMT
Hi,


we (Averbis) have an annotator which does exactly what you describe, but
unfortunately I cannot share it.  However, I can tell you that the
annotator is almost completely implemented in Ruta and uses no Ruta
language extensions.


If you want to learn more about language extensions, there are
example projects in the Ruta trunk: ruta-core-ext and
example-projects/ruta-ep-example-extensions.


If you want to build the annotator with Ruta rules, I can help you
create it.


As a starting point you need some dictionaries (wordtables) for numbers
(ein;1\neins;1\nzwei;2....), exponents/multipliers (tausend;3) and
special characters (½). For German that's not too many, maybe one
hundred entries overall is a good start.
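
To make the table format concrete, here is a minimal Python sketch (not Ruta, and not part of our annotator) that parses "word;value" lines into lookup tables. The entries beyond ein/eins/zwei and tausend, and the helper name load_wordtable, are assumptions made up for illustration. Note that the value of a multiplier entry is the power of ten, so tausend maps to 3.

```python
# Hypothetical table contents in the "word;value" format described above.
NUMBERS_CSV = "ein;1\neins;1\nzwei;2\ndrei;3\nfünf;5\nzwanzig;20"
# For multipliers the value is the exponent of ten (tausend = 10**3).
MULTIPLIERS_CSV = "tausend;3\nmillionen;6"


def load_wordtable(text):
    """Parse 'word;value' lines into a dict mapping lowercased word -> int."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        word, value = line.split(";")
        table[word.strip().lower()] = int(value)
    return table


numbers = load_wordtable(NUMBERS_CSV)
multipliers = load_wordtable(MULTIPLIERS_CSV)
```

Entries such as ½ would need a non-integer value, which this integer-only sketch leaves out.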

Before you can apply the dictionaries, you need to split the RutaBasics
using some conjunction words in order to map the subword segments. You
can do that with a simple regex rule:

"und" -> ConjunctionFragment;

Then, you can write some rules that combine numbers using additions,
multiplications and exponents, e.g., something like:


FOREACH(num, false) NumericValue{}{

        // combination with multipliers like 3 million
        (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
            SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
            (
                Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))}
                add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(add2)}
            )*);

        // fünfundzwanzig
        (num{PARTOF(W)-> SHIFT(NumericValue,1,3)}
            ConjunctionFragment
            add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(add)})
            {-> num.value = (num.value + add.value)};

}
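
The arithmetic these two rules perform can be sketched outside Ruta as follows. This is only an illustrative Python sketch under assumptions: the tiny tables and the function names combine_conjunction and apply_multiplier are hypothetical, and it ignores everything the real rules do with annotations (SHIFT, UNMARK, the conditions).

```python
# Hypothetical lookup tables; multiplier values are exponents of ten.
NUMBERS = {"ein": 1, "zwei": 2, "drei": 3, "fünf": 5, "zwanzig": 20}
EXPONENTS = {"tausend": 3, "millionen": 6}


def combine_conjunction(word):
    """Add the parts around 'und', e.g. 'fünfundzwanzig' -> 5 + 20."""
    left, sep, right = word.lower().partition("und")
    if sep and left in NUMBERS and right in NUMBERS:
        return NUMBERS[left] + NUMBERS[right]
    # No usable split: fall back to a plain table lookup (None if unknown).
    return NUMBERS.get(word.lower())


def apply_multiplier(value, multiplier_word, addend=0):
    """Scale by 10**exponent, then add an optional trailing number."""
    return value * 10 ** EXPONENTS[multiplier_word.lower()] + addend


twenty_five = combine_conjunction("fünfundzwanzig")
three_million = apply_multiplier(3, "Millionen")
```

Here twenty_five is 25 and three_million is 3000000, and apply_multiplier(2, "tausend", 5) gives 2005, which corresponds to the multiply-then-add order inside the first rule.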


At the end you get about 200 lines of Ruta ...




Best,


Peter

On 27.08.2019 at 16:30, Dominik Terweh wrote:
>
> Dear All,
>
>  
>
> When working with German written-out numbers, I figured that in order
> to get what I want (the numeric value of a written number) I need to
> either hard-code every single number name and use a Wordtable, or
> work with the string. However, this made me think that this
> would probably be better done in a language extension. Unfortunately, I
> am not sure how these work or how I can include them in my project.
> The manual did not really help me there either
> (https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.extensions).
>
>
>  
>
> Further I was wondering if there are any readily available extensions
> that can be used, e.g. to convert a string of number words into actual
> numbers (or replacing words on a dictionary basis, such as “one”:”1”,
> “two”:”2”,…), or an extension, that can evaluate a calculation in the
> form of a string (like “100*5+55”).  If something exists for number
> conversion it would be interesting to see if it does both, annotation
> and calculation, and how it handles different languages such as:
>
> 1) input is one token (like numbers in German, einundzwanzig)
>
> 2) input is several tokens jointly representing one number (like in
> English: twenty two)
>
> And mixed cases such as:
>
> 3) input is combination of number and string (like: 10 Millionen)
>
>  
>
> Thank you in advance for your help,
>
> Best
>
> Dominik
>
> Dominik Terweh
> Praktikant
>
> *Drooms GmbH*
> Eschersheimer Landstraße 6
> 60322 Frankfurt, Germany
> www.drooms.com <http://www.drooms.com>
>
> Phone: 	
> Mail: 	d.terweh@drooms.com <mailto:d.terweh@drooms.com>
>
>
> *Drooms GmbH*; Sitz der Gesellschaft / Registered Office:
> Eschersheimer Landstr. 6, D-60322 Frankfurt am Main; Geschäftsführung
> / Management Board: Alexandre Grellier;
> Registergericht / Court of Registration: Amtsgericht Frankfurt am
> Main, HRB 76454; Finanzamt / Tax Office: Finanzamt Frankfurt am Main,
> USt-IdNr.: DE 224007190
>
-- 
Dr. Peter Klügl
R&D Text Mining/Machine Learning

Averbis GmbH
Salzstr. 15
79098 Freiburg
Germany

Fon: +49 761 708 394 0
Fax: +49 761 708 394 10
Email: peter.kluegl@averbis.com
Web: https://averbis.com

Headquarters: Freiburg im Breisgau
Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó

