From Peter Klügl <peter.klu...@averbis.com>
Subject Re: Using extensions
Date Fri, 30 Aug 2019 14:38:49 GMT
Hi,

On 29.08.2019 at 15:34, Dominik Terweh wrote:
> Hey,
>
> I tried to understand the rules that you suggested and have a few questions (see below).
> What we have (successfully) implemented so far is a set of rules that change the value of the stored string in order to produce an expression that is evaluated subsequently:
> a) replace numbers: "eins" becomes "(1)", "zwei|zwan" becomes "(2)"...
> b) replace factors: "zig" becomes "*(10)", "hundert" becomes "*(100)"... and remove "und"
> c) other Ruta rules interpret the expression in chain-like order
>
> "dreimillionenzweitausendvierhunderteinundzwanzig"
> a) "(3)millionen(2)tausend(4)hundert(1)und(2)zig"
> b) "(3)*(1000000)(2)*(1000)(4)*(100)(1)(2)*(10)"
> c) "(3)*(1000000)(2)*(1000)(400)(21)" => "(3)*(1000000)(2)*(1000)(421)" => "(3000000)(2000)(421)" => "(3000000)(2421)" => "(3002421)"
>
> However, we use replaceAll(string, pattern, pattern) in all these transformations and fear that it might not be the optimal solution for UIMA Ruta.
> Do you have any suggestion?


Why do you want to use a string feature to represent the numeric value?

I would assume that switching to a double/int feature makes it a lot
easier as you can directly perform the calculations.
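
To illustrate the point outside of Ruta, here is a minimal Python sketch of what "directly perform the calculations" could look like once every fragment carries a numeric value instead of a string. The token list, the "num"/"mul" labels and the helper name combine are assumptions made up for this illustration; they are not part of our rule set or type system.

```python
# Hypothetical pre-segmented input for
# "dreimillionenzweitausendvierhunderteinundzwanzig".
# "num" marks a plain number, "mul" marks a multiplier.
TOKENS = [
    ("num", 3), ("mul", 1_000_000),
    ("num", 2), ("mul", 1_000),
    ("num", 4), ("mul", 100),
    ("num", 1), ("num", 20),
]


def combine(tokens):
    """Combine numeric fragments; a multiplier >= 1000 closes the current group."""
    total = 0    # finished groups (millions, thousands, ...)
    current = 0  # group currently being built (0..999)
    for kind, value in tokens:
        if kind == "num":
            current += value
        elif value >= 1000:
            # a bare "tausend" without a leading number counts as 1 * 1000
            total += max(current, 1) * value
            current = 0
        else:
            # "hundert" scales only the number directly in front of it
            current = max(current, 1) * value
    return total + current


print(combine(TOKENS))
```

No intermediate strings such as "(3)*(1000000)" are needed in this variant, which is the main reason a numeric feature tends to be easier to work with.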

Btw here's our type system for numeric values:

https://github.com/averbis/core-typesystems/blob/master/numeric-value-typesystem/src/main/resources/de/averbis/textanalysis/typesystems/NumericValueTypeSystem.xml


>
> Here are the questions for your rules:
> 1)
>> Before you can apply the dictionaries, you need to split the RutaBasics using some conjunction words in order to map the subword segments.
> How exactly can I do that? I know there is SPLIT(), but that can only split an annotation
> on the basis of another annotation inside it, or am I misunderstanding it?
> Because if I could split words, then German agglutinated numbers would be no problem (since we have a working solution for English).


In Ruta, you can use simple regex rules for splitting up annotations. If
you have a rule like:
"und" -> ConjunctionFragment;

Then the "und" within the word fünfundzwanzig is annotated with the type
ConjunctionFragment since the simple regex rules are not bound to
annotations at all.
As a result, however, the RutaBasics are updated: at first there was
only one for the W, afterwards there are three. The WORDTABLE operates
on RutaBasic annotations and is therefore able to find "fünf"=5 and
"zwanzig"=20.
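
The effect of the split can be imitated in a few lines of Python. This is only an analogy for the segmentation, not how Ruta works internally; the dictionary NUMBERS and the function split_on_und are hypothetical stand-ins for the WORDTABLE and the regex rule.

```python
import re

# Hypothetical mini-dictionary standing in for the WORDTABLE.
NUMBERS = {"fünf": 5, "zwanzig": 20}


def split_on_und(word):
    """Split a compound at every "und", keeping the conjunction as its own segment."""
    return [segment for segment in re.split(r"(und)", word) if segment]


segments = split_on_und("fünfundzwanzig")
values = [NUMBERS[s] for s in segments if s in NUMBERS]
print(segments, sum(values))
```

One caveat of the sketch: a bare "und" pattern also matches inside "hundert" (it yields "h", "und", "ert" here), so a real pattern probably needs to be more restrictive or the false matches need to be cleaned up afterwards.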


> 2)
> Is there a special reason why you use 3 for 'thousand' when you use it with POW(10, x)? Intuitively I would just use 1000.


No, I think someone (me?) thought it would be more elegant.


>
> 3)
> In your "combination with multipliers like 3 million"-rule (Rule 1), you shift the annotation to span over (1,4), should it not be (1,3)?


ah yes, that's a typo.


> 4)
> In Rule 1, is num{IS(NumericValue) -> SHIFT(NumericValue,1,4)} just a different way of writing num:NumericValue{-> SHIFT(NumericValue,1,4)}?


The "num" is the variable of the FOREACH block, which in this case
operates from right to left.

So, all rules of the block are performed on each NumericValue
successively. It is a bit more like an FST. The reverse order was
selected due to some calculations.

Your second rule would be performed on all NumericValue before the next
rule is executed.
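
The difference in evaluation order can be sketched in Python as follows. This only models the order in which (annotation, rule) pairs are visited, as described above; the names items, rules, foreach_order and rule_by_rule_order are made up for the illustration and say nothing about Ruta's actual implementation.

```python
# Stand-ins for three NumericValue annotations in text order
# and for the rules inside the block.
items = ["n1", "n2", "n3"]
rules = ["rule1", "rule2"]


def foreach_order(items, rules, forward=True):
    """All rules are applied to one item before moving on to the next item."""
    ordered = items if forward else list(reversed(items))
    return [(item, rule) for item in ordered for rule in rules]


def rule_by_rule_order(items, rules):
    """Each rule is applied to all items before the next rule starts."""
    return [(item, rule) for rule in rules for item in items]


# FOREACH(num, false): right to left, all rules per annotation
print(foreach_order(items, rules, forward=False))
# plain rules outside of a FOREACH block
print(rule_by_rule_order(items, rules))
```

In the first ordering a later rule already sees the changes that the earlier rule made to the same annotation, before any other annotation is touched.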


>
> 5)
> What exactly is the function of the NEAR() in your Rule 1? Is it there to match only "3", "3-Million" and "3-Million" but not "3-"?


Yes.

(Actually, I would not use NEAR here)


> 6)
> I tried to walk through Rule 1 in my head with "zweitausendeins" and "dreimillionenzweitausendeins":
> This works well for the first example:


Perhaps this rule was not a good example after all.

I have to check it in the context of the block, but AFAIR it would not
be applied to these examples in our rule set (but to others).


Best,


Peter


> (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
> //value = 2
>
>   (Multiplicator{-> num.value = (2 * (POW(10,3)))}
> //value = 2000
>     add2:NumericValue?{-> num.value = (2000 + 1), UNMARK(add2)}));
> //value = 2001
>
>
> But fails for the second:
>
> (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
> //value = 3
>
>   (Multiplicator{-> num.value = (3 * (POW(10,6)))}
> //value = 3000000
>     add2:NumericValue?{-> num.value = (3000000 + 2), UNMARK(add2)})
> //value = 3000002, after 1st iteration
>
>   (Multiplicator{-> num.value = (3000002 * (POW(10,3)))}
> //value = 3000002000
>     add2:NumericValue?{-> num.value = (3000002000+ 1), UNMARK(add2)}));
> //value = 3000002001
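
The arithmetic of the two hand traces above can be reproduced with a short Python sketch. It assumes, as the traces do, that every multiplier is applied to the whole running value; the function names chained and grouped are hypothetical, and grouped is just one possible way to get the intended result, not the way our rule set does it.

```python
def chained(first, steps):
    """Mimic the hand trace: every step multiplies and adds onto the running value."""
    value = first
    for exponent, addend in steps:
        value = value * 10 ** exponent + addend
    return value


def grouped(groups):
    """Scale each (number, exponent) group on its own, then sum the groups."""
    return sum(number * 10 ** exponent for number, exponent in groups)


# "zweitausendeins": 2 * 10^3 + 1
print(chained(2, [(3, 1)]))               # 2001
# "dreimillionenzweitausendeins", chained as in the second trace
print(chained(3, [(6, 2), (3, 1)]))       # 3000002001
# the same word with each multiplier applied only to its own number
print(grouped([(3, 6), (2, 3), (1, 0)]))  # 3002001
```

The chained variant is correct as long as there is only one multiplier; with two, the second multiplier must not be applied to the value accumulated so far.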
>
> On 28.08.19, 13:48, "Peter Klügl" <peter.kluegl@averbis.com> wrote:
>
>     Hi,
>
>
>     we (Averbis) have an annotator which does exactly what you describe, but
>     unfortunately I cannot share it.  However, I can tell you that the annotator
>     is almost completely implemented in Ruta and uses no Ruta language
>     extensions.
>
>
>     If you want to learn more about language extensions, then there are
>     example projects in the Ruta trunk: ruta-core-ext and
>     example-projects/ruta-ep-example-extensions
>
>
>     If you want to build the annotator with Ruta rules, I can help you
>     create it.
>
>
>     As a starting point you need some dictionaries (wordtables) for numbers
>     (ein;1\neins;1\nzwei;2....) , exponents/multiplicators (tausend;3) and
>     special characters (½). For German that's not too much, maybe one
>     hundred entries overall is a good start.
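
As a rough picture of what such dictionaries contain, the following Python sketch loads two semicolon-separated tables and uses the stored exponent the same way POW(10, x) would. The table contents beyond "ein;1", "eins;1", "zwei;2" and "tausend;3" (for example "hundert;2" and "million;6") and the helper load_table are assumptions added for the illustration.

```python
import csv
import io

# Semicolon-separated tables in the style described above.
NUMBERS_CSV = "ein;1\neins;1\nzwei;2\n"
# "hundert;2" and "million;6" are assumed entries, stored as exponents of 10.
MULTIPLIERS_CSV = "hundert;2\ntausend;3\nmillion;6\n"


def load_table(text):
    """Parse 'word;value' lines into a dict of word -> int."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return {row[0]: int(row[1]) for row in reader if row}


numbers = load_table(NUMBERS_CSV)
multipliers = load_table(MULTIPLIERS_CSV)

# "zwei" + "tausend" -> 2 * 10^3
print(numbers["zwei"] * 10 ** multipliers["tausend"])
```

Storing the exponent rather than the full factor keeps the table entries short; storing 1000 directly would work just as well.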
>
>     Before you can apply the dictionaries, you need to split the RutaBasics
>     using some conjunction words in order to map the subword segments. You
>     can do that with a simple regex rule:
>
>     "und" -> ConjunctionFragment;
>
>     Then, you can write some rules that combine numbers using additions,
>     multiplications and exponents, e.g., something like:
>
>
>     FOREACH(num, false) NumericValue{}{
>
>             // combination with multipliers like 3 million
>             (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)} SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
>                 (
>                     Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))}
>                     add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(add2)}
>                 )*);
>
>
>             // fünfundzwanzig
>             (num{PARTOF(W)-> SHIFT(NumericValue,1,3)} ConjunctionFragment
>                 add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(add)})
>                 {-> num.value = (num.value + add.value)};
>
>     }
>
>
>     At the end you get about 200 lines of Ruta ...
>
>
>
>
>     Best,
>
>
>     Peter
>
>     On 27.08.2019 at 16:30, Dominik Terweh wrote:
>     >
>     > Dear All,
>     >
>     >
>     >
>     > When working with German written-out numbers I figured that, in order
>     > to get what I want (the numeric value of a written number), I need to
>     > either hard-code every single number name and use a wordtable, or I need
>     > to work with the string. However, this made me think that this
>     > would probably be better done in a language extension. Unfortunately, I
>     > am not sure how these work and how I can include them in my project.
>     > Also, the manual did not really help me there
>     > (https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.extensions).
>     >
>     >
>     >
>     >
>     > Further, I was wondering if there are any readily available extensions
>     > that can be used, e.g. to convert a string of number words into actual
>     > numbers (or to replace words on a dictionary basis, such as “one”: “1”,
>     > “two”: “2”, …), or an extension that can evaluate a calculation in the
>     > form of a string (like “100*5+55”). If something exists for number
>     > conversion, it would be interesting to see if it does both annotation
>     > and calculation, and how it handles different languages, such as:
>     >
>     > 1) input is one token (like numbers in German: einundzwanzig)
>     >
>     > 2) input is several tokens jointly representing one number (like in
>     > English: twenty two)
>     >
>     > And mixed cases such as:
>     >
>     > 3) input is combination of number and string (like: 10 Millionen)
>     >
>     >
>     >
>     > Thank you in advance for your help,
>     >
>     > Best
>     >
>     > Dominik
>     >
>     > Dominik Terweh
>     > Praktikant
>     >
>     > *Drooms GmbH*
>     > Eschersheimer Landstraße 6
>     > 60322 Frankfurt, Germany
>     > www.drooms.com <http://www.drooms.com>
>     >
>     > Phone:
>     > Mail: d.terweh@drooms.com <mailto:d.terweh@drooms.com>
>     >
>     > *Drooms GmbH*; Sitz der Gesellschaft / Registered Office:
>     > Eschersheimer Landstr. 6, D-60322 Frankfurt am Main; Geschäftsführung
>     > / Management Board: Alexandre Grellier;
>     > Registergericht / Court of Registration: Amtsgericht Frankfurt am
>     > Main, HRB 76454; Finanzamt / Tax Office: Finanzamt Frankfurt am Main,
>     > USt-IdNr.: DE 224007190
>     >
>     --
>     Dr. Peter Klügl
>     R&D Text Mining/Machine Learning
>
>     Averbis GmbH
>     Salzstr. 15
>     79098 Freiburg
>     Germany
>
>     Fon: +49 761 708 394 0
>     Fax: +49 761 708 394 10
>     Email: peter.kluegl@averbis.com
>     Web: https://averbis.com
>
>     Headquarters: Freiburg im Breisgau
>     Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>     Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>
>
>