From: Nikolai Krot
Date: Thu, 29 Aug 2019 11:39:38 +0200
Subject: Re: Using extensions
To: user@uima.apache.org

Hi Peter,

From *your* perspective, for this particular task of turning written-out
numbers into their numerical representation, which would be better: to
implement it as a language extension (= one additional function) or as a
set of Ruta rules?

One argument against a language extension is that such a conversion may be
language-dependent, that is, it does not generalize well. On the other
hand, a language extension may be faster than plain Ruta rules. Is the
implementation of this functionality that you have at your company good in
terms of speed?

Best regards,
Nikolai KROT

On Wed, Aug 28, 2019 at 1:48 PM Peter Klügl wrote:

> Hi,
>
> we (Averbis) have an annotator which does exactly what you describe, but
> unfortunately I cannot share it. However, I can tell that the annotator
> is almost completely implemented in Ruta and uses no Ruta language
> extensions.
>
> If you want to learn more about language extensions, then there are
> example projects in the Ruta trunk: ruta-core-ext and
> example-projects/ruta-ep-example-extensions
>
> If you want to build the annotator with Ruta rules, I can help you
> create it.
>
> As a starting point you need some dictionaries (wordtables) for numbers
> (ein;1\neins;1\nzwei;2....), exponents/multiplicators (tausend;3) and
> special characters (½). For German that's not too much, maybe one
> hundred entries overall is a good start.
>
> Before you can apply the dictionaries, you need to split the RutaBasics
> using some conjunction words in order to map the subword segments.
> You can do that with a simple regex rule:
>
> "und" -> ConjunctionFragment;
>
> Then, you can write some rules that combine numbers using additions,
> multiplications and exponents, e.g., something like:
>
> FOREACH(num, false) NumericValue{}{
>
>     // combination with multipliers like 3 million
>     (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
>         SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
>         (
>             Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))}
>             add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(add2)}
>         )*);
>
>     // fünfundzwanzig
>     (num{PARTOF(W)-> SHIFT(NumericValue,1,3)} ConjunctionFragment
>         add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(add)})
>         {-> num.value = (num.value + add.value)};
>
> }
>
> At the end you get about 200 lines of Ruta ...
>
> Best,
>
> Peter
>
> On 27.08.2019 at 16:30, Dominik Terweh wrote:
> >
> > Dear All,
> >
> > When working with German written-out numbers I figured that, in order
> > to get what I want (the numeric value of a written number), I need to
> > either hard-code every single number name and use a Wordtable or I need
> > to work with the string. However, this made me think that this
> > would probably be better done in a Language Extension. Unfortunately I
> > am not sure how these work and how I can include them in my project.
> > Also the manual did not really help me there
> > (https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.extensions).
> >
> > Further I was wondering if there are any readily available extensions
> > that can be used, e.g.
> > to convert a string of number words into actual
> > numbers (or replacing words on a dictionary basis, such as “one”:“1”,
> > “two”:“2”,…), or an extension that can evaluate a calculation in the
> > form of a string (like “100*5+55”). If something exists for number
> > conversion it would be interesting to see if it does both, annotation
> > and calculation, and how it handles different languages such as:
> >
> > 1) input is one token (like numbers in German, einundzwanzig)
> >
> > 2) input is several tokens jointly representing one number (like in
> > English: twenty two)
> >
> > And mixed cases such as:
> >
> > 3) input is a combination of number and string (like: 10 Millionen)
> >
> > Thank you in advance for your help,
> >
> > Best
> >
> > Dominik
> >
> > Dominik Terweh
> > Intern
> >
> > *Drooms GmbH*
> > Eschersheimer Landstraße 6
> > 60322 Frankfurt, Germany
> > www.drooms.com
> >
> --
> Dr. Peter Klügl
> R&D Text Mining/Machine Learning
>
> Averbis GmbH
> Web: https://averbis.com
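
For illustration only: the combination logic described in Peter's quoted
mail (word tables, a split at "und", then addition and multiplication by
powers of ten) can be sketched outside Ruta in a few lines of Python. This
is not the Averbis annotator and not a Ruta language extension; the table
contents, the function names word_value and number_value, and the rule that
exponents of 3 or more close a group are all assumptions made for this
sketch.

```python
# Hypothetical word tables, modelled on the "ein;1" and "tausend;3"
# entries mentioned in the quoted mail. Deliberately incomplete.
NUMBERS = {
    "ein": 1, "eins": 1, "zwei": 2, "drei": 3, "vier": 4, "fünf": 5,
    "sechs": 6, "sieben": 7, "acht": 8, "neun": 9, "zehn": 10,
    "zwanzig": 20, "dreißig": 30, "vierzig": 40, "fünfzig": 50,
}
EXPONENTS = {"hundert": 2, "tausend": 3, "million": 6, "millionen": 6}


def word_value(word):
    """Return the value of a single token, or None if it is unknown."""
    if word.isdecimal():          # plain digits, as in "10 Millionen"
        return int(word)
    if word in NUMBERS:
        return NUMBERS[word]
    # Split at the conjunction "und" and add both parts: fünf + zwanzig.
    left, sep, right = word.partition("und")
    if sep and left in NUMBERS and right in NUMBERS:
        return NUMBERS[left] + NUMBERS[right]
    return None


def number_value(text):
    """Combine whitespace-separated tokens into one integer."""
    total = 0     # sum of groups already closed by tausend/million
    current = 0   # value of the group being built
    for token in text.lower().split():
        if token in EXPONENTS:
            exponent = EXPONENTS[token]
            # A bare multiplier ("tausend") counts as 1 * 10**exponent.
            scaled = (current or 1) * 10 ** exponent
            if exponent >= 3:
                total, current = total + scaled, 0
            else:
                current = scaled
        else:
            value = word_value(token)
            if value is None:
                raise ValueError(f"unknown number word: {token!r}")
            current += value
    return total + current
```

The sketch only handles whitespace-separated tokens plus single "und"
compounds, so one-word forms such as "dreitausend" raise ValueError; they
would need the kind of subword splitting that the Ruta rules perform on
RutaBasics.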