From: José Tomás Atria
Date: Fri, 22 Feb 2019 13:40:21 -0300
Subject: Re: XML files as input to UIMA?
To: user@uima.apache.org

I implemented a fairly general XML collection reader using a SAX parser
that takes a handler resource that can implement the necessary logic for
dealing with the idiosyncrasies of different encoding schemes. It was
originally based on DKPro's XML readers, which are also very easy to adapt
to different formats.

Mine is available here:
https://github.com/jtatria/lector/tree/master/src/main/java/edu/columbia/incite/uima/io

It uses two components to implement a given format's logic: a "TextFilter"
resource to normalize SOFA text from XML character data, and a
"MappingProvider" that implements the logic needed to process XML elements
(typically by mapping them to UIMA annotations).

If you have already coded all the logic for dealing with your source
material, it should not be too hard to adapt it for use with these
components.

Hope it is of some use. I'd be happy to answer any questions you may have.

best,
jta

On Fri, Feb 22, 2019 at 9:17 AM Bonnie MacKellar wrote:

> Thanks so much!
>
> Bonnie MacKellar
>
> On Fri, Feb 22, 2019 at 7:03 AM Erik Fäßler wrote:
>
> > Hey,
> >
> > just wanted to say that I didn’t get around to making the component
> > available yet, will do first thing next week!
> >
> > Best,
> >
> > Erik
> >
> > > On 20. Feb 2019, at 19:47, Bonnie MacKellar wrote:
> > >
> > > Hi,
> > > Yes, we are using that format. I have a parser that I wrote, but it
> > > isn't integrated into UIMA. It runs separately and loads the full
> > > clinical trial data into a triplestore (Stardog). I would be
> > > interested in your system since I am not really familiar with how to
> > > write file readers in the UIMA framework. Perhaps I can merge my
> > > parser into it and end up with just the right thing.
> > > If you can make it available, I would definitely be interested. I
> > > will take a look at the other links as well. Thanks!!
> > >
> > > Bonnie MacKellar
> > >
> > > On Wed, Feb 20, 2019 at 3:54 AM Erik Fäßler wrote:
> > >
> > >> Dear Bonnie,
> > >>
> > >> are you talking about the clinical trial XML format used by
> > >> ClinicalTrials.gov by any chance?
> > >> If so, I did create a UIMA reader for these data. It's not perfect
> > >> but perhaps enough for your purposes, and you might also want to
> > >> enhance it. Please let me know if you would be interested in that.
> > >> I did not get around to making it publicly available yet but could
> > >> do so quickly.
> > >>
> > >> To answer the general question to the best of my knowledge:
> > >> There is no such thing as a general XML reader built into the UIMA
> > >> framework. For all non-trivial formats, a specific reader is
> > >> necessary. This also holds true with regard to the employed type
> > >> system.
> > >> That being said, there are UIMA readers that try to serve as a
> > >> general XML reading facility, e.g. the “XML Reader” from our lab
> > >> (JULIELab,
> > >> https://github.com/JULIELab/jcore-base/tree/master/jcore-xml-reader).
> > >> However, in my experience XML inputs come in a lot of different
> > >> forms which might often not be suitable to a generic approach,
> > >> which is why I wrote quite a few UIMA readers for specific XML
> > >> formats in the past.
> > >>
> > >> Hope that helps,
> > >>
> > >> Erik
> > >>
> > >>> On 20. Feb 2019, at 01:13, Bonnie MacKellar wrote:
> > >>>
> > >>> This is probably a very naive question, but I can't seem to find
> > >>> anything about this. I currently have a lot of XML files (clinical
> > >>> trial descriptions). My current workflow is to run a preprocessor
> > >>> that parses the XML and generates text files in a simple format.
> > >>> I then run these files in a UIMA pipeline, using
> > >>> FileCollectionReader to load the text files, RUTA to parse the
> > >>> simple format, the MetaMap annotator to do some UMLS annotations,
> > >>> and finally I have a writer that generates RDF triples from the
> > >>> UIMA annotations and loads the triples into a database. This has
> > >>> worked but is clunky, especially the preprocessing. I feel like
> > >>> there has to be a better way. Is there any support for reading XML
> > >>> files or do I need to write my own CollectionReader? Are there any
> > >>> other tools within UIMA for handling XML text?
> > >>>
> > >>> thanks,
> > >>> Bonnie MacKellar

-- 
entia non sunt multiplicanda praeter necessitatem
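
As a rough illustration of the SAX-based design described in the reply at the top of this thread (a text filter for XML character data plus a mapping from XML elements to annotations), here is a minimal sketch that uses only the JDK's SAX parser. It is not the lector code and does not use the UIMA API: the names XmlSpanSketch, Span, TextFilter, SpanHandler and parse are all hypothetical, and a plain Span object stands in for a UIMA annotation. In a real reader, the collected text would presumably become the document text of the CAS and each span would be turned into an annotation of a type from your type system.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: every name here is illustrative, not lector's or UIMA's API.
class XmlSpanSketch {

    /** Stand-in for an annotation: element name plus character offsets into the text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Plays the role of a text filter: normalizes character data before it is appended. */
    interface TextFilter {
        String filter(String chars);
    }

    /** SAX handler that accumulates the document text and records one Span per element. */
    static final class SpanHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final List<Span> spans = new ArrayList<>();
        private final Deque<Integer> begins = new ArrayDeque<>();
        private final TextFilter filter;

        SpanHandler(TextFilter filter) {
            this.filter = filter;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Remember where this element's text starts.
            begins.push(text.length());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Filter the raw character data, then append it to the document text.
            text.append(filter.filter(new String(ch, start, length)));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Close the element: its span covers all text appended since startElement.
            spans.add(new Span(qName, begins.pop(), text.length()));
        }
    }

    /** Parses an XML string and returns the handler holding the text and spans. */
    static SpanHandler parse(String xml, TextFilter filter) throws Exception {
        SpanHandler handler = new SpanHandler(filter);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<trial><title>Aspirin study</title><status>Recruiting</status></trial>";
        // Example filter: collapse runs of whitespace into a single space.
        SpanHandler h = parse(xml, s -> s.replaceAll("\\s+", " "));
        System.out.println(h.text);
        for (Span s : h.spans) {
            System.out.println(s.type + " [" + s.begin + ", " + s.end + ")");
        }
    }
}
```

Keeping the text normalization and the element handling behind separate pieces is what would let an existing format-specific parser be reused: only the per-element logic needs to change for a new XML format, while the offset bookkeeping stays the same.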