From: Bonnie MacKellar
Date: Tue, 19 Feb 2019 19:13:55 -0500
Subject: XML files as input to UIMA?
To: user@uima.apache.org

This is probably a very naive question, but I can't seem to find anything about this.
I currently have a lot of XML files (clinical trial descriptions). My current workflow is to run a preprocessor that parses the XML and generates text files in a simple format. I then run these files through a UIMA pipeline: FileCollectionReader loads the text files, RUTA parses the simple format, the MetaMap annotator adds UMLS annotations, and finally a writer generates RDF triples from the UIMA annotations and loads them into a database.

This has worked but is clunky, especially the preprocessing. I feel like there has to be a better way. Is there any support for reading XML files, or do I need to write my own CollectionReader? Are there any other tools within UIMA for handling XML text?

Thanks,
Bonnie MacKellar
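
To make the preprocessing step concrete, here is a minimal sketch of the kind of XML-to-text flattening described above. It is not the actual preprocessor: the choice of Python, the element names, the `LABEL: text` output format, and the names `SAMPLE_XML`, `FIELDS`, and `xml_to_simple_text` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; real clinical trial files will use
# their own element names and nesting.
SAMPLE_XML = """<clinical_study>
  <brief_title>Example Trial</brief_title>
  <eligibility>
    <criteria>Inclusion: adults aged 18 or older. Exclusion: prior treatment.</criteria>
  </eligibility>
</clinical_study>"""

# Hypothetical mapping from output labels to element paths
# (relative to the document root).
FIELDS = {
    "TITLE": "brief_title",
    "CRITERIA": "eligibility/criteria",
}


def xml_to_simple_text(xml_string, fields=FIELDS):
    """Flatten selected XML elements into 'LABEL: text' lines."""
    root = ET.fromstring(xml_string)
    lines = []
    for label, path in fields.items():
        element = root.find(path)
        if element is None:
            continue  # skip fields that are absent from this record
        # itertext() also collects text nested inside child elements;
        # split/join collapses runs of whitespace and newlines.
        text = " ".join("".join(element.itertext()).split())
        lines.append(f"{label}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_simple_text(SAMPLE_XML))
```

Each output line carries a label that a downstream rule (for example in RUTA) could match on. The same field extraction could in principle live inside a custom CollectionReader instead of a separate preprocessing pass.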