Subject: AW: Working with very large text documents
Date: Fri, 18 Oct 2013 13:58:12 +0000
Reply-To: user@uima.apache.org

Dear Jens, dear Richard,

Looks like I have to use a log-file-specific pipeline. The problem was that I did not know that before the process crashed. It would be so nice to have a general approach.

Thanks,
Armin

-----Original Message-----
From: Richard Eckart de Castilho [mailto:rec@apache.org]
Sent: Friday, 18 October 2013 12:32
To: user@uima.apache.org
Subject: Re: Working with very large text documents

Hi Armin,

that's a good point. It's also an issue with UIMA then, because the begin/end offsets are likewise int values.

If it is a log file, couldn't you split it into sections of e.g. one CAS per day and analyze each one? If there are long-distance relations that span days, you could add a second pass which reads in all analyzed CASes for a rolling window of e.g. 7 days and tries to find the long-distance relations in that window.

-- Richard

On 18.10.2013, at 10:48, Armin.Wegner@bka.bund.de wrote:

> Hi Richard,
>
> As far as I know, Java strings cannot be longer than 2 GB on 64-bit VMs.
>
> Armin
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:rec@apache.org]
> Sent: Friday, 18 October 2013 10:43
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
>
> On 18.10.2013, at 10:06, Armin.Wegner@bka.bund.de wrote:
>
>> Hi,
>>
>> What do you do with very large text documents in a UIMA pipeline, for example 9 GB in size?
>
> In that order of magnitude, I'd probably try to get a computer with
> more memory ;)
>
>> A. I expect that you split the large file before putting it into the pipeline. Or do you use a multiplier in the pipeline to split it? Anyway, where do you split the input file? You cannot just split it anywhere. There is a real risk of breaking the content. Is there a preferred chunk size for UIMA?
>
> The chunk size would likely not depend on UIMA, but rather on the machine you are using. If you cannot split the data in defined locations, maybe you can use a windowing approach where two splits have a certain overlap?
>
>> B. Another possibility might be not to save the data in the CAS at all and to use a URI reference instead. It is then up to the analysis engine how to load the data. My first idea was to use java.util.Scanner for regular expressions, for example. But I think that you need to have the whole text loaded to iterate over annotations. Or is it just AnnotationFS.getCoveredText() that would not work? Any suggestions here?
>
> No idea unfortunately, never used the stream so far.
>
> -- Richard
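
For illustration, the windowing idea from the thread (splits that share a certain overlap) could look roughly like the sketch below. It is only a sketch under assumptions: the input is line-oriented (as with a log file), the chunk size is counted in lines, and the class name OverlappingChunker and its parameters are made up for this example and are not part of UIMA. It uses only the Java standard library and hands each chunk to a callback so that the whole file never has to sit in memory at once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not a UIMA class): splits line-oriented text into overlapping chunks. */
class OverlappingChunker {

    /**
     * Reads lines from the reader and passes chunks of at most maxLines lines to the sink.
     * Every chunk after the first begins with the last overlapLines lines of the previous
     * chunk, so that relations crossing a split point are still visible in one chunk.
     */
    static void chunk(BufferedReader reader, int maxLines, int overlapLines, Consumer<String> sink) {
        if (maxLines <= 0 || overlapLines < 0 || overlapLines >= maxLines) {
            throw new IllegalArgumentException("need 0 <= overlapLines < maxLines");
        }
        List<String> current = new ArrayList<>();
        int fresh = 0; // lines in 'current' that were not carried over as overlap
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                current.add(line);
                fresh++;
                if (current.size() == maxLines) {
                    sink.accept(String.join("\n", current));
                    // Carry the tail of this chunk over as the head of the next one.
                    current = new ArrayList<>(current.subList(maxLines - overlapLines, maxLines));
                    fresh = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the remainder only if it holds lines that have not been emitted yet.
        if (fresh > 0) {
            sink.accept(String.join("\n", current));
        }
    }
}
```

In a pipeline, each chunk passed to the sink would presumably become the document text of its own CAS, for example in a collection reader or a CAS multiplier. That part is left out here because it depends on the concrete setup. Annotations found in the overlapping lines will appear twice and would have to be de-duplicated in a later step.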