Subject: Re: Working with very large text documents
From: Richard Eckart de Castilho
Date: Fri, 18 Oct 2013 12:38:11 +0200
To: user@uima.apache.org

Well, assuming this would e.g. be a server log, you could want to notice that some IP or set of IPs tried to log in with different user accounts across an extended period of time. So even if there is no linguistic relationship here, there is definitely a relationship that a security person would want to be able to discover. But that may be a secondary step after parsing the individual log lines.

-- Richard

On 18.10.2013, at 12:34, Jens Grivolla wrote:

> Ok, but then log files are usually very easy to split since they normally consist of independent lines. So you could just have one document per day or whatever gets it down to a reasonable size, without the risk of breaking grammatical or semantic relationships.
>
> On 10/18/2013 12:25 PM, Armin Wegner wrote:
>> Hi Jens,
>>
>> It's a log file.
>>
>> Cheers,
>> Armin
>>
>> -----Original Message-----
>> From: Jens Grivolla [mailto:j+asf@grivolla.net]
>> Sent: Friday, 18 October 2013 11:05
>> To: user@uima.apache.org
>> Subject: Re: Working with very large text documents
>>
>> On 10/18/2013 10:06 AM, Armin Wegner wrote:
>>
>>> What are you doing with very large text documents in a UIMA pipeline, for example 9 GB in size?
>>
>> Just out of curiosity, how can you possibly have 9GB of text that represents one document? From a quick look at Project Gutenberg it seems that a full book with HTML markup is about 500kB to 1MB, so that's about a complete public library full of books.
>>
>> Bye,
>> Jens
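
The two ideas in this thread (split the log into per-day chunks, then look for relationships across lines as a second step) could be sketched roughly as follows. This is only an illustration in plain Python, not UIMA API: the log line format, the regular expression, and the names `LINE_RE`, `split_by_day` and `suspicious_ips` are all assumptions, and a real log would need its own parser.

```python
import re
from collections import defaultdict
from itertools import groupby

# Hypothetical log format, assumed for illustration only:
#   2013-10-18 10:38:13 LOGIN_FAILED user=alice ip=10.0.0.7
LINE_RE = re.compile(
    r"^(?P<day>\d{4}-\d{2}-\d{2}) \S+ LOGIN_FAILED "
    r"user=(?P<user>\S+) ip=(?P<ip>\S+)$"
)


def split_by_day(lines):
    """Yield (day, lines_of_that_day) for a chronologically ordered log.

    The input is consumed lazily, so only one day's lines are held in
    memory at a time rather than the whole file.
    """
    for day, group in groupby(lines, key=lambda line: line[:10]):
        yield day, list(group)


def suspicious_ips(lines, min_users=3):
    """Return {ip: set_of_users} for IPs that failed to log in as at
    least `min_users` different accounts, counted across all days."""
    users_by_ip = defaultdict(set)
    for _day, day_lines in split_by_day(lines):
        # Each per-day chunk could be handled as one small document here.
        for line in day_lines:
            match = LINE_RE.match(line.rstrip("\n"))
            if match:
                users_by_ip[match["ip"]].add(match["user"])
    return {
        ip: users
        for ip, users in users_by_ip.items()
        if len(users) >= min_users
    }
```

Because an open file object yields its lines lazily, passing one to `suspicious_ips` should avoid loading the whole log at once; only the per-IP summary is kept across days, which is what allows the cross-line relationship to be found after the split.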