Mailing-List: contact user-help@uima.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@uima.apache.org
Received-SPF: pass (athena.apache.org: local policy)
From: <Armin.Wegner@bka.bund.de>
To: <user@uima.apache.org>
Subject: Re: Working with very large text documents
Thread-Topic: Working with very large text documents
Thread-Index: AQHOy+1HJS4WFOa67kKoalieqPYPJJn6cOeA
Date: Fri, 18 Oct 2013 13:58:12 +0000
Message-ID: <A734E21E566ADB40A70F2EBC5D3C2EB0157D5CD7@SWMMBX12.bk.bka.bund.de>
References: <A734E21E566ADB40A70F2EBC5D3C2EB0157D5C7C@SWMMBX12.bk.bka.bund.de>
 <88BA6AA7-F9E3-4003-8078-1317C75C06F0@apache.org>
 <A734E21E566ADB40A70F2EBC5D3C2EB0157D5C93@SWMMBX12.bk.bka.bund.de>
 <BE60C783-7E28-41A5-9D5C-DF1C0583BBD5@apache.org>
In-Reply-To: <BE60C783-7E28-41A5-9D5C-DF1C0583BBD5@apache.org>
Accept-Language: de-DE, en-US
MIME-Version: 1.0
Content-Type: multipart/signed;
 boundary="bar92f834cf79adc84178d821b8777fb942"; micalg=pgp-sha1;
 protocol="application/pgp-signature"

--bar92f834cf79adc84178d821b8777fb942
Content-Language: de-DE
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Dear Jens, dear Richard,

It looks like I have to use a log-file-specific pipeline. The problem was
that I did not know this before the process crashed. It would be so nice
to have a general approach.

Thanks,
Armin

-----Original Message-----
From: Richard Eckart de Castilho [mailto:rec@apache.org]
Sent: Friday, 18 October 2013 12:32
To: user@uima.apache.org
Subject: Re: Working with very large text documents

Hi Armin,

That's a good point. It's also an issue with UIMA then, because the
begin/end offsets are likewise int values.
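
To make the limit concrete, here is a small sketch in plain Java. It is not
UIMA code; the class name OffsetLimit and the helper fitsIntOffsets are made
up for illustration, and the 9 GB figure is treated as 9 GiB with one
character per byte (as in a single-byte encoding), which is an assumption.

```java
class OffsetLimit {
    /** True if a text of charCount characters can be addressed with int offsets. */
    static boolean fitsIntOffsets(long charCount) {
        return charCount <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // Assumption: 9 GB means 9 GiB and every byte decodes to one char.
        long nineGb = 9L * 1024 * 1024 * 1024;

        System.out.println("Integer.MAX_VALUE  = " + Integer.MAX_VALUE);
        System.out.println("chars in the file  = " + nineGb);
        System.out.println("fits int offsets?  = " + fitsIntOffsets(nineGb));

        // A plain cast does not fail, it silently wraps around,
        // so an out-of-range offset would point to the wrong place.
        System.out.println("(int) nineGb       = " + (int) nineGb);
    }
}
```

So even with enough memory, a single document of that size could not be
addressed by int offsets.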

If it is a log file, couldn't you split it into sections, e.g. one CAS per
day, and analyze each one separately? If there are long-distance relations
that span days, you could add a second pass that reads in all analyzed
CASes for a rolling window of, e.g., 7 days and tries to find the
long-distance relations in that window.
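
The two passes could be sketched roughly as follows. This is plain Java
rather than UIMA code: the class DailySplit and its methods are hypothetical,
each day's text stands in for what would become one CAS, and every log line
is assumed to start with an ISO date (yyyy-MM-dd). In a real pipeline each
day would be handed on as soon as the date changes instead of being kept in
memory.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class DailySplit {
    /** First pass: group log lines by the date assumed to start each line. */
    static TreeMap<LocalDate, StringBuilder> splitByDay(Iterable<String> lines) {
        TreeMap<LocalDate, StringBuilder> days = new TreeMap<>();
        for (String line : lines) {
            LocalDate day = LocalDate.parse(line.substring(0, 10));
            days.computeIfAbsent(day, d -> new StringBuilder())
                .append(line).append('\n');
        }
        return days;
    }

    /**
     * Second pass: for each day with data, list the days with data that fall
     * into the window of `size` calendar days ending on that day.
     */
    static <V> List<List<LocalDate>> rollingWindows(TreeMap<LocalDate, V> days, int size) {
        List<List<LocalDate>> windows = new ArrayList<>();
        for (LocalDate end : days.keySet()) {
            LocalDate start = end.minusDays(size - 1);
            windows.add(new ArrayList<LocalDate>(
                days.subMap(start, true, end, true).keySet()));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "2013-10-01 service started",
            "2013-10-01 user login",
            "2013-10-05 user logout",
            "2013-10-12 service stopped");
        TreeMap<LocalDate, StringBuilder> days = splitByDay(log);
        for (List<LocalDate> window : rollingWindows(days, 7)) {
            System.out.println(window);
        }
    }
}
```

The second pass then only ever has to hold one window of analyzed days at a
time, not the whole file.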

-- Richard

On 18.10.2013, at 10:48, Armin.Wegner@bka.bund.de wrote:

> Hi Richard,
>
> As far as I know, Java strings cannot be longer than 2 GB on 64-bit VMs.
>
> Armin
>
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:rec@apache.org]
> Sent: Friday, 18 October 2013 10:43
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
>
> On 18.10.2013, at 10:06, Armin.Wegner@bka.bund.de wrote:
>
>> Hi,
>>
>> What do you do with very large text documents in a UIMA pipeline, for
>> example 9 GB in size?
>
> At that order of magnitude, I'd probably try to get a computer with
> more memory ;)
>
>> A. I expect that you split the large file before putting it into the
>> pipeline. Or do you use a CAS multiplier in the pipeline to split it?
>> Either way, where do you split the input file? You cannot just split it
>> anywhere; there is a real risk of breaking the content. Is there a
>> preferred chunk size for UIMA?
>
> The chunk size would likely not depend on UIMA, but rather on the machine
> you are using. If you cannot split the data at defined locations, maybe
> you can use a windowing approach where two adjacent splits have a certain
> overlap?
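
A minimal sketch of such overlapping windows, in plain Java and independent
of UIMA. The class OverlapChunker and its parameters are hypothetical; the
window size and overlap would have to be chosen for the machine and the
data. Offsets are long values here because the whole file does not fit the
int range.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapChunker {
    /**
     * Returns {start, end} ranges (end exclusive) covering [0, total).
     * Each range is `size` long, except possibly the last one, and
     * consecutive ranges share `overlap` positions.
     */
    static List<long[]> windows(long total, long size, long overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need 0 <= overlap < size");
        }
        List<long[]> result = new ArrayList<>();
        long step = size - overlap;
        for (long start = 0; start < total; start += step) {
            long end = Math.min(start + size, total);
            result.add(new long[] {start, end});
            if (end == total) {
                break; // the last window already reaches the end of the input
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy numbers for readability: 10 positions, windows of 4, overlap 1.
        for (long[] w : windows(10, 4, 1)) {
            System.out.println(w[0] + " .. " + w[1]);
        }
    }
}
```

Any span no longer than the overlap lies completely inside at least one
window, so it is not cut in half. The price is that results found in the
overlapping region show up twice and need to be deduplicated afterwards.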
>
>> B. Another possibility might be not to store the data in the CAS at all
>> and to use a URI reference instead. It is then up to the analysis engine
>> how to load the data. My first idea was to use java.util.Scanner for
>> regular expressions, for example. But I think you need to have the whole
>> text loaded to iterate over annotations. Or is it just
>> AnnotationFS.getCoveredText() that would not work? Any suggestions here?
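
One way to picture option B is the sketch below, which matches a regular
expression against a stream without loading the whole text. It uses only
java.io and java.util.regex, not any UIMA API; the class StreamingMatcher is
hypothetical. It assumes that matches never span a line break and that
lines end with a single '\n'. The offsets are long values, so they could
not be stored directly in annotations with int begin/end.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamingMatcher {
    /** Scans line by line and returns {begin, end} offsets of all matches. */
    static List<long[]> find(Reader input, Pattern pattern) throws IOException {
        List<long[]> spans = new ArrayList<>();
        BufferedReader reader = new BufferedReader(input);
        long lineStart = 0; // offset of the current line within the stream
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                spans.add(new long[] {lineStart + m.start(), lineStart + m.end()});
            }
            lineStart += line.length() + 1; // assumes a single '\n' terminator
        }
        return spans;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a reader opened on the referenced URI.
        Reader input = new StringReader("abc ERROR\nok\nERROR x");
        for (long[] span : find(input, Pattern.compile("ERROR"))) {
            System.out.println(span[0] + " .. " + span[1]);
        }
    }
}
```

Only one line is held in memory at a time, but anything that needs the
covered text later has to go back to the source with these offsets.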
>
> No idea, unfortunately; I have never used the stream so far.
>
> -- Richard
>
>


--bar92f834cf79adc84178d821b8777fb942--