uima-user mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: AW: Working with very large text documents
Date Fri, 18 Oct 2013 15:01:08 GMT
Don't you have a Hadoop cluster you can use?  Hadoop would handle the
file splitting for you, and if your UIMA analysis is well-behaved, you
can deploy it as an M/R job, one record at a time.
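
The record-at-a-time idea can be sketched outside Hadoop too; below is a minimal stand-alone illustration (not Hadoop or UIMA code; `analyze` is a hypothetical placeholder for the per-record analysis step). The point is that streaming the log line by line, the way Hadoop's TextInputFormat feeds a mapper, never holds the full 9 GB document in memory:

```python
import os
import tempfile

def iter_records(path, encoding="utf-8"):
    """Yield one log line (record) at a time without loading the whole file."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")

def analyze(record):
    """Placeholder for the per-record UIMA analysis; here it just measures length."""
    return len(record)

if __name__ == "__main__":
    # Demo with a small temporary "log file" standing in for the 9 GB one.
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".log") as tmp:
        tmp.write("first event\nsecond event\n")
        path = tmp.name
    results = [analyze(r) for r in iter_records(path)]
    print(results)
    os.remove(path)
```

With Hadoop, `iter_records` corresponds to the input split/record reader, and `analyze` would run inside the mapper, one record per map call.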

--Thilo

On 10/18/2013 12:25 PM, Armin.Wegner@bka.bund.de wrote:
> Hi Jens,
>
> It's a log file.
>
> Cheers,
> Armin
>
> -----Original Message-----
> From: Jens Grivolla [mailto:j+asf@grivolla.net]
> Sent: Friday, 18 October 2013 11:05
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
>
> On 10/18/2013 10:06 AM, Armin Wegner wrote:
>
>> What are you doing with very large text documents in a UIMA pipeline, for
>> example 9 GB in size?
>
> Just out of curiosity, how can you possibly have 9 GB of text that represents
> one document? From a quick look at Project Gutenberg it seems that a full book
> with HTML markup is about 500 kB to 1 MB, so that's about a complete public
> library full of books.
>
> Bye,
> Jens
>
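
Jens's back-of-the-envelope estimate checks out; a quick calculation using his upper figure of roughly 1 MB per marked-up book:

```python
doc_bytes = 9 * 1024**3    # the 9 GB log file
book_bytes = 1 * 1024**2   # ~1 MB for a full book with HTML markup
books = doc_bytes // book_bytes
print(books)  # → 9216, i.e. on the order of nine thousand books
```

At the 500 kB lower bound the count doubles, so either way it is thousands of books' worth of text in a single document.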

