uima-user mailing list archives

From Richard Eckart de Castilho <...@apache.org>
Subject Re: Working with very large text documents
Date Fri, 18 Oct 2013 08:43:26 GMT
On 18.10.2013, at 10:06, Armin.Wegner@bka.bund.de wrote:

> Hi,
> What do you do with very large text documents in a UIMA pipeline, for example 9
> GB in size?

In that order of magnitude, I'd probably try to get a computer with more memory ;) 

> A. I expect that you split the large file before putting it into the pipeline. Or do
> you use a multiplier in the pipeline to split it? Either way, where do you split the input file?
> You cannot just split it anywhere: there is a real risk of breaking the content.
> Is there a preferred chunk size for UIMA?

The chunk size would likely not depend on UIMA, but rather on the machine you are using. If
you cannot split the data at well-defined locations, maybe you can use a windowing approach where
adjacent splits overlap by a certain amount?
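
To make the windowing idea concrete, here is a minimal sketch in plain Java. It only
computes the [begin, end) character offsets of overlapping windows; it does not use any UIMA
API. The class name OverlappingWindows, the method name windows, and the sizes used in main
are made up for illustration. Offsets are long because a 9 GB input does not fit in an int.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UIMA: computes overlapping window offsets.
class OverlappingWindows {

    // Returns [begin, end) offset pairs covering 0..length, where each window
    // holds at most `size` units and shares `overlap` units with its predecessor.
    static List<long[]> windows(long length, long size, long overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<long[]> result = new ArrayList<>();
        long step = size - overlap; // how far each window start advances
        for (long begin = 0; begin < length; begin += step) {
            long end = Math.min(begin + size, length);
            result.add(new long[] {begin, end});
            if (end == length) {
                break; // the last window already reaches the end of the input
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative values only: 10 units of input, windows of 4, overlap of 1.
        for (long[] w : windows(10, 4, 1)) {
            System.out.println(w[0] + " .. " + w[1]);
        }
    }
}
```

Each window could then be read and fed into the pipeline as its own document. Anything found
inside an overlap region may be reported twice, once per window, so the results would need
to be merged or deduplicated afterwards.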

> B. Another possibility might be not to store the data in the CAS at all and to use a URI
> reference instead. It is then up to the analysis engine how to load the data. My first idea
> was to use java.util.Scanner for regular expressions, for example. But I think you need
> to have the whole text loaded to iterate over annotations. Or is it just AnnotationFS.getCoveredText()
> that would not work? Any suggestions here?

No idea unfortunately, never used the stream so far.

-- Richard
