uima-user mailing list archives

From <Armin.Weg...@bka.bund.de>
Subject Re: Working with very large text documents
Date Fri, 18 Oct 2013 13:58:12 GMT
Dear Jens, dear Richard,

It looks like I have to use a log-file-specific pipeline. The problem was that I did not know that before the process crashed. It would be nice to have a general approach.

Thanks,
Armin

-----Original Message-----
From: Richard Eckart de Castilho [mailto:rec@apache.org]
Sent: Friday, 18 October 2013 12:32
To: user@uima.apache.org
Subject: Re: Working with very large text documents

Hi Armin,

that's a good point. It's also an issue with UIMA then, because the begin/end offsets are
likewise int values.

If it is a log file, couldn't you split it into sections of, e.g., one CAS per day and
analyze each one? If there are long-distance relations that span days, you could add a
second pass which reads in all analyzed CASes for a rolling window of, e.g., 7 days and
tries to find the long-distance relations in that window.
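The two-pass idea above could be sketched roughly as follows in plain Java (no UIMA involved; the class name, the leading-ISO-date assumption, and the day-extraction logic are all made up for illustration, since the real log format is unknown):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class DailyChunker {

    // First pass: group log lines by their leading ISO date
    // (e.g. "2013-10-18 ..."), one future CAS per day.
    static SortedMap<LocalDate, List<String>> groupByDay(List<String> lines) {
        SortedMap<LocalDate, List<String>> byDay = new TreeMap<>();
        for (String line : lines) {
            LocalDate day = LocalDate.parse(line.substring(0, 10));
            byDay.computeIfAbsent(day, d -> new ArrayList<>()).add(line);
        }
        return byDay;
    }

    // Second pass: yield consecutive windows of 'size' days, in which
    // long-distance relations could then be searched.
    static List<List<List<String>>> rollingWindows(
            SortedMap<LocalDate, List<String>> byDay, int size) {
        List<List<String>> days = new ArrayList<>(byDay.values());
        List<List<List<String>>> windows = new ArrayList<>();
        for (int i = 0; i + size <= days.size(); i++) {
            windows.add(days.subList(i, i + size));
        }
        return windows;
    }
}
```

In a real pipeline, each per-day group would become its own CAS, and the rolling window would read the already-analyzed CASes back in.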

-- Richard

On 18.10.2013, at 10:48, Armin.Wegner@bka.bund.de wrote:

> Hi Richard,
> 
> As far as I know, Java strings cannot be longer than 2 GB, even on 64-bit VMs.
> 
> Armin
> 
> -----Original Message-----
> From: Richard Eckart de Castilho [mailto:rec@apache.org]
> Sent: Friday, 18 October 2013 10:43
> To: user@uima.apache.org
> Subject: Re: Working with very large text documents
> 
> On 18.10.2013, at 10:06, Armin.Wegner@bka.bund.de wrote:
> 
>> Hi,
>> 
>> What are you doing with very large text documents in a UIMA pipeline, for example
>> 9 GB in size?
> 
> In that order of magnitude, I'd probably try to get a computer with 
> more memory ;)
> 
>> A. I expect that you split the large file before putting it into the pipeline. Or
>> do you use a multiplier in the pipeline to split it? Anyway, where do you split the
>> input file? You cannot just split it anywhere; there is a real risk of breaking the
>> content. Is there a preferred chunk size for UIMA?
> 
> The chunk size would likely not depend on UIMA, but rather on the machine you are using.
> If you cannot split the data at defined locations, maybe you can use a windowing approach
> where two adjacent splits have a certain overlap?
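A minimal sketch of that overlap idea, assuming fixed character-based windows (the class name and the window/overlap sizes are made up for illustration): each window repeats the tail of the previous one, so content spanning a cut point appears whole in at least one window.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSplitter {

    // Cuts 'text' into windows of at most 'windowSize' characters,
    // where consecutive windows share 'overlap' characters.
    static List<String> split(String text, int windowSize, int overlap) {
        List<String> windows = new ArrayList<>();
        int step = windowSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + windowSize, text.length());
            windows.add(text.substring(start, end));
            if (end == text.length()) {
                break;
            }
        }
        return windows;
    }
}
```

The downside is that anything found inside an overlap region is found twice, so a later merge step would have to deduplicate annotations by offset.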
> 
>> B. Another possibility might be not to store the data in the CAS at all and to use a
>> URI reference instead. It is then up to the analysis engine how to load the data. My
>> first idea was to use java.util.Scanner for regular expressions, for example. But I
>> think that you need to have the whole text loaded to iterate over annotations. Or is
>> it just AnnotationFS.getCoveredText() that does not work? Any suggestions here?
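For the Scanner part of that idea, something like the following sketch would run regular expressions over a stream without loading the whole text (the class and method names are made up; `Scanner.findWithinHorizon` with horizon 0 searches the input unbounded). Whether this combines well with a CAS that only holds a URI reference, e.g. via `CAS.setSofaDataURI`, is a separate question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class StreamingRegex {

    // Collects all matches of 'pattern' from 'source' incrementally,
    // without materializing the whole input as one String.
    static List<String> findAll(Readable source, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            String hit;
            // Horizon 0 means: search without an upper bound.
            while ((hit = scanner.findWithinHorizon(pattern, 0)) != null) {
                matches.add(hit);
            }
        }
        return matches;
    }
}
```

Note that this yields the matched strings but not reliable document-wide offsets, which is exactly what UIMA annotations would need, so the offset problem from the thread remains.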
> 
> No idea, unfortunately; I have never used that approach so far.
> 
> -- Richard
> 
> 

