uima-user mailing list archives

From Thilo Goetz <twgo...@gmx.de>
Subject Re: Content segmentation
Date Tue, 17 Jun 2008 05:46:36 GMT
Yaakov Chaikin wrote:
> Hi,
> 
> I wanted to find out if UIMA has any concept of content segmentation.
> Some of the analysis processing is very memory and CPU intensive and
> if the content happens to be huge (like a book), it will bring the
> server to a crawl.
> 
> So, I was wondering if the UIMA framework has any notion of breaking
> up the content into smaller segments.
> 
> Thanks,
> Yaakov.

As Eddie says, you can split up your long document before you
feed it to UIMA, provided none of your analysis depends on seeing
the whole document at once.  What we've done in the past is to
split documents into 500k chunks.  At each chunk boundary, we
heuristically looked for the nearest likely sentence end and
split the document there.
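The heuristic described above could be sketched roughly like this in
Java.  This helper is purely illustrative and not part of UIMA; the
chunk size, the class name, and the "punctuation followed by
whitespace" sentence-end rule are all assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a long document into chunks of at most
// maxChunk characters, preferring to cut just after what looks like a
// sentence end ('.', '!' or '?' followed by whitespace).
public class DocumentChunker {

    public static List<String> split(String text, int maxChunk) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChunk, text.length());
            if (end < text.length()) {
                // Scan backwards from the hard limit for a sentence end.
                int cut = -1;
                for (int i = end - 1; i > start; i--) {
                    char c = text.charAt(i);
                    if ((c == '.' || c == '!' || c == '?')
                            && i + 1 < text.length()
                            && Character.isWhitespace(text.charAt(i + 1))) {
                        cut = i + 1;  // cut right after the punctuation
                        break;
                    }
                }
                if (cut > start) {
                    end = cut;
                }
                // If no sentence end was found, fall back to a hard cut
                // at maxChunk (end is unchanged).
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }
}
```

Each chunk can then be fed to the analysis chain as its own document;
concatenating the chunks reproduces the original text exactly.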

This very likely takes the same total amount of CPU, but it lets each
analysis run fit into memory, which can otherwise be a problem.

If you run multiple analysis chains in parallel, you may need
to keep track of document and document part IDs in the CAS.  If
your analysis runs in a single thread, that should not be an
issue.

HTH,
Thilo

