Mailing-List: contact user-help@uima.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@uima.apache.org
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\))
Subject: Re: Very long Ruta stream initialization
From: Mario Gazzo <mario.gazzo@gmail.com>
In-Reply-To: <568E2CD3.4040805@averbis.com>
Date: Thu, 7 Jan 2016 10:22:26 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <DB55828E-F1D5-4E0E-BECB-794D1765E5D7@gmail.com>
References: <B78E9FF1-2117-47DD-940C-D3AFAADB4433@gmail.com>
 <D6EEC6E7-0252-4B60-A8EC-74165735751A@gmail.com>
 <5683FBCE.7020807@averbis.com>
 <91BA95B1-DF4C-459F-BF59-7A21F2DC585B@gmail.com>
 <568A987E.1050205@averbis.com>
 <234FB22B-46FA-4572-9FCE-446983715E60@gmail.com>
 <568E1D1B.2040809@averbis.com>
 <7C01BF55-09D2-4C25-8F59-CA4703B77FFE@gmail.com>
 <568E2BCA.2060608@averbis.com>
 <10BD374F-DF7E-41E3-807D-8AEA6EAE5BEB@gmail.com>
 <568E2CD3.4040805@averbis.com>
To: user@uima.apache.org

Yes, where do we sign this?

:-)

> On 07 Jan 2016, at 10:16, Peter Klügl <peter.kluegl@averbis.com> wrote:
>
> :-) let me know if you need help or have any questions.
>
> On 07.01.2016 at 10:12, Mario Gazzo wrote:
>> Yes, let us just sign and submit it.
>>
>>> On 07 Jan 2016, at 10:11, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>
>>> Hi,
>>>
>>> thanks, that would be great. Patches are simply attached to the issue.
>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>> On 07.01.2016 at 10:08, Mario Gazzo wrote:
>>>> Thanks,
>>>>
>>>> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>>>>
>>>> If you like, we can also implement it and submit a patch; just let us know what the process is.
>>>>
>>>> Cheers
>>>> Mario
>>>>
>>>>> On 07 Jan 2016, at 09:08, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> I had a look at the test cases, and I think there are many interesting and useful features that cover many of our use cases, but I will have to experiment with them before I know what might be missing. I have a few questions though:
>>>>>>
>>>>>> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists, but I assume so.
>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>>
>>>>>> 2) The use of addresses is unclear to me just from reading the test; maybe you could explain them? This concept is very new to me.
>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>> just a way to combine logic in Java with Ruta rules or use Ruta
>>>>> functionality in Java code.
>>>>> Let's say we have a new method like
>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>> and you call it with something like (syntax is not yet specified)
>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>> method would return whether the annotation is covered by a Headline
>>>>> annotation and is followed by a Keyword annotation.
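
The substitution step described above could be sketched roughly as follows. This is only an illustration in plain Java: the class AddressPlaceholderSketch, the method bind and the address value 4711 are made up here, no Ruta or UIMA API is used, and the thread itself says the real syntax is not yet specified.

```java
/**
 * Hypothetical illustration only: shows how "$" placeholders in a rule
 * string could be replaced by annotation addresses before the rule is
 * handed to a rule engine. Not part of Ruta or UIMA.
 */
final class AddressPlaceholderSketch {

    /** Replaces each "$" in the template with the next address, in order. */
    static String bind(String ruleTemplate, int... addresses) {
        StringBuilder out = new StringBuilder();
        int next = 0;
        for (int i = 0; i < ruleTemplate.length(); i++) {
            char c = ruleTemplate.charAt(i);
            if (c == '$') {
                if (next >= addresses.length) {
                    throw new IllegalArgumentException("more placeholders than addresses");
                }
                out.append(addresses[next++]);
            } else {
                out.append(c);
            }
        }
        if (next != addresses.length) {
            throw new IllegalArgumentException("more addresses than placeholders");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 4711 stands in for the address of the annotation passed by the caller.
        System.out.println(bind("${PARTOF(Headline)} Keyword;", 4711));
    }
}
```

In a real implementation the address would come from the CAS rather than from a literal, and the bound rule would then be evaluated against that annotation only.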
>>>>>
>>>>>> 3) The annotation feature expression looks nice, but I wonder whether an array element can also be referenced using an int expression and not just a constant, e.g. Struct.as[intVar+1]{->T1};
>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>> The current implementation is just a test in order to check whether the
>>>>> internal object model is good enough to cover it. The complete
>>>>> functionality will probably not be included in the next release since
>>>>> there is still much work left in order to get it up and running. The
>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>> the code does not support expressions at all. I still have to think
>>>>> about a way to implement it.
>>>>>
>>>>>> The label expressions are also useful and will make some of our rules more readable.
>>>>>>
>>>>>> Finally, I have one additional question about the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder, but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation, but obviously I do not want to maintain this. It looks as if they could also be split into two seeders, with both added as defaults, and then I could override the list with my own seeder list containing only the token seeder.
>>>>> Yes, we can split them or just add another one that ignores markup. I
>>>>> was also always thinking about adding a DetailedSeeder that creates much
>>>>> more fine-grained types like different brackets and quotes... but it was
>>>>> never on top of my todo list.
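
The proposed split could look roughly like the sketch below. Everything in it is an assumption made for illustration: the Seeder interface, TokenSeeder, MarkupSeeder and the string output format are plain-Java stand-ins and do not mirror the actual Ruta seeder classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a default seeder split into two independent seeders. */
final class SeederSplitSketch {

    /** Stand-in for a seeder; returns "TYPE:text" strings instead of annotations. */
    interface Seeder {
        List<String> seed(String text);
    }

    /** Emits whitespace-separated tokens and skips everything inside <...>. */
    static final class TokenSeeder implements Seeder {
        public List<String> seed(String text) {
            List<String> out = new ArrayList<>();
            boolean inTag = false;
            int start = -1;
            for (int i = 0; i <= text.length(); i++) {
                char c = i < text.length() ? text.charAt(i) : ' ';
                if (c == '<') {
                    inTag = true;
                }
                boolean tokenChar = !inTag && !Character.isWhitespace(c);
                if (tokenChar && start < 0) {
                    start = i;
                }
                if (!tokenChar && start >= 0) {
                    out.add("TOKEN:" + text.substring(start, i));
                    start = -1;
                }
                if (c == '>') {
                    inTag = false;
                }
            }
            return out;
        }
    }

    /** Emits one entry per <...> tag. */
    static final class MarkupSeeder implements Seeder {
        public List<String> seed(String text) {
            List<String> out = new ArrayList<>();
            int open = text.indexOf('<');
            while (open >= 0) {
                int close = text.indexOf('>', open);
                if (close < 0) {
                    break;
                }
                out.add("MARKUP:" + text.substring(open, close + 1));
                open = text.indexOf('<', close + 1);
            }
            return out;
        }
    }

    /** Both seeders are active by default; callers may pass a shorter list. */
    static final List<Seeder> DEFAULT_SEEDERS =
            Arrays.<Seeder>asList(new TokenSeeder(), new MarkupSeeder());

    static List<String> run(List<Seeder> seeders, String text) {
        List<String> out = new ArrayList<>();
        for (Seeder seeder : seeders) {
            out.addAll(seeder.seed(text));
        }
        return out;
    }
}
```

With this shape, a pipeline that only needs token seeds would pass a list containing just the token seeder and never pay for the markup pass.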
>>>>>
>>>>> Do you want to open a jira issue for it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>> Cheers
>>>>>> Mario
>>>>>>
>>>>>>
>>>>>>> On 04 Jan 2016, at 17:06, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> No problem, I was pretty much offline myself during the Christmas holidays anyway.
>>>>>>>>
>>>>>>>> The term “overhead” is probably an exaggeration in this context, especially after I disabled the MARKUP initialisation. We earlier implemented our own XML markup annotator, tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case, so I was curious to know whether we could avoid creating them in the first place, since there seem to be quite a few. However, the overall processing required by our Ruta scripts is now small compared to other processing steps, and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher, e.g. being able to assign annotations to variables, as we discussed previously in another thread.
>>>>>>> I am working on this right now and there is finally some first progress :-)
>>>>>>>
>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>> elements) with the first attempt. If you are interested (and want to take
>>>>>>> care that I do not miss your use case), feel free to take a look at the new
>>>>>>> unit tests:
>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>>
>>>>>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>>>>>
>>>>>>>> We haven’t processed documents as large as those you mention, since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then batch process them later in smaller chunks, but this has so far not been necessary with our current data sets. As in your case, our processing bottlenecks are now in different components.
>>>>>>> Ah, that's nice to hear that Ruta is not the bottleneck :-D
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 30 Dec 2015, at 16:44, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> sorry for the delayed reply.
>>>>>>>>>
>>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>>
>>>>>>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, UIMA did not check the indexes yet and my test set did not contain markups.
>>>>>>>>>
>>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure whether they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>>>>>>
>>>>>>>>> Some background information:
>>>>>>>>>
>>>>>>>>> RutaBasics are used for three things:
>>>>>>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end at which position, and which positions are covered by which types.
>>>>>>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be, but I will improve it soon. Then, there will be no performance decrease when a pipeline is spammed with small Ruta engines.
>>>>>>>>> - a basic minimal disjoint partitioning of the document for the coverage-based visibility concept.
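
The first and third points above can be illustrated with a small plain-Java sketch. It is not Ruta's implementation: the classes Span and Basic, the method names and the quadratic fill loop are assumptions chosen for brevity. It splits the text at every annotation boundary and caches, per segment, which types cover, begin or end there, so that a PARTOF-style check becomes a set lookup instead of an index query.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of a disjoint partition with per-segment type caches. */
final class BasicPartitionSketch {

    /** A typed annotation span with offsets [begin, end). */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** One minimal segment plus the cached type information for it. */
    static final class Basic {
        final int begin;
        final int end;
        final Set<String> covering = new HashSet<>();
        final Set<String> beginning = new HashSet<>();
        final Set<String> ending = new HashSet<>();

        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Cuts at every span boundary, then fills the caches (naive O(n * m)). */
    static List<Basic> partition(List<Span> spans) {
        TreeSet<Integer> bounds = new TreeSet<>();
        for (Span s : spans) {
            bounds.add(s.begin);
            bounds.add(s.end);
        }
        List<Basic> basics = new ArrayList<>();
        Integer previous = null;
        for (Integer bound : bounds) {
            if (previous != null) {
                basics.add(new Basic(previous, bound));
            }
            previous = bound;
        }
        for (Basic b : basics) {
            for (Span s : spans) {
                if (s.begin <= b.begin && b.end <= s.end) {
                    b.covering.add(s.type);
                }
                if (s.begin == b.begin) {
                    b.beginning.add(s.type);
                }
                if (s.end == b.end) {
                    b.ending.add(s.type);
                }
            }
        }
        return basics;
    }

    /** True if every segment inside [begin, end) is covered by the given type. */
    static boolean coveredBy(List<Basic> basics, int begin, int end, String type) {
        for (Basic b : basics) {
            if (b.begin >= begin && b.end <= end && !b.covering.contains(type)) {
                return false;
            }
        }
        return true;
    }
}
```

The cost of building such a structure is paid once per document, which is consistent with the initialization overhead discussed in this thread.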
>>>>>>>>>
>>>>>>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open to other/new ideas on how to solve the challenges (and for incremental usage of internal caches).
>>>>>>>>>
>>>>>>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself already had some performance problems with the initialization and memory consumption in large CASes (PDFs of 500+ pages). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>>>>>> I got around it by removing the default seeders by specifying an empty seeders list, since we don’t need the MARKUP annotations anymore.
>>>>>>>>>>
>>>>>>>>>> I still don’t know why it created so much overhead, but it sometimes seemed to rival the POS tagger in processing time.
>>>>>>>>>>
>>>>>>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>> On 21 Dec 2015, at 16:09, Mario Juric <mario.juric.dk@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this with even very small XML documents.
>>>>>>>>>>>
>>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89), and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t gone into any deeper analysis yet, but I would first like to hear whether you have an idea what could be the main cause(s) for this?
>>>>>>>>>>>
>>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Mario
>