uima-user mailing list archives

From Peter Klügl <pklu...@uni-wuerzburg.de>
Subject Re: TextMarker language workthrough for text simplification example?
Date Mon, 19 Nov 2012 17:14:27 GMT
Ah ok, the attachment was removed.

Here's the script file:

PACKAGE com.sap.research.bd.ta;

//TYPESYSTEM ExternalTypeSystem;
ENGINE utils.Modifier;


// some helper annotations

DECLARE SimpleSentence;
//TreebankNode{FEATURE("nodeType", "S") -> MARK(S)};
ANY+{-PARTOF(SimpleSentence), -PARTOF(PERIOD) -> MARK(SimpleSentence, 1, 2)} PERIOD;

DECLARE WDT;
//TerminalTreebankNode{FEATURE("nodeType", "WDT") -> MARK(WDT)};
SW{REGEXP("which") -> MARK(WDT)};
SW{REGEXP("who") -> MARK(WDT)};

DECLARE NP;
//TreebankNode{FEATURE("nodeType", "NP") -> MARK(NP)};
"the" "Fulton" "court"{-> MARK(NP, 1, 2, 3)};
CW{REGEXP("Peter") -> MARK(NP)};

DECLARE PMComma;
//TerminalTreebankNode{FEATURE("nodeType", ",") -> MARK(PMComma)};
COMMA{ -> MARK(PMComma)};

// here start the real rules

DECLARE Head, Tail, RCAnchor, RCHead, RCBody, RCBodyEnd;

BLOCK(forEachSentence) SimpleSentence{CONTAINS(WDT)} {
     NP{ -> MARK(RCAnchor)} PMComma WDT{ -> MARK(RCHead, 2, 3)}
         ANY+{-PARTOF(PMComma) -> MARK(RCBody)}
         (PMComma{-> MARK(RCBodyEnd)} ANY+{ -> MARK(Tail)})?;
     ANY+{-PARTOF(Head) -> MARK(Head)} RCAnchor;

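     // Capture the matched text in variables; tailString starts as "."
     // so that a sentence without a Tail still gets a period.
     STRING anchorString, tailString;
     Document{-> ASSIGN(tailString, ".")};
     Tail{ -> MATCHEDTEXT(tailString)};
     RCAnchor{ -> MATCHEDTEXT(anchorString)};
     // Register the replacements; the Modifier engine applies them later.
     RCHead{ -> REPLACE(" " + tailString + " " + anchorString)};
     RCBodyEnd{ -> REPLACE(".")};
     Tail{ -> DEL};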
}
Document{ -> EXEC(Modifier)};
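
For readers who do not have TextMarker at hand, the rewrite that the block performs (V W, X Y, Z. -> V W Z. W Y.) can be approximated in plain Python. This is only a rough sketch and not part of the script above: the names KNOWN_NPS, RELATIVE_PRONOUNS and simplify are made up for illustration, the noun phrases are hard-coded just like in the helper rules, and a regular expression stands in for the annotations.

```python
import re

# Hypothetical stand-ins for the NP and WDT helper annotations in the script.
KNOWN_NPS = ["the Fulton court", "Peter"]
RELATIVE_PRONOUNS = ["which", "who"]


def simplify(sentence, nps=KNOWN_NPS):
    """Rewrite 'V W, X Y, Z.' as 'V W Z. W Y.'; return other input unchanged."""
    np_alt = "|".join(re.escape(np) for np in nps)
    pron_alt = "|".join(RELATIVE_PRONOUNS)
    pattern = re.compile(
        r"^(?P<v>.*?)(?P<w>" + np_alt + r"), (?:" + pron_alt + r") "
        r"(?P<y>[^,]*?)(?:, (?P<z>.*?))?\.$"
    )
    m = pattern.match(sentence)
    if m is None:
        return sentence
    v, w, y, z = m.group("v"), m.group("w"), m.group("y"), m.group("z")
    # Main clause: text before the anchor, the anchor, and the optional tail.
    main = v + w + (" " + z if z else "") + "."
    # The relative clause becomes its own sentence with the anchor as subject.
    relative = w[0].upper() + w[1:] + " " + y + "."
    return main + " " + relative


print(simplify("Peter, who just woke up, goes to work."))
```

Like the script, this treats the first comma after the relative pronoun as the end of the relative clause, so a sentence with a comma-separated list inside the clause would be split in the wrong place.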


On 19.11.2012 18:10, Peter Klügl wrote:
> Hi Fergal,
>
> I played around a bit and attached the resulting TextMarker project.
>
> The first part of the script is only there for creating some 
> annotations. I haven't used ClearTK for a while and was too lazy to 
> update it.
>
> The main part with the block looks at sentences with WDTs and creates 
> some annotations. The rules with REPLACE are used to remember the 
> changes and the rule with EXEC(Modifier) creates a new view with the 
> changed document.
>
> The changes are located in the view named "modified":
>
> "The jury also commented on the Fulton court, which has been under 
> fire for its practices in the appointment of appraisers."
> ...becomes...
> "The jury also commented on the Fulton court . the Fulton court has 
> been under fire for its practices in the appointment of appraisers."
> (I removed some words, because I included no correct identification of 
> relative clauses)
>
> "Peter, who just woke up, goes to work."
> ...becomes...
> "Peter goes to work. Peter  just woke up."
>
> It's only a quick and dirty hack, but I hope this helps a bit.
>
> If you have any questions, just ask :-)
>
> Best,
>
> Peter
>
> On 19.11.2012 16:04, Peter Klügl wrote:
>> I can see only one attached file: TextSimplifier.xml
>>
>> Can you send me the input file, the rules and the type systems?
>>
>> Peter
>>
>> On 19.11.2012 13:45, Monaghan, Fergal wrote:
>>>
>>> I've attached here the descriptor ("TextSimplifier.xml": 
>>> configuration for TextMarkerEngine), the test input data 
>>> ("random01.txt.xmi": Cleartk[OpenNLP] annotated), the rules file 
>>> ("rules.tm": with 1 rule, my first partial attempt at the text 
>>> simplification process) and the current output ("1.xmi": one 
>>> additional tag has been created by the rule), if this helps,
>>>
>>> Thanks again,
>>>
>>> Fergal.
>>>
>>> *From:* fergal.monaghan@sap.com
>>> *Sent:* 19 November 2012 09:56
>>> *To:* 'user@uima.apache.org'
>>> *Subject:* TextMarker language workthrough for text simplification 
>>> example?
>>>
>>> Hi all (and especially the good folks working on TextMarker in the 
>>> sandbox),
>>>
>>> 1. I am interested in implementing the type of text simplification 
>>> rules set out in this paper [1].
>>>
>>> 2. I would prefer to use TextMarker (and its language) natively in 
>>> UIMA than use the UIMA<->GATE integration and JAPE rules.
>>>
>>> 3. I have cloned TextMarker from the repo and have configured an 
>>> analysis engine descriptor to run TextMarkerEngine using custom rules.
>>>
>>> 4. I have switched off the TextMarkerEngine seed annotations as I am 
>>> testing on pre-processed XMI files that have been pre-annotated with 
>>> the Cleartk type systems (up to and including TreebankNodes... 
>>> OpenNLP used under the hood if that's of interest).
>>>
>>> 5. Things are building and unit tests running fine on simple rules. 
>>> Yay! Good work guys :)
>>>
>>> Now I am focussing on customising the rules for the text 
>>> simplification application. I have been studying the TextMarker 
>>> language documentation here [2] as well as TextMarker's unit tests 
>>> in the sandbox to get things working so far, but am now asking for 
>>> your help to complete one of the example rules I'd like to 
>>> implement. This is the example from [1]:
>>>
>>> Input (original):
>>>
>>> "The jury also commented on the Fulton court, which has been under 
>>> fire for its practices in the appointment of appraisers, guardians 
>>> and administrators."
>>>
>>> Output (simplified):
>>>
>>> "The jury also commented on the Fulton court." "The Fulton court has 
>>> been under fire for its practices in the appointment of appraisers, 
>>> guardians and administrators."
>>>
>>> Rule I want to implement in the TextMarker language:
>>>
>>> V W:NP_ant, Rel Clause(X:Rel Pr Y), Z. -> V W Z. W Y.
>>>
>>> which can be interpreted as "If a sentence consists of any text V 
>>> followed by the antecedent noun phrase W, a relative clause 
>>> (consisting of a relative pronoun X and a sequence of words Y) 
>>> enclosed in commas and a sequence of words Z, then the embedded 
>>> clause can be made into a new sentence with W as the subject NP".
>>>
>>> So far I have gotten to this in the TextMarker language (please see 
>>> below the contents of my rules.tm file that I'm running through 
>>> TextMarker). Please note this itself is not an attempt at the final 
>>> complete rule, but some intermediate attempt that is the furthest 
>>> I've been able to get on my own which still passes unit tests:
>>>
>>> ===============================================
>>>
>>> PACKAGE org.cleartk.syntax.constituent.type;
>>>
>>> (TreebankNode{FEATURE("nodeType","NP")} 
>>> TerminalTreebankNode{FEATURE("nodeType",",")} 
>>> TerminalTreebankNode{FEATURE("nodeType","WDT")} 
>>> TreebankNode{FEATURE("nodeType","S")}){->MARK(com.sap.research.bd.ta.AdjectivalOrRelativeClause)};
>>>
>>> ===============================================
>>>
>>> Can someone complete this rule to get me closer to the example 
>>> above? I lack understanding of the TextMarker language, but I feel 
>>> that if I had an example of a rule slightly more complex than those 
>>> in the unit tests/documentation, I would be able to work it out 
>>> for the rest of the rules I want to implement.
>>>
>>> Thanks very much for reading, and for any help you can provide,
>>>
>>> *Fergal Monaghan*
>>> B.E., Ph.D.   |   Research Specialist   |   SAP Research
>>> *SAP (UK) Limited*   |   The Concourse   |   Queen's Road   | 
>>> Belfast BT3 9DT
>>>
>>> T: +44 (0)28 9078-5705   |   M:   +44 (0)79 2076-6281   | F: +44 
>>> (0)28 9078-5777
>>>
>>> mailto:fergal.monaghan@sap.com | www.sap.com/research
>>>
>>> [1] http://homepages.abdn.ac.uk/advaith/pages/LEC02.pdf
>>>
>>> [2] http://tmwiki.informatik.uni-wuerzburg.de/Wiki.jsp?page=Introduction
>>>
>>
>>
>

