Hi,
because of the known problem that you cannot serialize the CAS if it
contains non-XML characters, I tried the code below.
I know it does not work, because of this line:
cas = doReplaceNonXml(cas.toString()).toCas;
There is no .toCas method.
Does anyone know how I can solve this?

@Override
public void process(final JCas cas) throws AnalysisEngineProcessException {
    JCas oldcas = cas;
    // Does not compile: String has no .toCas, and cas is declared final
    cas = doReplaceNonXml(cas.toString()).toCas;
    try {
        final String xmlContent = this.serializeCas(cas);
        final Map<String, String> metadataFields = this.extractMetadata(xmlContent);
        // Do something with metadataFields
        cas = oldcas;
    } catch (SAXException | IOException | ParserConfigurationException e) {
        throw new AnalysisEngineProcessException(e);
    }
}

private String doReplaceNonXml(String aString) {
    char[] buf = aString.toCharArray();
    int pos = XMLUtils.checkForNonXmlCharacters(buf, 0, buf.length, false);
    if (pos == -1) {
        return aString;
    }
    while (pos != -1) {
        buf[pos] = ' ';
        pos = XMLUtils.checkForNonXmlCharacters(buf, pos, buf.length - pos, false);
    }
    return String.valueOf(buf);
}

private String serializeCas(final JCas cas) throws SAXException, IOException {
    // TODO: think about buffering and performance
    final ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
    final XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
    try {
        ser.serialize(cas.getCas(), (new XMLSerializer(out, false)).getContentHandler());
    } finally {
        out.close();
    }
    // Decode explicitly as UTF-8 instead of the platform default, assuming the
    // XMLSerializer writes UTF-8; extractMetadata re-encodes the string as UTF-8
    return out.toString("UTF-8");
}

private Map<String, String> extractMetadata(final String xmlContent)
        throws SAXException, IOException, ParserConfigurationException {
    final Map<String, String> resultMap = new HashMap<String, String>();
    // parse the xmlContent string into a DOM document
    final DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    final DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    InputStream stream = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8));
    final Document doc = dBuilder.parse(stream);
    // get the metadata field nodes
    final NodeList nl = doc.getElementsByTagName("oze:MetaField");
    if (nl == null) {
        return resultMap;
    }

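To show the replacement step in isolation, here is a small sketch that does not depend on UIMA. It assumes the XML 1.0 character ranges and swaps every invalid character for a space, so the text length (and with it any annotation offsets) stays the same. The class name XmlCharSanitizer and its methods are made up for this illustration and are not part of the UIMA API.

```java
// Hypothetical helper, not part of UIMA: replaces characters that are not
// allowed in XML 1.0 with a space, keeping the string length unchanged.
final class XmlCharSanitizer {

    private XmlCharSanitizer() {
    }

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String replaceNonXml(String text) {
        char[] buf = text.toCharArray();
        int i = 0;
        while (i < buf.length) {
            // Reads a full surrogate pair as one code point; a lone
            // surrogate is returned as itself and is rejected below.
            int cp = Character.codePointAt(buf, i);
            if (!isXml10Char(cp)) {
                // Only single-char code points can be invalid here, because
                // every supplementary code point is allowed in XML 1.0.
                buf[i] = ' ';
            }
            i += Character.charCount(cp);
        }
        return new String(buf);
    }
}
```

This only cleans a plain String. It does not answer how to get the cleaned text back into the CAS, which is the part I am still stuck on.
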
Best regards!