uima-user mailing list archives

From Henrik Matzen <huric...@googlemail.com>
Subject Serialization NonXML
Date Tue, 05 Jul 2016 10:02:45 GMT
Hi,

Because of the known problem that you cannot serialize the CAS if it
contains non-XML characters, I tried the code below.

I know it does not work because of this line:
cas = doReplaceNonXml(cas.toString()).toCas;
There is no .toCas method.

Does anyone know how I can solve this?

    @Override
    public void process(final JCas cas) throws AnalysisEngineProcessException {
        JCas oldcas = cas;
        // Does not compile: String has no toCas member (this is the open question)
        cas = doReplaceNonXml(cas.toString()).toCas;
        try {
            final String xmlContent = this.serializeCas(cas);
            final Map<String, String> metadataFields = this.extractMetadata(xmlContent);

            // Do something with metadataFields
            cas = oldcas;
        } catch (SAXException e) {
            throw new AnalysisEngineProcessException(e);
        } catch (IOException e) {
            throw new AnalysisEngineProcessException(e);
        } catch (ParserConfigurationException e) {
            throw new AnalysisEngineProcessException(e);
        }
    }


    private String doReplaceNonXml(String aString)
    {
        char[] buf = aString.toCharArray();
        int pos = XMLUtils.checkForNonXmlCharacters(buf, 0, buf.length, false);

        if (pos == -1) {
            return aString;
        }

        while (pos != -1) {
            buf[pos] = ' ';
            pos = XMLUtils.checkForNonXmlCharacters(buf, pos, buf.length - pos, false);
        }
        return String.valueOf(buf);
    }

    private String serializeCas(final JCas cas) throws SAXException, IOException {
        // TODO: think about buffering and performance
        final ByteArrayOutputStream out = new ByteArrayOutputStream(1024);
        final XmiCasSerializer ser = new XmiCasSerializer(cas.getTypeSystem());
        try {
            ser.serialize(cas.getCas(), (new XMLSerializer(out, false)).getContentHandler());
        } finally {
            out.close();
        }
        // Decode as UTF-8 (assuming the serializer's default output encoding)
        // so the result matches the UTF-8 bytes re-created in extractMetadata
        return out.toString("UTF-8");
    }

    private Map<String, String> extractMetadata(final String xmlContent)
            throws SAXException, IOException, ParserConfigurationException {

        final Map<String, String> resultMap = new HashMap<String, String>();

        // parse the xmlContent string into a DOM document
        final DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        final DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        InputStream stream = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8));
        final Document doc = dBuilder.parse(stream);

        // get meta data field node
        final NodeList nl = doc.getElementsByTagName("oze:MetaField");
        if (nl == null) {
            return resultMap;
        }
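
For reference, the replacement idea in doReplaceNonXml can be sketched without UIMA's XMLUtils, using only the character ranges that XML 1.0 allows. This is a minimal sketch, not the UIMA implementation: the names XmlCharSanitizer, isValidXml10 and replaceNonXml are made up for illustration, and it works on plain strings only, so it would have to be applied to text before that text is put into the CAS (an assumption about where such a fix would go, not a confirmed solution).

```java
class XmlCharSanitizer {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 forbids with a single space. */
    static String replaceNonXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as itself and is invalid in XML
            int cp = text.codePointAt(i);
            if (isValidXml10(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(' ');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The NUL character between 'a' and 'b' is replaced by a space
        System.out.println(replaceNonXml("a\u0000b"));
    }
}
```

Every forbidden code point occupies exactly one char, so the result has the same length as the input and existing annotation offsets would still line up with the cleaned text.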

Best regards!
