uima-user mailing list archives

From Greg Holmberg <holmberg2...@comcast.net>
Subject XMI parsing?
Date Wed, 20 Jan 2010 22:11:16 GMT


Hi UIMA users! 



I'm looking for advice on how to transmit data from a CAS to a non-UIMA recipient.



I'd like to send data from a CAS over the network to a repository.  I can write any Java
code I want to run in the repository server to receive the data and insert it into the repository
indexes.  And no, the repository is not a SQL database, and there is no JDBC driver for it.




I'm thinking the easiest data format to transmit from the CAS would be XMI.  I can just use
the UIMA serialization methods to produce an XMI XML String, and then send that as a payload
over whatever transport I want (RMI, HTTP, FTP, JSON, SOAP, whatever).



But then how would the repository server parse the XMI XML that it receives?  Obviously,
I could just use the UIMA de-serialization to re-constitute the CAS, but that's a lot of overhead
(time and memory) considering I don't actually need to run UIMA in the repository, and I just
want to get the data values from the XMI and insert some records/objects in the repository
index. 



Can I parse the XMI XML from UIMA without using UIMA? 



For example, is there an XSD file for XMI?  Or at least, for the UIMA "flavor" of XMI?  If
so, I could feed the XSD file to JAXB to generate equivalent Java classes, then JAXB would
parse and validate the XMI, producing Java objects. 



I suppose I could also parse the XMI with the XML StAX parser built into Java 6, and just
bypass the creation of Java objects (directly inserting into the repository).  More work,
but might use less memory and perform better. 
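
To make the StAX option concrete, here is a minimal sketch that streams through an XMI string with javax.xml.stream and collects every element carrying begin/end attributes, with no UIMA classes involved. The sample document, the ex:Person type and its namespace URI are hand-written assumptions in the general shape of UIMA XMI output, not output captured from a real CAS; in a real receiver the list would presumably be replaced by inserts into the repository.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmiStaxSketch {

    /**
     * Streams through the XMI text and returns one "Type[begin,end]" entry
     * for each element that has both a begin and an end attribute.
     */
    static List<String> extractSpans(String xmi) {
        List<String> spans = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xmi));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    // A null namespace URI means "match on local name only".
                    String begin = reader.getAttributeValue(null, "begin");
                    String end = reader.getAttributeValue(null, "end");
                    if (begin != null && end != null) {
                        spans.add(reader.getLocalName() + "[" + begin + "," + end + "]");
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XMI payload", e);
        }
        return spans;
    }

    public static void main(String[] args) {
        // Hand-written sample; ex:Person and its namespace are hypothetical.
        String sample =
              "<xmi:XMI xmlns:xmi='http://www.omg.org/XMI'"
            + " xmlns:cas='http:///uima/cas.ecore'"
            + " xmlns:ex='http:///org/example.ecore' xmi:version='2.0'>"
            + "<cas:Sofa xmi:id='1' sofaNum='1' sofaID='_InitialView'"
            + " mimeType='text' sofaString='Greg wrote Greg'/>"
            + "<ex:Person xmi:id='12' sofa='1' begin='0' end='4'/>"
            + "<ex:Person xmi:id='13' sofa='1' begin='11' end='15'/>"
            + "</xmi:XMI>";
        for (String span : extractSpans(sample)) {
            System.out.println(span);
        }
    }
}
```

Because the reader only ever holds the current event, memory use should stay roughly flat regardless of document size, which is the main attraction over building JAXB objects or a full CAS.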



Or, instead of XMI, I could walk the CAS myself, and invent some data format (JSON? SOAP?
RMI?) to send to the repository.  This could be binary to lessen the data on the network
and ease the unmarshalling on the other end.  Performance and network bandwidth are an issue
for me, since this has to scale (there will be many clients sending CAS data to the repository).
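
As a rough illustration of the "invent my own binary format" option, the sketch below packs a list of spans into a byte array with DataOutputStream and unpacks it again with DataInputStream. The Span class and the wire layout (a count, then type name, begin and end per span) are hypothetical choices made for this example; the code that would walk a real CAS to fill the list is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanCodec {

    /** Hypothetical stand-in for the fields copied out of one CAS annotation. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Client side: a 4-byte count, then (UTF type name, int begin, int end) per span. */
    static byte[] encode(List<Span> spans) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(spans.size());
            for (Span span : spans) {
                out.writeUTF(span.type);
                out.writeInt(span.begin);
                out.writeInt(span.end);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Repository side: reads the fields back in the same order they were written. */
    static List<Span> decode(byte[] payload) {
        List<Span> spans = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String type = in.readUTF();
                int begin = in.readInt();
                int end = in.readInt();
                spans.add(new Span(type, begin, end));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(new Span("Person", 0, 4), new Span("Person", 11, 15));
        byte[] payload = encode(spans);
        System.out.println(payload.length + " bytes for " + decode(payload).size() + " spans");
    }
}
```

The cost of a layout like this is that both ends must agree on the field order and change in lockstep. Repeating the type name in every record is also wasteful; a small table of type names sent once per message would likely shrink the payload further.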




I seem to remember that the serialization of the CAS between Java and C++ uses a fast binary
format.  Would that be a possibility here?  Could I read that without re-constituting the
CAS in the repository? 



What are your thoughts on these options? 



Thanks, 





Greg Holmberg 

