hadoop-common-user mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Can anyone recommend me a inter-language data file format?
Date Mon, 03 Nov 2008 11:07:30 GMT
Zhou, Yunqing wrote:
> embedded database cannot handle large-scale data, not very efficient
> I have about 1 billion records.
> these records should be passed through some modules.
> I mean a data exchange format similar to XML but more flexible and
> efficient.

erlang-style records (name,value,value,value)
RDF-triples in non-XML representations

For all of these, you need to test with data that includes high Unicode
characters and both single and double quotes, to see how well they get
handled.
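
A rough sketch of what such a test could look like, in Python. XML via ElementTree is only a stand-in for whichever format you pick; the sample strings and the round_trip helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample values covering quotes, markup characters and
# characters outside the Basic Multilingual Plane.
TRICKY = [
    "plain ascii",
    "single ' and double \" quotes",
    "<tag> & ampersand",
    "high unicode: \U0001F600 \u4e2d\u6587",
]

def round_trip(value):
    """Serialize one value to UTF-8 XML bytes and parse it back."""
    elem = ET.Element("entry")
    elem.text = value
    data = ET.tostring(elem, encoding="utf-8")
    return ET.fromstring(data).text

# Any value that does not survive the round trip ends up in this list.
failures = [v for v in TRICKY if round_trip(v) != v]
```

If failures comes back non-empty, that format/library combination is mangling something and is worth a closer look before you push a billion records through it.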

You can actually append to XML by leaving out the opening and closing
tags of the root element and just streaming the entries out to the tail
of the file.
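
Something along these lines, as a sketch only. The entry/name/value element names and the append_entry helper are my own invention here, not a fixed schema; the point is just that each call writes one complete element and no root tags.

```python
import os
import tempfile
from xml.sax.saxutils import escape

def append_entry(path, fields):
    """Append one <entry> element to the tail of the file; no root element is written."""
    body = "".join("<%s>%s</%s>" % (name, escape(value), name)
                   for name, value in fields.items())
    # Mode "a" always writes at the end, so earlier entries are left untouched.
    with open(path, "a", encoding="utf-8") as f:
        f.write("<entry>%s</entry>\n" % body)

# Demo: two appends to a scratch file (the path is hypothetical).
log_path = os.path.join(tempfile.mkdtemp(), "log.xml")
append_entry(log_path, {"name": "a & b", "value": "1 < 2"})
append_entry(log_path, {"name": "c", "value": "3"})
with open(log_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

The resulting file is not a well-formed XML document on its own, since it has several top-level elements, which is why it needs the wrapper trick to be read by a parser.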

To read this in an XML parser, include it inside another XML file:

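<?xml version="1.0"?>
<!DOCTYPE log [
      <!ENTITY log SYSTEM "log.xml">
]>
<!-- root element name assumed to match the DOCTYPE; &log; pulls in the entries -->
<log>&log;</log>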

I've done this for very big files. As long as you aren't trying to load
the whole thing into an in-memory DOM, things should work.
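
For what it's worth, here is a sketch of the streaming side using Python's SAX parser instead of a DOM. The file names, the entry element and the EntryCounter handler are assumptions for the example. Note that recent Python versions leave external general entities switched off by default, so the feature has to be enabled explicitly, and you should only do that for files you trust.

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical wrapper file; "log.xml" is resolved relative to the wrapper.
WRAPPER = """<?xml version="1.0"?>
<!DOCTYPE log [
      <!ENTITY log SYSTEM "log.xml">
]>
<log>&log;</log>
"""

class EntryCounter(ContentHandler):
    """Counts <entry> elements as they stream past; no tree is built."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "entry":
            self.count += 1

def count_entries(wrapper_path):
    parser = xml.sax.make_parser()
    # Off by default for security reasons; enable only for trusted input.
    parser.setFeature(feature_external_ges, True)
    handler = EntryCounter()
    parser.setContentHandler(handler)
    parser.parse(wrapper_path)
    return handler.count

# Demo: a rootless log with three entries, plus the wrapper next to it.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "log.xml"), "w", encoding="utf-8") as f:
    for i in range(3):
        f.write("<entry><id>%d</id></entry>\n" % i)
wrapper_path = os.path.join(workdir, "wrapper.xml")
with open(wrapper_path, "w", encoding="utf-8") as f:
    f.write(WRAPPER)
entry_count = count_entries(wrapper_path)
```

The handler sees one event at a time, so memory use stays flat no matter how long the log grows.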

Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
