hadoop-common-user mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: How to make a lucene Document hadoop Writable?
Date Wed, 28 May 2008 03:11:16 GMT
You can get the bytes using those methods and write them to a
DataOutput.  You would probably also want to write an int before them in
the stream giving the number of bytes for the object, so that the reader
knows how much to read back.  If you want to avoid the Java
serialization process and translate an object to bytes yourself, that is
a little harder.
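
A minimal sketch of the length-prefixed approach described above, using only DataInput and DataOutput.  The class name SerializableIO and both method names are made up for illustration and are not part of Hadoop or Lucene; the sketch assumes the object (for example a Lucene Document) implements java.io.Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper, not a Hadoop or Lucene class.
public class SerializableIO {

    /** Writes obj as [int length][Java-serialized bytes]. */
    public static void writeSerializable(DataOutput out, Serializable obj)
            throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(obj);
        } // closing oos flushes everything into baos
        byte[] bytes = baos.toByteArray();
        out.writeInt(bytes.length); // length prefix
        out.write(bytes);
    }

    /** Reads back an object written by writeSerializable. */
    public static Object readSerializable(DataInput in)
            throws IOException, ClassNotFoundException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes); // read exactly the prefixed number of bytes
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeSerializable(out, "hello");
        out.writeInt(42); // data after the object is left untouched

        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(readSerializable(in)); // prints "hello"
        System.out.println(in.readInt());         // prints 42
    }
}
```

A Writable wrapper could call writeSerializable from write(DataOutput) and readSerializable from readFields(DataInput).  Since readFields only declares IOException, the ClassNotFoundException would have to be caught and rethrown as an IOException there.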

Doing it involves using reflection to walk the fields of an object
recursively and translate each field into its byte equivalent.
It just so happens that I have that functionality already developed.  We
are going to use it in Nutch 2 to make it easy to create complex
writables.  Let me know if you would like the code and I will send it to
you.
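
For readers who want a feel for the reflection approach, here is a rough sketch.  It is not the Nutch code mentioned above, which is not shown in this thread; the class ReflectiveIO and the demo types Point and Labeled are invented for illustration.  The sketch assumes fields are limited to int, long, double, boolean, String, and nested user classes with a no-argument constructor, that a field's runtime type equals its declared type, and that the object graph has no cycles.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the Nutch implementation.
public class ReflectiveIO {
    // Demo types, invented for this example.
    public static class Point { public int x; public int y; }
    public static class Labeled {
        public String label; public long id; public Point origin;
    }

    /** Instance fields of type and its superclasses, sorted by name per
     *  class, because getDeclaredFields() guarantees no particular order. */
    static List<Field> fieldsOf(Class<?> type) {
        List<Field> result = new ArrayList<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            List<Field> declared = new ArrayList<>();
            for (Field f : c.getDeclaredFields()) {
                int mods = f.getModifiers();
                if (!Modifier.isStatic(mods) && !Modifier.isTransient(mods)
                        && !f.isSynthetic()) {
                    f.setAccessible(true);
                    declared.add(f);
                }
            }
            declared.sort(Comparator.comparing(Field::getName));
            result.addAll(declared);
        }
        return result;
    }

    /** Writes each field of obj, recursing into nested objects. */
    public static void write(Object obj, DataOutput out)
            throws IOException, ReflectiveOperationException {
        for (Field f : fieldsOf(obj.getClass())) {
            Class<?> t = f.getType();
            if (t == int.class) out.writeInt(f.getInt(obj));
            else if (t == long.class) out.writeLong(f.getLong(obj));
            else if (t == double.class) out.writeDouble(f.getDouble(obj));
            else if (t == boolean.class) out.writeBoolean(f.getBoolean(obj));
            else {
                Object value = f.get(obj);
                out.writeBoolean(value != null); // null marker
                if (value == null) continue;
                if (t == String.class) out.writeUTF((String) value);
                else write(value, out);          // nested object
            }
        }
    }

    /** Fills the fields of obj in the same order that write() used. */
    public static void read(Object obj, DataInput in)
            throws IOException, ReflectiveOperationException {
        for (Field f : fieldsOf(obj.getClass())) {
            Class<?> t = f.getType();
            if (t == int.class) f.setInt(obj, in.readInt());
            else if (t == long.class) f.setLong(obj, in.readLong());
            else if (t == double.class) f.setDouble(obj, in.readDouble());
            else if (t == boolean.class) f.setBoolean(obj, in.readBoolean());
            else if (!in.readBoolean()) f.set(obj, null);
            else if (t == String.class) f.set(obj, in.readUTF());
            else {
                Constructor<?> ctor = t.getDeclaredConstructor();
                ctor.setAccessible(true);
                Object nested = ctor.newInstance();
                read(nested, in);
                f.set(obj, nested);
            }
        }
    }
}
```

The read side has to visit fields in exactly the order the write side did, which is why the sketch sorts them by name rather than relying on reflection order.  Arrays, collections, and JDK types other than String are left out to keep the example short.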

Also, I spoke too soon about the serialization / deserialization process.
Reading a document from a Lucene index also loses the fields that are
not stored, so the loss may have nothing to do with the serialization
process.


Jim the Standing Bear wrote:
> Thanks for the quick response, Dennis.  However, your code snippet was
> about how to serialize/deserialize using
> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
> making the question clear enough - I was wondering if and how I can
> serialize/deserialize using only DataInput and DataOutput.
> This is because the Writable Interface defined by Hadoop has the
> following two methods:
> void 	readFields(DataInput in)
>           Deserialize the fields of this object from in.
> void 	write(DataOutput out)
>           Serialize the fields of this object to out
> so I must start with DataInput and DataOutput, and work my way to
> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
> to go from DataInput to ObjectInputStream.  Any ideas?
> -- Jim
> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <kubes@apache.org> wrote:
>> You can use something like the code below to go back and forth from
>> serializables.  The problem with lucene documents is that fields which are
>> not stored will be lost during the serialization / deserialization process.
>> Dennis
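>> import java.io.ByteArrayInputStream;
>> import java.io.ByteArrayOutputStream;
>> import java.io.IOException;
>> import java.io.ObjectInputStream;
>> import java.io.ObjectOutputStream;
>> import java.io.Serializable;
>>
>> // Deserialize an object from bytes, starting at offset start.
>> public static Object toObject(byte[] bytes, int start)
>>  throws IOException, ClassNotFoundException {
>>  if (bytes == null || bytes.length == 0
>>      || start < 0 || start >= bytes.length) {
>>    return null;
>>  }
>>  // Read from the offset directly instead of calling skip().
>>  ByteArrayInputStream bais =
>>    new ByteArrayInputStream(bytes, start, bytes.length - start);
>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>  try {
>>    return ois.readObject();
>>  } finally {
>>    ois.close();  // also closes the underlying stream
>>  }
>> }
>>
>> // Serialize an object to a byte array using Java serialization.
>> public static byte[] fromObject(Serializable toBytes)
>>  throws IOException {
>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>  try {
>>    oos.writeObject(toBytes);
>>  } finally {
>>    oos.close();  // flushes buffered data into baos
>>  }
>>  return baos.toByteArray();
>> }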
>> Jim the Standing Bear wrote:
>>> Hello,
>>> I am not sure if this is a genuine hadoop question or more towards a
>>> core-java question.  I am hoping to create a wrapper over Lucene
>>> Document, so that this wrapper can be used for the value field of a
>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>> the Writable interface.
>>> Lucene's Document is already made serializable, which is quite nice.
>>> However, the Writable interface definition gives only DataInput and
>>> DataOutput, and I am having a hard time trying to figure out how to
>>> serialize/deserialize a Lucene Document object using
>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>> -- Jim
