hadoop-common-user mailing list archives

From "Jim the Standing Bear" <standingb...@gmail.com>
Subject Re: How to make a lucene Document hadoop Writable?
Date Wed, 28 May 2008 03:16:33 GMT
I am replying to myself because I just found something interesting in
Nutch, yet it raises more questions.

In the Nutch 0.9 source code, in org.apache.nutch.indexer.Indexer (Indexer.java),
there is a line that says:

output.collect(key, new ObjectWritable(doc));

where doc is a Lucene Document object.  This appears to wrap the
Document in a Hadoop ObjectWritable object.

However, in Hadoop's (v0.17.0) ObjectWritable.java, I found the following lines:

    } else if (Writable.class.isAssignableFrom(declaredClass)) { // Writable
      UTF8.writeString(out, instance.getClass().getName());
      ((Writable)instance).write(out);

    } else {
      throw new IOException("Can't write: "+instance+" as "+declaredClass);
    }

where instance is the Object passed to the constructor, and
declaredClass is the class of that object.  But I am a bit suspicious
of the check and wonder how it could ever be true for a Document:

Writable.class.isAssignableFrom(Document.class)

Is it because Nutch 0.9 uses older versions of both Hadoop and
Lucene?  I am really confused.  Thanks.
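
To make the check concrete, here is a minimal sketch of what isAssignableFrom reports for a class that is Serializable but not Writable. The Writable, PlainDocument, and WritableDocument types below are hypothetical stand-ins so that the example compiles without Hadoop or Lucene on the classpath; they are not the real org.apache.hadoop.io.Writable or Lucene Document.

```java
import java.io.Serializable;

// Hypothetical stand-ins (assumptions, not the real Hadoop/Lucene types).
interface Writable { }                            // stands in for Hadoop's Writable
class PlainDocument implements Serializable { }   // Serializable but NOT Writable
class WritableDocument implements Writable { }    // a wrapper that does implement Writable

final class AssignableCheck {
    /** Same test as the quoted branch: true only if declaredClass implements Writable. */
    static boolean takesWritableBranch(Class<?> declaredClass) {
        return Writable.class.isAssignableFrom(declaredClass);
    }

    public static void main(String[] args) {
        System.out.println(takesWritableBranch(PlainDocument.class));    // prints "false"
        System.out.println(takesWritableBranch(WritableDocument.class)); // prints "true"
    }
}
```

Under these assumptions, a class that does not implement the interface never takes the Writable branch, so the quoted code would reach the final else and throw if write were actually called on such an instance.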

-- Jim




On Tue, May 27, 2008 at 11:02 PM, Jim the Standing Bear
<standingbear@gmail.com> wrote:
> Thanks for the quick response, Dennis.  However, your code snippet was
> about how to serialize/deserialize using
> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
> making the question clear enough - I was wondering if and how I can
> serialize/deserialize using only DataInput and DataOutput.
>
> This is because the Writable Interface defined by Hadoop has the
> following two methods:
>
> void    readFields(DataInput in)
>          Deserialize the fields of this object from in.
> void    write(DataOutput out)
>          Serialize the fields of this object to out
>
> so I must start with DataInput and DataOutput, and work my way to
> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
> to go from DataInput to ObjectInputStream.  Any ideas?
>
> -- Jim
>
>
>
>
> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <kubes@apache.org> wrote:
>> You can use something like the code below to go back and forth from
>> serializables.  The problem with lucene documents is that fields which are
>> not stored will be lost during the serialization / deserialization process.
>>
>> Dennis
>>
>> // Both helpers use classes from java.io.
>> public static Object toObject(byte[] bytes, int start)
>>  throws IOException, ClassNotFoundException {
>>
>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>    return null;
>>  }
>>
>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>  bais.skip(start); // begin reading at offset start
>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>
>>  try {
>>    return ois.readObject();
>>  } finally {
>>    ois.close(); // also closes the wrapped bais
>>  }
>> }
>>
>> public static byte[] fromObject(Serializable toBytes)
>>  throws IOException {
>>
>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>
>>  try {
>>    oos.writeObject(toBytes);
>>    oos.flush(); // push buffered object data into baos
>>  } finally {
>>    oos.close(); // also closes the wrapped baos
>>  }
>>
>>  return baos.toByteArray();
>> }
>>
>>
>> Jim the Standing Bear wrote:
>>>
>>> Hello,
>>>
>>> I am not sure if this is a genuine hadoop question or more towards a
>>> core-java question.  I am hoping to create a wrapper over Lucene
>>> Document, so that this wrapper can be used for the value field of a
>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>> the Writable interface.
>>>
>>> Lucene's Document is already made serializable, which is quite nice.
>>> However, the Writable interface definition gives only DataInput and
>>> DataOutput, and I am having a hard time trying to figure out how to
>>> serialize/deserialize a Lucene Document object using
>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>
>>> -- Jim
>>
>
>
>
> --
> --------------------------------------
> Standing Bear Has Spoken
> --------------------------------------
>
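
On the quoted question of getting from DataInput to ObjectInputStream: one possible approach, offered here as a sketch rather than anything confirmed in this thread, is to serialize the object into a byte array first and then write that array with a length prefix. The class name SerializableBridge and its methods are hypothetical; write and readFields only mirror the signatures of the Writable interface quoted above, and the example uses a String payload instead of a Lucene Document so that it runs with java.io alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical wrapper: carries a Serializable payload through DataInput/DataOutput. */
final class SerializableBridge {
    private Serializable payload;

    SerializableBridge() { }
    SerializableBridge(Serializable payload) { this.payload = payload; }

    Serializable get() { return payload; }

    /** Serialize to a temporary buffer, then emit it as: int length, raw bytes. */
    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(payload);
        }
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Read the length prefix, pull exactly that many bytes, then deserialize them. */
    public void readFields(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            payload = (Serializable) ois.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException("Unknown payload class", e);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        new SerializableBridge("hello").write(new DataOutputStream(sink));

        SerializableBridge copy = new SerializableBridge();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(sink.toByteArray())));
        System.out.println(copy.get()); // prints "hello"
    }
}
```

The length prefix is the design choice that matters here: DataInput has no end-of-record marker, so the reader needs to know how many bytes belong to this object before handing them to ObjectInputStream. The caveat Dennis raised still applies to anything serialized this way.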



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
