hadoop-common-user mailing list archives

From "Jim the Standing Bear" <standingb...@gmail.com>
Subject Re: How to make a lucene Document hadoop Writable?
Date Wed, 28 May 2008 03:25:22 GMT
Hi Dennis,

Now I see the picture.  I would love to see the code you have for
creating complex writables - thanks for sharing it!

Since I only started looking at Lucene the other day, I may once
again be misunderstanding what you meant by
"serialization/deserialization of lucene document will lose its fields
that are not stored".  So if I do

        Document document = new Document();
        document.add(Field.Text("author", author));
        document.add(Field.Text("title", title));
        document.add(Field.Text("topic", topic));

and then serialize the document to a file or something, the fields will
not be serialized?  That seems a bit odd, since the Field class also
implements the Serializable interface.

-- Jim
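
For reference, here is a minimal sketch of the approach Dennis describes in
the quoted message below: serialize the object to a byte array, write an int
byte count to the DataOutput, then write the bytes. Reading reverses the
steps. The class name SerializableWrapper is hypothetical and is not part of
Hadoop or Lucene. To stay self-contained it uses only java.io and does not
declare "implements Writable", although its two methods have the same
signatures as the Writable methods quoted in this thread. It should work for
any Serializable value, which the thread says Lucene's Document is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical wrapper (an assumption for illustration). In a Hadoop job it
// would additionally declare "implements org.apache.hadoop.io.Writable".
class SerializableWrapper<T extends Serializable> {
    private T value;

    // No-argument constructor so an empty instance can be filled by readFields.
    SerializableWrapper() {}

    SerializableWrapper(T value) { this.value = value; }

    T get() { return value; }

    // Serialize the value with Java serialization, then write the byte
    // count followed by the bytes themselves.
    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(value);
        }
        byte[] bytes = baos.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read the byte count, read exactly that many bytes, then deserialize.
    @SuppressWarnings("unchecked")
    public void readFields(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            value = (T) ois.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException("Class of serialized object not found", e);
        }
    }
}
```

The length prefix is what lets readFields pull the right number of bytes out
of a DataInput that may hold many records back to back. Note that this only
bridges DataInput/DataOutput to Java serialization; it does not settle
whether unstored Lucene fields survive the round trip.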

On Tue, May 27, 2008 at 11:11 PM, Dennis Kubes <kubes@apache.org> wrote:
> You can get the bytes using those methods and write them to a data output.
>  You would probably also want to write an int before them in the stream to
> give the number of bytes for the object.  If you want to avoid the Java
> serialization process and translate an object to bytes yourself, that is a
> little harder.
> To do it involves using reflection to get the fields of an object
> recursively and translate those fields into their byte equivalents. Just so
> happens that I have that functionality already developed.  We are going to
> use it in nutch 2 to make it easy to create complex writables.  Let me know
> if you would like the code and I will send it to you.
> Also I spoke too soon about the serialization / deserialization process.
>  Reading a document from a Lucene index will also lose the fields that are
> not stored so it may have nothing to do with the serialization process.
> Dennis
> Jim the Standing Bear wrote:
>> Thanks for the quick response, Dennis.  However, your code snippet was
>> about how to serialize/deserialize using
>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>> making the question clear enough - I was wondering if and how I can
>> serialize/deserialize using only DataInput and DataOutput.
>> This is because the Writable Interface defined by Hadoop has the
>> following two methods:
>> void    readFields(DataInput in)
>>          Deserialize the fields of this object from in.
>> void    write(DataOutput out)
>>          Serialize the fields of this object to out
>> so I must start with DataInput and DataOutput, and work my way to
>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>> to go from DataInput to ObjectInputStream.  Any ideas?
>> -- Jim
>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <kubes@apache.org> wrote:
>>> You can use something like the code below to go back and forth from
>>> serializables.  The problem with lucene documents is that fields which
>>> are
>>> not stored will be lost during the serialization / deserialization
>>> process.
>>> Dennis
>>> public static Object toObject(byte[] bytes, int start)
>>>  throws IOException, ClassNotFoundException {
>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>   return null;
>>>  }
>>>  ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
>>>  bais.skip(start);
>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>  Object bObject = ois.readObject();
>>>  bais.close();
>>>  ois.close();
>>>  return bObject;
>>> }
>>> public static byte[] fromObject(Serializable toBytes)
>>>  throws IOException {
>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>  oos.writeObject(toBytes);
>>>  oos.flush();
>>>  byte[] objBytes = baos.toByteArray();
>>>  baos.close();
>>>  oos.close();
>>>  return objBytes;
>>> }
>>> Jim the Standing Bear wrote:
>>>> Hello,
>>>> I am not sure if this is a genuine hadoop question or more towards a
>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>> Document, so that this wrapper can be used for the value field of a
>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>> the Writable interface.
>>>> Lucene's Document is already made serializable, which is quite nice.
>>>> However, the Writable interface definition gives only DataInput and
>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>> serialize/deserialize a Lucene Document object using
>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>> -- Jim

Standing Bear Has Spoken
