hadoop-common-user mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: How to make a lucene Document hadoop Writable?
Date Wed, 28 May 2008 03:29:00 GMT
When reading docs from a processed Lucene index, say off disk, any 
fields that are not stored are not repopulated and will not appear in 
the document's fields.  You also won't be able to read a document and 
then pass it to another IndexWriter (say, if you wanted to write an 
index splitter) and have it keep the unstored fields.
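
To make that concrete, here is a minimal sketch (Lucene 2.x-era API; 
the index path and field names are made up) showing that only stored 
fields survive the round trip through an index:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class StoredFieldDemo {
  public static void main(String[] args) throws Exception {
    // write one document with a stored and an unstored field
    IndexWriter writer =
      new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("title", "stored text",
      Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("body", "unstored text",
      Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    // read it back: the unstored field is not repopulated
    IndexReader reader = IndexReader.open("/tmp/demo-index");
    Document fromIndex = reader.document(0);
    System.out.println(fromIndex.get("title"));  // "stored text"
    System.out.println(fromIndex.get("body"));   // null
    reader.close();
  }
}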

Dennis

Jim the Standing Bear wrote:
> Hi Dennis,
> 
> Now I see the picture.  I would love to see the code you have for
> creating complex writables - thanks for sharing it!
> 
> Since I just started to look at lucene the other day, I may once
> again be misunderstanding what you were saying by
> "serialization/deserialization of a lucene document will lose its
> fields that are not stored".  So if I do
> 
> Document document = new Document();
> document.add(Field.Text("author", author));
> document.add(Field.Text("title", title));
> document.add(Field.Text("topic", topic));
> 
> and then serialize document to a file or something, the fields will
> not be serialized?  It seems a bit odd, since the Field class also
> implements the Serializable interface.
> 
> -- Jim
> 
> On Tue, May 27, 2008 at 11:11 PM, Dennis Kubes <kubes@apache.org> wrote:
>> You can get the bytes using those methods and write them to a data output.
>> You would probably also want to write an int before them in the stream to
>> say how many bytes the object occupies.  If you want to avoid the Java
>> serialization process and translate an object to bytes yourself, that is a
>> little harder.
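>>
>> For example, a Writable wrapper could look roughly like this (a sketch
>> only; DocumentWritable is a made-up name, and fromObject/toObject are the
>> helper methods from my earlier mail, assumed to be visible here):
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.Writable;
>> import org.apache.lucene.document.Document;
>>
>> public class DocumentWritable implements Writable {
>>   private Document doc;  // the wrapped lucene document
>>
>>   public void write(DataOutput out) throws IOException {
>>     byte[] bytes = fromObject(doc);  // java-serialized document
>>     out.writeInt(bytes.length);      // length prefix first
>>     out.write(bytes);                // then the payload
>>   }
>>
>>   public void readFields(DataInput in) throws IOException {
>>     int length = in.readInt();       // read the length prefix
>>     byte[] bytes = new byte[length];
>>     in.readFully(bytes);             // then exactly that many bytes
>>     try {
>>       doc = (Document) toObject(bytes, 0);
>>     } catch (ClassNotFoundException e) {
>>       throw new IOException(e.toString());
>>     }
>>   }
>> }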
>>
>> Doing it involves using reflection to get the fields of an object
>> recursively and translating those fields into their byte equivalents.  It
>> just so happens that I have that functionality already developed.  We are
>> going to use it in Nutch 2 to make it easy to create complex writables
>> (see the rough sketch below).  Let me know if you would like the code and
>> I will send it to you.
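>>
>> As a very rough sketch of the general idea (toy code, not the Nutch
>> version; it handles only a couple of field types and ignores nulls,
>> arrays, and cycles):
>>
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import java.lang.reflect.Field;
>> import java.lang.reflect.Modifier;
>>
>> public class ReflectionWriter {
>>   public static void writeFields(Object obj, DataOutput out)
>>     throws IOException, IllegalAccessException {
>>     for (Field f : obj.getClass().getDeclaredFields()) {
>>       if (Modifier.isStatic(f.getModifiers())) continue; // skip statics
>>       f.setAccessible(true);
>>       Class<?> type = f.getType();
>>       if (type == int.class) {
>>         out.writeInt(f.getInt(obj));       // primitive -> bytes directly
>>       } else if (type == String.class) {
>>         out.writeUTF((String) f.get(obj)); // length-prefixed string
>>       } else {
>>         writeFields(f.get(obj), out);      // recurse into nested objects
>>       }
>>     }
>>   }
>> }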
>>
>> Also, I spoke too soon about the serialization / deserialization process.
>> Reading a document from a Lucene index will also lose the fields that are
>> not stored, so it may have nothing to do with the serialization process.
>>
>> Dennis
>>
>> Jim the Standing Bear wrote:
>>> Thanks for the quick response, Dennis.  However, your code snippet was
>>> about how to serialize/deserialize using
>>> ObjectInputStream/ObjectOutputStream.  Maybe it was my fault for not
>>> making the question clear enough - I was wondering if and how I can
>>> serialize/deserialize using only DataInput and DataOutput.
>>>
>>> This is because the Writable Interface defined by Hadoop has the
>>> following two methods:
>>>
>>> void readFields(DataInput in)
>>>     Deserialize the fields of this object from in.
>>> void write(DataOutput out)
>>>     Serialize the fields of this object to out.
>>>
>>> so I must start with DataInput and DataOutput, and work my way to
>>> ObjectInputStream and ObjectOutputStream.  Yet I have not found a way
>>> to go from DataInput to ObjectInputStream.  Any ideas?
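>>>
>>> (One way to bridge the two, as a sketch: adapt the DataInput to an
>>> InputStream, then layer the ObjectInputStream on top.  The class name
>>> here is made up.)
>>>
>>> import java.io.DataInput;
>>> import java.io.EOFException;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>>
>>> public class DataInputInputStream extends InputStream {
>>>   private final DataInput in;
>>>   public DataInputInputStream(DataInput in) { this.in = in; }
>>>   public int read() throws IOException {
>>>     try {
>>>       return in.readUnsignedByte(); // 0..255, as InputStream expects
>>>     } catch (EOFException e) {
>>>       return -1;                    // map EOF to the stream convention
>>>     }
>>>   }
>>> }
>>>
>>> // usage:
>>> // ObjectInputStream ois =
>>> //   new ObjectInputStream(new DataInputInputStream(in));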
>>>
>>> -- Jim
>>>
>>>
>>>
>>>
>>> On Tue, May 27, 2008 at 10:50 PM, Dennis Kubes <kubes@apache.org> wrote:
>>>> You can use something like the code below to go back and forth from
>>>> serializables.  The problem with lucene documents is that fields which
>>>> are not stored will be lost during the serialization / deserialization
>>>> process.
>>>>
>>>> Dennis
>>>>
>>>> public static Object toObject(byte[] bytes, int start)
>>>>  throws IOException, ClassNotFoundException {
>>>>
>>>>  if (bytes == null || bytes.length == 0 || start >= bytes.length) {
>>>>   return null;
>>>>  }
>>>>
>>>>  // wrap only the region from start on, rather than relying on skip()
>>>>  ByteArrayInputStream bais =
>>>>   new ByteArrayInputStream(bytes, start, bytes.length - start);
>>>>  ObjectInputStream ois = new ObjectInputStream(bais);
>>>>
>>>>  Object bObject = ois.readObject();
>>>>
>>>>  // closing the object stream also closes the underlying byte stream
>>>>  ois.close();
>>>>
>>>>  return bObject;
>>>> }
>>>>
>>>> public static byte[] fromObject(Serializable toBytes)
>>>>  throws IOException {
>>>>
>>>>  ByteArrayOutputStream baos = new ByteArrayOutputStream();
>>>>  ObjectOutputStream oos = new ObjectOutputStream(baos);
>>>>
>>>>  oos.writeObject(toBytes);
>>>>  oos.flush();
>>>>
>>>>  byte[] objBytes = baos.toByteArray();
>>>>
>>>>  // close the object stream last; it flushes into the byte stream
>>>>  oos.close();
>>>>
>>>>  return objBytes;
>>>> }
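>>>>
>>>> Round-trip usage looks like this (variable names made up):
>>>>
>>>> byte[] bytes = fromObject(document);           // Document is Serializable
>>>> Document copy = (Document) toObject(bytes, 0);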
>>>>
>>>>
>>>> Jim the Standing Bear wrote:
>>>>> Hello,
>>>>>
>>>>> I am not sure if this is a genuine hadoop question or more of a
>>>>> core-java question.  I am hoping to create a wrapper over Lucene
>>>>> Document, so that this wrapper can be used for the value field of a
>>>>> Hadoop SequenceFile, and therefore, this wrapper must also implement
>>>>> the Writable interface.
>>>>>
>>>>> Lucene's Document is already made serializable, which is quite nice.
>>>>> However, the Writable interface definition gives only DataInput and
>>>>> DataOutput, and I am having a hard time trying to figure out how to
>>>>> serialize/deserialize a lucene Document object using
>>>>> DataInput/DataOutput.  In other words, how do I go from DataInput to
>>>>> ObjectInputStream, or from DataOutput to ObjectOutputStream?  Thanks.
>>>>>
>>>>> -- Jim
>>>
>>>
> 
> 
> 
