gora-dev mailing list archives

From "Alfonso Nishikawa (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GORA-401) Serialization and deserialization of Persistent does not hold the entity dirty state from Map to Reduce
Date Wed, 21 Jan 2015 17:47:35 GMT

    [ https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285963#comment-14285963 ]

Alfonso Nishikawa commented on GORA-401:
----------------------------------------

Hi all, and hi [~renato2099] (answering).
I unassigned myself from the issue because it is a VERY complex issue and help will be needed
for the datastores other than HBase. I uploaded a Q&D (Quick & Dirty) patch that surely works
for HBase, while the other datastores must be checked (I don't have time at this moment to
run a complete mvn test; I have to leave in a few minutes). An elegant, well-engineered solution
seems a bit hard, so what I did is revert part of GORA-321 (plus some additional work). I will
detail the changes below.

Before that, let me just say that _I think_ I managed to get the dirty bytes serialized in
Map Reduce and not serialized when persisting.

Why is it Quick and Dirty?
- There is a need for a `FakeResolvingDecoder` (reverted), but Avro's `ResolvingDecoder` now
has a package-private constructor (in gora 0.3 it was public). So I had to put `FakeResolvingDecoder`
in that package (dirty, dirty!) and export it in osgi. *Please, someone check what I wrote in
osgi.export in `gora-core/pom.xml`*. I don't know anything about osgi and I wrote it as best
as I could.
- Recreated MockPersistent, which was outdated and for which no .json existed.
- Reverted a plethora of classes.
- Modified `PersistentDatumReader#readRecord` to return `Object` because sometimes it returns
a record and sometimes it returns other things (specifically in unions). This was not happening
in Gora-0.3, or at least it was not detected. A big pain to debug and fix.
- HBaseByteInterface#toBytes() / fromBytes() use SpecificDatumWriter/Reader, so no dirty
bytes are serialized/deserialized when writing to the datastore.
- Public `getDirtyBytes()` and `setDirtyBytes()` are needed in PersistentBase to get and restore
the dirty bytes when serializing.
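
A minimal sketch of the idea in the last two points, assuming a hypothetical two-field entity
and a hand-rolled binary layout. `SketchEntity` and `SketchSerde` are illustrative names only:
they are not Gora or Avro classes and this is not the code in the patch. It only shows dirty
bytes travelling with the entity on the Map-to-Reduce path and being left out on the datastore
path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for an entity; NOT the real PersistentBase. */
class SketchEntity {
    String name;   // field index 0
    long salary;   // field index 1
    private byte[] dirtyBytes = new byte[1]; // one bit per field

    void setName(String v) { name = v; dirtyBytes[0] |= 1; }
    void setSalary(long v) { salary = v; dirtyBytes[0] |= 2; }
    boolean isDirty(int fieldIndex) { return (dirtyBytes[0] & (1 << fieldIndex)) != 0; }

    // Accessors in the spirit of getDirtyBytes()/setDirtyBytes() mentioned above.
    byte[] getDirtyBytes() { return dirtyBytes.clone(); }
    void setDirtyBytes(byte[] bytes) { dirtyBytes = bytes.clone(); }
}

/** Hypothetical serializer with two paths: shuffle (with dirty bytes) and store (without). */
class SketchSerde {
    /** Datastore path: field values only, no dirty bytes. A null name is written as "". */
    static byte[] writeForStore(SketchEntity e) {
        byte[] name = (e.name == null ? "" : e.name).getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + name.length + 8)
                .putInt(name.length).put(name).putLong(e.salary).array();
    }

    /** Map-to-Reduce path: dirty bytes first, then the same field encoding. */
    static byte[] writeForShuffle(SketchEntity e) {
        byte[] dirty = e.getDirtyBytes();
        byte[] fields = writeForStore(e);
        return ByteBuffer.allocate(4 + dirty.length + fields.length)
                .putInt(dirty.length).put(dirty).put(fields).array();
    }

    /** Reads what writeForShuffle produced and restores the original dirty state. */
    static SketchEntity readFromShuffle(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte[] dirty = new byte[buf.getInt()];
        buf.get(dirty);
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        SketchEntity e = new SketchEntity();
        e.name = new String(name, StandardCharsets.UTF_8);
        e.salary = buf.getLong();
        e.setDirtyBytes(dirty); // restore last, so filling the fields cannot mark them dirty
        return e;
    }
}
```

In this sketch an entity with only `salary` modified comes back from `readFromShuffle` with
only that field dirty, while `writeForStore` produces the shorter encoding without dirty bytes.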

Maybe some tests that check the dirty state will have to be improved.

(And the code in PersistentDatumWriter must be improved; please don't look at it. I am embarrassed
and I had no time to fix it.)

Comments? Help with the rest of datastores?

Thanks!

> Serialization and deserialization of Persistent does not hold the entity dirty state from Map to Reduce
> -------------------------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on gora-0.5. HBase backend.
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>         Attachments: GORA-401-tests.patch, GORA-401v1.patch
>
>   Original Estimate: 35h
>          Time Spent: 21h
>  Remaining Estimate: 14h
>
> After removing the __g__dirty field in GORA-326, the dirty field is no longer serialized. In GORA-321,
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
> went from using {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
> to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty field to Avro
> (although it is really not desirable to have that field as a main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}}, which will serialize
> the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its phases, serializes
> entities (from Map to Reduce), and on deserialization finds all fields marked as "dirty", independently
> of which fields were modified in the Map, and so overwrites all data in the datastore (deleting many
> things: downloaded content, parsed content, etc).
> This effect can be seen in {{TestPersistentSerialization#testSerderEmployeeTwoFields}}
> when debugging in {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections
> show that entities are "equal" when their fields are equal. This is fine as a definition of "equal",
> but another test must be added to check that serialization and deserialization keep the dirty
> state.
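
The kind of extra check the description asks for could look roughly like the sketch below.
It is only an illustration under stated assumptions: `DirtyAwareRecord` and `DirtyStateCheck`
are hypothetical names, not classes from Gora or its test suite, and the dirty state is modelled
as a plain byte array. The point is that `equals()` compares field values only, so a round-trip
test has to compare the dirty state separately.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical entity whose equals() looks only at field values, as the issue describes. */
final class DirtyAwareRecord {
    final String name;
    final byte[] dirtyBytes; // assumed representation of the dirty state

    DirtyAwareRecord(String name, byte[] dirtyBytes) {
        this.name = name;
        this.dirtyBytes = dirtyBytes.clone();
    }

    @Override public boolean equals(Object o) {
        return o instanceof DirtyAwareRecord
                && Objects.equals(name, ((DirtyAwareRecord) o).name);
    }

    @Override public int hashCode() { return Objects.hashCode(name); }
}

/** Hypothetical helper for a round-trip test. */
final class DirtyStateCheck {
    /** True only if both the field values and the dirty state match. */
    static boolean sameFieldsAndDirtyState(DirtyAwareRecord a, DirtyAwareRecord b) {
        return a.equals(b) && Arrays.equals(a.dirtyBytes, b.dirtyBytes);
    }
}
```

With such a helper, an entity that comes back with every field marked dirty would still pass
`equals()` but fail `sameFieldsAndDirtyState`, which is the difference the current test misses.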



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
