gora-dev mailing list archives

From Renato Javier Marroquín Mogrovejo (JIRA) <j...@apache.org>
Subject [jira] [Commented] (GORA-401) Serialization and deserialization of Persistent does not hold the entity dirty state
Date Wed, 17 Dec 2014 16:57:13 GMT

    [ https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250112#comment-14250112 ]

Renato Javier Marroquín Mogrovejo commented on GORA-401:
--------------------------------------------------------

Hi guys,
I think [~lewismc] explained this pretty well on the mailing list. These files were
used when we had our old data beans compiler instead of Avro's. As for serializing the internal
fields of the entities, what do you mean? The __g__dirty field? That is just an in-memory data
structure used to check whether the entity has changed while we were holding it. Also, Ed did
this when he reworked the Avro compiler.
As for the example of passing the object from Map to Reduce: nothing guarantees that the Reducer
will run on the same machine as the Mapper. They may not share memory, which in turn means
that the entity has to be read from disk again, and at that point it should not be dirty
(you have only just read it, no changes yet). Dirty state is purely in-memory and per-process.
So what would happen if two mappers on different machines read the same entity and modify
it differently? How would the reducer determine which one is the correct one?
The Nutch problem is probably something else; I will try to look into it.

> Serialization and deserialization of Persistent does not hold the entity dirty state
> ------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on gora-0.5
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>   Original Estimate: 35h
>  Remaining Estimate: 35h
>
> After removing the __g__dirty field in GORA-326, the dirty field is no longer serialized. In GORA-321
{{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
went from using {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty field to Avro
(although it is not really desirable to have that field as a main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which will serialize
the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its phases, serializes
entities (from Map to Reduce), and on deserialization finds all fields marked as "dirty", regardless
of which fields were modified in the Map, and so overwrites all data in the datastore (deleting
many things: downloaded content, parsed content, etc.).
> This effect can be seen in {{TestPersistentSerialization#testSerderEmployeeTwoFields}}
when debugging in {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections
show that entities are "equal" when their fields are equal. This is fine as a definition of
"equal", but another test must be added to check that serialization and deserialization keep
the dirty state.
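
The behaviour described in the issue can be illustrated with a minimal, self-contained sketch. It does not use Gora or Avro: {{ToyEntity}}, {{DirtyStateDemo}} and their methods are hypothetical names invented for this illustration, and the assumption that deserialization populates fields through dirty-marking setters is a simplification. Whether Gora should carry dirty flags across serialization at all is the point under discussion in this thread; the sketch only shows the difference between the two approaches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical stand-in for a persistent entity; NOT the real Gora Persistent API.
class ToyEntity {
    String name;                         // field index 0
    int salary;                          // field index 1
    final BitSet dirty = new BitSet(2);  // in-memory dirty flags, one bit per field

    void setName(String v) { name = v; dirty.set(0); }
    void setSalary(int v)  { salary = v; dirty.set(1); }
}

class DirtyStateDemo {
    // Writes the field values and, only if requested, the dirty flags after them.
    static byte[] write(ToyEntity e, boolean withDirty) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(e.name);
            out.writeInt(e.salary);
            if (withDirty) {
                byte[] bits = e.dirty.toByteArray();
                out.writeInt(bits.length);
                out.write(bits);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads the values through the setters (assumption: this is what marks
    // every field dirty), then restores the original flags if they were written.
    static ToyEntity read(byte[] data, boolean withDirty) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            ToyEntity e = new ToyEntity();
            e.setName(in.readUTF());
            e.setSalary(in.readInt());
            if (withDirty) {
                byte[] bits = new byte[in.readInt()];
                in.readFully(bits);
                e.dirty.clear();
                e.dirty.or(BitSet.valueOf(bits));
            }
            return e;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Serializes and immediately deserializes, as a Map-to-Reduce hand-off would.
    static ToyEntity roundTrip(ToyEntity e, boolean withDirty) {
        return read(write(e, withDirty), withDirty);
    }
}
```

For an entity whose two fields were loaded cleanly and where only {{salary}} was modified afterwards, {{roundTrip(e, false)}} returns a copy with both flags set, while {{roundTrip(e, true)}} returns a copy with only the {{salary}} flag set. A test of the kind the issue asks for would compare the dirty flags before and after the round trip, in addition to the field values.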



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
