gora-dev mailing list archives

From "Alfonso Nishikawa (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GORA-401) Serialization and deserialization of Persistent does not hold the entity dirty state
Date Tue, 13 Jan 2015 00:35:35 GMT

     [ https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alfonso Nishikawa updated GORA-401:
-----------------------------------
    Attachment: GORA-401-tests.patch

Test that checks the dirty state after serialization when running MapReduce on HBase. For some
odd reason, this time the dirty state after deserialization is "not dirty" (unlike on the other
machine I used before).

The test implements a mapper that gets values from Query("url"), modifies that url, and emits
the same key with the modified value, and a reducer that just emits the values.
The expected final result is the original WebPage values with "url" changed to "hola" in all
WebPages (as changed on the map side).
The test only checks the fields "url" (= "hola") and content (!= null).

If dirty state after deserialization is always true or always false, the test will fail.
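
To illustrate why both extremes fail, here is a minimal, self-contained sketch. It does not use
Gora: SketchPage, DirtyPolicy and DirtyStateSketch are hypothetical names, and the "datastore" is
a plain map whose put writes only dirty fields (an assumption that mirrors the behaviour described
in this issue, not the actual HBaseStore code).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a Gora entity: two fields plus per-field dirty flags. */
class SketchPage {
    static final int URL = 0, CONTENT = 1;
    final String[] values = new String[2];
    final boolean[] dirty = new boolean[2];

    void set(int field, String value) {
        values[field] = value;
        dirty[field] = true; // setters mark the field as modified
    }
}

/** How the dirty flags come out of a (simulated) serialization round trip. */
enum DirtyPolicy { PRESERVED, ALWAYS_DIRTY, ALWAYS_CLEAN }

class DirtyStateSketch {
    /** Simulates the map -> reduce transfer: values always survive, flags depend on the policy. */
    static SketchPage roundTrip(SketchPage in, DirtyPolicy policy) {
        SketchPage out = new SketchPage();
        for (int i = 0; i < in.values.length; i++) {
            out.values[i] = in.values[i];
            switch (policy) {
                case PRESERVED: out.dirty[i] = in.dirty[i]; break;
                case ALWAYS_DIRTY: out.dirty[i] = true; break;
                default: out.dirty[i] = false; break;
            }
        }
        return out;
    }

    /** Simulates a datastore put that writes only dirty fields (null values included). */
    static void put(Map<Integer, String> row, SketchPage page) {
        for (int i = 0; i < page.values.length; i++) {
            if (page.dirty[i]) {
                row.put(i, page.values[i]);
            }
        }
    }

    /** Scenario from the test: load only "url", change it in the map, store after the reduce. */
    static Map<Integer, String> run(DirtyPolicy policy) {
        Map<Integer, String> row = new HashMap<>();
        row.put(SketchPage.URL, "http://example.org/");
        row.put(SketchPage.CONTENT, "downloaded content");

        SketchPage page = new SketchPage();             // the query loaded only "url"
        page.values[SketchPage.URL] = row.get(SketchPage.URL);
        page.set(SketchPage.URL, "hola");               // map-side modification

        put(row, roundTrip(page, policy));
        return row;
    }
}
```

In this sketch, ALWAYS_DIRTY overwrites the stored content with null (so the content != null check
fails), ALWAYS_CLEAN never writes the new url (so the url == "hola" check fails), and only
PRESERVED satisfies both checks.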

Please check and comment if I am doing something wrong in the tests.

I will now implement the fix to see whether the tests pass.

Thanks!

> Serialization and deserialization of Persistent does not hold the entity dirty state
> ------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on gora-0.5. HBase
backend.
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>         Attachments: GORA-401-tests.patch
>
>   Original Estimate: 35h
>  Remaining Estimate: 35h
>
> After the __g__dirty field was removed in GORA-326, the dirty state is no longer serialized. In GORA-321
{{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
went from using {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty field to Avro
(but it is really not desirable to have that field as a main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which will serialize
the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its phases, serializes
entities (from Map to Reduce), and on deserialization finds all fields marked "dirty", regardless
of which fields were modified in the Map, and so overwrites all data in the datastore (deleting many
things: downloaded content, parsed content, etc.).
> This effect can be seen in {{TestPersistentSerialization#testSerderEmployeeTwoFields}}
when debugging in {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections
show that entities are "equal" when their fields are equal. This is fine as a definition of "equal",
but another test must be added to check that serialization and deserialization keep the dirty
state.
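
The idea of serializing the internal dirty state next to the field values, and of testing that it
survives a round trip, can be sketched as follows. This is only an illustration under stated
assumptions: SketchEntity and DirtyAwareCodec are hypothetical names, the wire format is invented
for the example, and none of it is the actual PersistentDatumWriter/Reader or Avro encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical entity with two nullable string fields and per-field dirty flags. */
class SketchEntity {
    final String[] values = new String[2];
    final boolean[] dirty = new boolean[2];
}

class DirtyAwareCodec {
    /** Writes the dirty flags first, then each field as (present?, value). */
    static byte[] write(SketchEntity e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (boolean d : e.dirty) {
                out.writeBoolean(d);
            }
            for (String v : e.values) {
                out.writeBoolean(v != null);
                if (v != null) {
                    out.writeUTF(v);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Reads the flags and fields back in the same order they were written. */
    static SketchEntity read(byte[] data) {
        SketchEntity e = new SketchEntity();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < e.dirty.length; i++) {
                e.dirty[i] = in.readBoolean();
            }
            for (int i = 0; i < e.values.length; i++) {
                e.values[i] = in.readBoolean() ? in.readUTF() : null;
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return e;
    }
}
```

A round-trip test in this style would compare the dirty flags as well as the field values, which
is exactly what an equality check on the fields alone does not cover.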



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
