atlas-dev mailing list archives

From Madhan Neethiraj <mad...@apache.org>
Subject Re: Review Request 57649: Export API: ZIP File Size Optimization
Date Thu, 16 Mar 2017 00:06:26 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/#review169083
-----------------------------------------------------------




repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java
Lines 31 (patched)
<https://reviews.apache.org/r/57649/#comment241400>

    Instead of adding nextWithExtInfo(), consider updating AtlasEntityStream.next() to return AtlasEntityWithExtInfo - as shown below:
    
    class AtlasEntityStream {
      public AtlasEntityWithExtInfo next() {
        return iterator.hasNext() ? new AtlasEntityWithExtInfo(iterator.next(), this.entitiesWithExtInfo) : null;
      }
    }
    
    With this change, the following methods can be removed:
     AtlasEntityStreamForImport.nextWithExtInfo()
     AtlasEntityStreamForImport.getByGuid()
     EntityImportStream.nextWithExtInfo()



webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Line 224 (original), 215 (patched)
<https://reviews.apache.org/r/57649/#comment241402>

    entityWithExtInfo.getReferredEntities() - this could be null. Please review all usages and handle this case.
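
    One way to handle the null case is sketched below. This is only an illustration: EntityWithExtInfoSketch is a hypothetical stand-in for AtlasEntityWithExtInfo (entities are simplified to String), and referredOrEmpty() is an assumed helper name, not an existing Atlas method.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for AtlasEntityWithExtInfo; only the accessor under discussion is modeled.
class EntityWithExtInfoSketch {
    private final Map<String, String> referredEntities; // guid -> entity (simplified to String)

    EntityWithExtInfoSketch(Map<String, String> referredEntities) {
        this.referredEntities = referredEntities;
    }

    Map<String, String> getReferredEntities() {
        return referredEntities; // may be null
    }
}

class ReferredEntitiesGuard {
    // Assumed helper: returns an empty map instead of null so callers can iterate safely.
    static Map<String, String> referredOrEmpty(EntityWithExtInfoSketch entity) {
        Map<String, String> referred = (entity == null) ? null : entity.getReferredEntities();
        return (referred == null) ? Collections.<String, String>emptyMap() : referred;
    }

    // Example caller: counts referred entities without an explicit null check.
    static int countReferred(EntityWithExtInfoSketch entity) {
        int count = 0;
        for (String guid : referredOrEmpty(entity).keySet()) {
            count++;
        }
        return count;
    }
}
```

    Routing every caller through a single helper like this keeps the null check in one place rather than repeating it at each usage.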



webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Lines 294 (patched)
<https://reviews.apache.org/r/57649/#comment241403>

    Consider passing direction as a parameter, i.e. addToBeProcessed(guid, isLineage, direction), and removing lines #293, #297 and #343.
    
    addToBeProcessed() can then update the direction when isLineage=true, if needed.
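
    A rough sketch of the suggested signature follows. Everything except the method name and its three parameters is an assumption made for illustration: ExportContextSketch, the TraversalDirection enum and its values, and the field names are hypothetical and do not reflect the actual ExportService internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping class; field and enum names are assumptions.
class ExportContextSketch {
    enum TraversalDirection { UNKNOWN, INWARD, OUTWARD, BOTH }

    final List<String> guidsToProcess = new ArrayList<>();
    final List<String> guidsLineageToProcess = new ArrayList<>();
    final Map<String, TraversalDirection> guidDirection = new HashMap<>();

    // Queues the guid once and, for lineage entities only, records the direction here
    // so that callers no longer need to set it separately.
    void addToBeProcessed(String guid, boolean isLineage, TraversalDirection direction) {
        if (isLineage) {
            if (!guidsLineageToProcess.contains(guid)) {
                guidsLineageToProcess.add(guid);
            }
            guidDirection.put(guid, direction); // the most recent direction wins in this sketch
        } else if (!guidsToProcess.contains(guid)) {
            guidsToProcess.add(guid);
        }
    }
}
```

    Whether a repeated call should overwrite or merge the stored direction is a design decision this sketch does not try to settle.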



webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Line 398 (original), 395 (patched)
<https://reviews.apache.org/r/57649/#comment241404>

    "Object" ==> "Boolean"?



webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Lines 439 (patched)
<https://reviews.apache.org/r/57649/#comment241405>

    Consider renaming: "ListOptmizedForContains" ==> "UniqueList"



webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Lines 463 (patched)
<https://reviews.apache.org/r/57649/#comment241406>

    list.addAll() may end up adding duplicate items to the list. Consider iterating over 's' and adding only the elements that are not already present in 'set'.
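
    A minimal sketch of a duplicate-safe addAll() is shown below. It assumes the class keeps a List for ordering and a Set for fast contains(), as the comment above implies; the class name UniqueListSketch and its methods are illustrative, not the code in the patch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative list that preserves insertion order and rejects duplicates.
class UniqueListSketch<T> {
    private final List<T> list = new ArrayList<>();
    private final Set<T> set = new HashSet<>();

    // Set.add() returns false when the element is already present,
    // so the list is only appended to for new elements.
    void add(T element) {
        if (set.add(element)) {
            list.add(element);
        }
    }

    // Iterates the input instead of calling list.addAll(), so duplicates
    // (within 's' or against existing content) are skipped.
    void addAll(Collection<T> s) {
        for (T element : s) {
            add(element);
        }
    }

    boolean contains(T element) {
        return set.contains(element);
    }

    int size() {
        return list.size();
    }

    T get(int index) {
        return list.get(index);
    }
}
```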


- Madhan Neethiraj


On March 15, 2017, 11:04 p.m., Ashutosh Mestry wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57649/
> -----------------------------------------------------------
> 
> (Updated March 15, 2017, 11:04 p.m.)
> 
> 
> Review request for atlas and Madhan Neethiraj.
> 
> 
> Bugs: ATLAS-1503
>     https://issues.apache.org/jira/browse/ATLAS-1503
> 
> 
> Repository: atlas
> 
> 
> Description
> -------
> 
> **Background**
> ==============
> The existing implementation of the Export API, w.r.t. ZIP file generation, adds 1 *.json* file per entity. This makes ZIP file creation inefficient. The ZIP files are 75% larger than what would be possible with fewer *.json* file entries.
> 
> **Solution**
> ============
> The implementation uses the new v2 API *AtlasEntityWithExtInfo* representation instead of *AtlasEntity*. This format combines an entity with its related entities as one. E.g. *hive_table* will contain all the *hive_columns* that it is made up of. (See example section below.)
> 
> This results in a significant reduction in the number of generated *JSON* files, which in turn reduces the size of the generated *ZIP* file.
> 
> **Implementation Details**
> ==========================
> *Export API*
> - Modified the *Gremlin* query used to fetch connected entities to return the *guid* with a *boolean* indicating whether the entity is a process.
> - _ExportService_ Modified implementation to fetch *AtlasEntityWithExtInfo* instead of *AtlasEntity*. Modified bookkeeping to save *process* (lineage) entities after all non-process entities are saved.
> - _ZipSink_ Minor modification to serialize *AtlasEntityWithExtInfo*.
> 
> *Import API*
> - _ZipSource_ Modified to source *AtlasEntityWithExtInfo*.
> - _EntityImportStream_ Modified to source *AtlasEntityWithExtInfo*.
> - _AtlasEntityStreamForImport.getByGuid_ Modified to source requested entities first from the stored *AtlasEntityWithExtInfo* object. Requests from the stream only if not found.
> - _AtlasEntityStoreV1.bulkImport_ Minor modification to use the new changes to stream.
> 
> 
> **Functional Areas Impacted**
> =============================
> *Export*
> - Full
> - Connected
> - HDFS path-based import.
> 
> *Import*
> - Regular flow.
> 
> **Examples**
> ============
> Case of *hive_db*: Within the GraphDB, the database has inward edges from the objects that refer to it (tables, in this case). So the *AtlasEntityWithExtInfo* for a database will not have any referred entities.
> 
> Case of *hive_table*: Within the GraphDB, the table has outward edges pointing to the columns it is made up of. It also has edges pointing to the database and storage descriptor. Hence, the *AtlasEntityWithExtInfo* for a table will have the full representation of all the columns, and references to the database and storage descriptor.
> 
> **Metrics**
> ===========
> 
> Date | File Size | No. of Entities | Export Duration |
> -----|-----------|-----------------|-----------------|
> 3/02 |    180 MB |          202930 |         22 mins |
> 3/08 |      7 KB |               3 |          5 secs |
> -----|-----------|-----------------|-----------------|
>                     Improvement                      |
> -----|-----------|-----------------|-----------------|
> 3/14 |     38 MB |          202930 |         19 mins |
> 3/14 |      5 KB |               3 |          5 secs |
> 
> 
> **Summary**
> ===========
> With these changes the file size reduction is: ~65%.
> 
> 
> Diffs
> -----
> 
>   intg/src/main/java/org/apache/atlas/model/impexp/AtlasExportResult.java e6a967e 
>   intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java 4e3895d 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStoreV1.java cce3fca 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStream.java 5d9a7d4 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java 8cb36ac 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/EntityImportStream.java 73994b9 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin2QueryProvider.java 4743b73 
>   webapp/src/main/java/org/apache/atlas/web/resources/AdminResource.java 31a4cf9 
>   webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java c1891e0 
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSink.java 2e4cb01 
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSource.java a69f7fa 
> 
> 
> Diff: https://reviews.apache.org/r/57649/diff/1/
> 
> 
> Testing
> -------
> 
> Test data:
> - QuickStart_v1: 3 databases.
> - A *hive_db* with 922 tables.
> - Stocks *hive_db* with 1 database, table, process and 5 columns.
> - A *hive_db* with 522K entities.
> 
> The changes impact all the flows in the Export & Import APIs.
> Unit testing: Manual.
> Integration testing: Manual.
> Accuracy testing: Manual. Verified using Export -> Import -> Export -> file compare.
> 
> 
> Thanks,
> 
> Ashutosh Mestry
> 
>

