crunch-dev mailing list archives

From "Ioannis Kerkinos (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-505) Store intermediate data in memory only using Tachyon
Date Tue, 31 Mar 2015 18:42:55 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389140#comment-14389140
] 

Ioannis Kerkinos commented on CRUNCH-505:
-----------------------------------------

Hi Micah,
Tachyon does provide an implementation of the Hadoop FileSystem; you can find it here [1].
I also think it is quite straightforward to use: changing the URI scheme to "tachyon" is enough,
as you can see in the example below.

Would it be ok if I were to start working on it? If so, do you maybe have some tips on where
to start? I've been working a bit with Tachyon for my master's thesis and I think this would
be a useful performance improvement for Crunch.

==EXAMPLE==

Spark/MapReduce without Tachyon
• Spark
  – val file = sc.textFile("hdfs://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output

Spark/MapReduce with Tachyon
• Spark
  – val file = sc.textFile("tachyon://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output
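
To make the point concrete: Hadoop selects a FileSystem implementation from the scheme of the path URI, so on the client side the switch amounts to rewriting that scheme. The following is only a minimal sketch of that rewrite using plain java.net.URI. The class name SchemeSwap and the helper withScheme are hypothetical names made up for illustration; they are not part of Crunch, Hadoop, or Tachyon, and the sketch does not touch any cluster.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeSwap {
    // Hypothetical helper (not a Crunch or Tachyon API): returns a copy of
    // the given URI with only the scheme replaced; host, port, and path
    // are carried over unchanged.
    static URI withScheme(URI original, String scheme) throws URISyntaxException {
        return new URI(scheme, original.getUserInfo(), original.getHost(),
                original.getPort(), original.getPath(),
                original.getQuery(), original.getFragment());
    }

    public static void main(String[] args) throws URISyntaxException {
        // Same address as in the example above, only the scheme differs.
        URI hdfsInput = new URI("hdfs://localhost:19998/input");
        URI tachyonInput = withScheme(hdfsInput, "tachyon");
        System.out.println(tachyonInput); // prints tachyon://localhost:19998/input
    }
}
```

Whether Crunch can simply be handed such a path for its intermediate outputs is exactly the part I would still need to look into.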

[1] https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/hadoop/AbstractTFS.java

> Store intermediate data in memory only using Tachyon
> ----------------------------------------------------
>
>                 Key: CRUNCH-505
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-505
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Ioannis Kerkinos
>            Assignee: Josh Wills
>
> Tachyon is a memory-centric distributed storage system that enables reliable data sharing
> at memory speed. If used as the storage for intermediate data (between MR jobs), it should
> improve performance, as you won't have to go to HDFS. To do so, the MUST_CACHE write
> type of Tachyon can be used. This keeps the data in memory only, without
> going to HDFS. The intermediate data will thus be read and written at memory speed, and only the
> final result will be written to HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
