crunch-dev mailing list archives

From "Ioannis Kerkinos (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-505) Store intermediate data in memory only using Tachyon
Date Tue, 31 Mar 2015 18:42:55 GMT


Ioannis Kerkinos commented on CRUNCH-505:

Hi Micah,
It does provide an implementation of the Hadoop FileSystem interface; you can find it here [1].
I also think it is quite straightforward to use. Changing the URI scheme to "tachyon" is enough,
as you can see in the example below.

Would it be ok if I were to start working on it? If so, do you maybe have some tips on where
to start? I've been working a bit with Tachyon for my master's thesis and I think this would
be a useful performance improvement for Crunch.


Spark/MapReduce without Tachyon
• Spark
  – val file = sc.textFile("hdfs://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output

Spark/MapReduce with Tachyon
• Spark
  – val file = sc.textFile("tachyon://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output
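
To make the "only the scheme changes" point concrete, here is a minimal sketch in plain Java. It uses nothing but java.net.URI; the class name SchemeSwap and the helper withScheme are hypothetical and are not part of Crunch, Hadoop, or Tachyon. It only illustrates that host, port, and path stay the same. Actually reading from the resulting URI would additionally assume that the Tachyon client is on the classpath and registered as the Hadoop FileSystem for that scheme.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeSwap {
    // Hypothetical helper: rebuilds the URI with a new scheme and keeps
    // every other component (user info, host, port, path, query, fragment).
    public static URI withScheme(URI original, String scheme) throws URISyntaxException {
        return new URI(scheme,
                       original.getUserInfo(),
                       original.getHost(),
                       original.getPort(),
                       original.getPath(),
                       original.getQuery(),
                       original.getFragment());
    }

    public static void main(String[] args) throws URISyntaxException {
        // Same example location as in the wordcount command above.
        URI hdfsInput = new URI("hdfs://localhost:19998/input");
        URI tachyonInput = withScheme(hdfsInput, "tachyon");
        System.out.println(tachyonInput); // prints tachyon://localhost:19998/input
    }
}
```

A helper like this could be applied wherever intermediate output paths are generated, leaving the final output paths on HDFS untouched.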


> Store intermediate data in memory only using Tachyon
> ----------------------------------------------------
>                 Key: CRUNCH-505
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Ioannis Kerkinos
>            Assignee: Josh Wills
> Tachyon is a memory-centric distributed storage system that enables reliable data sharing
> at memory speed. If used as the storage for intermediate data (between MR jobs), it should
> improve performance, as you won't have to go to HDFS. To do so, the MUST_CACHE write
> type of Tachyon can be used. This keeps the data in memory only, without
> going to HDFS. The intermediate data will then be read and written at memory speed, and only the
> final result will be written to HDFS.
