spark-dev mailing list archives

From Haoyuan Li <haoyuan...@gmail.com>
Subject Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Date Sun, 13 Jul 2014 18:47:01 GMT
Qingyang,

Are you asking about Spark or Shark? (Your first email said "Shark", but your
last one said "Spark".)

Best,

Haoyuan


On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1985@gmail.com>
wrote:

> Could I set a cache policy so that Spark loads data from Tachyon only once
> for all SQL queries? For example, by using CacheAllPolicy, FIFOCachePolicy,
> or LRUCachePolicy. I have tried all three policies, but none of them helped.
> I think that if Spark reloads the data for every SQL query, the queries
> will be slower than when the data is managed by Spark itself.
>
>
>
>
> 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan.li@gmail.com>:
>
> > Yes. For Shark, the two modes, "shark.cache=tachyon" and
> > "shark.cache=memory", have the same ser/de overhead. In Tachyon mode,
> > Shark loads data from outside of the process, with the following benefits:
> >
> >
> >    - In-memory data sharing across multiple Shark instances (i.e.
> >    stronger isolation)
> >    - Instant recovery of in-memory tables
> >    - Reduced heap size => faster GC in Shark
> >    - If the table is larger than the memory size, only the hot columns
> >    will be cached in memory
> >
> > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and
> > https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
> >
> > Haoyuan
> >
> >
> > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilikerps@gmail.com>
> > wrote:
> >
> > > Shark's in-memory format is already serialized (it's compressed and
> > > column-based).
> > >
> > >
> > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan <mridul@gmail.com>
> > > wrote:
> > >
> > > > You are ignoring serde costs :-)
> > > >
> > > > - Mridul
> > > >
> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson <ilikerps@gmail.com>
> > > > > wrote:
> > > > > > Tachyon should only be marginally less performant than memory_only,
> > > > > > because we mmap the data from Tachyon's ramdisk. We do not have to,
> > > > > > say, transfer the data over a pipe from Tachyon; we can directly read
> > > > > > from the buffers in the same way that Shark reads from its in-memory
> > > > > > columnar format.
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li <liqingyang1985@gmail.com>
> > > > > > wrote:
> > > > >
> > > > > >> Hi, when I create a table, I can specify the cache strategy using
> > > > > >> shark.cache. I think "shark.cache=memory_only" means the data is
> > > > > >> managed by Spark and lives in the same JVM as the executor, while
> > > > > >> "shark.cache=tachyon" means the data is managed by Tachyon, which is
> > > > > >> off-heap, so the data is not in the same JVM as the executor and
> > > > > >> Spark will load it from Tachyon for each SQL query. So, is Tachyon
> > > > > >> less efficient than the memory_only cache strategy?
> > > > > >> If yes, can we let Spark load all the data from Tachyon once for all
> > > > > >> SQL queries? I want to use the Tachyon cache strategy, since Tachyon
> > > > > >> is more HA than memory_only.
> > > > >>
> > > >
> > >
> >
> >
> >
> > --
> > Haoyuan Li
> > AMPLab, EECS, UC Berkeley
> > http://www.cs.berkeley.edu/~haoyuan/
> >
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
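
For readers following the mmap point in Aaron's quoted message, here is a
minimal sketch of what reading a file through a memory mapping looks like in
plain Java NIO. It is only an illustration under assumptions, not Tachyon's
or Shark's actual code: the class name MmapReadSketch, the method sumMapped,
and the temp file standing in for a block on a ramdisk are all hypothetical.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not taken from Tachyon or Shark.
class MmapReadSketch {

    // Maps the whole file read-only and sums its bytes straight from the
    // mapped buffer, without first copying the file into a heap byte[].
    static long sumMapped(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get();
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp file stands in for a block stored on a ramdisk.
        Path block = Files.createTempFile("column-block", ".bin");
        block.toFile().deleteOnExit();
        Files.write(block, new byte[] {1, 2, 3, 4});
        System.out.println(sumMapped(block)); // prints 10
    }
}
```

The mapped buffer is backed by the operating system's page cache rather than
the JVM heap, which is why a reader in another process can access the bytes
without sending them over a pipe or socket.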
