Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of yiming.sun@gmail.com
 designates 74.125.82.44 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <8FADD602-74E1-475C-A323-6FB68F41B7AB@thelastpickle.com>
References: 
 <CABxBLH_NAZ9M1p=Hav36xQ+ZHMSbOSy1-JJjkxYzJQp1HQkg6A@mail.gmail.com>
 <CAAam9ssrHaPXFpXN_TpNKoezM9KZ39Zf0H43N3G5Tn8fLoOyUg@mail.gmail.com>
 <CABxBLH80rMZUwB-KN5=teg6jDx0ifvC=NwcY3c7-R90+8wi8NQ@mail.gmail.com>
 <8FADD602-74E1-475C-A323-6FB68F41B7AB@thelastpickle.com>
From: Yiming Sun <yiming.sun@gmail.com>
Date: Wed, 16 May 2012 08:35:49 -0400
Message-ID: 
 <CABxBLH8jkJsCOEtjnhddStrTH2RZks7bk+jmBouanBrVFVYZoA@mail.gmail.com>
Subject: Re: need some clarification on recommended memory size
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=f46d0418263851933204c02690ff

--f46d0418263851933204c02690ff
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Thanks, Aaron.  I raised the question about memory requirements because we
are seeing very low read performance from Cassandra.

We are using Cassandra as the backend for an IR repository, and granted,
each column is very small (OCRed text).  Each row represents a book volume,
and the columns of the row represent the pages of that volume.  The average
column holds 2-3KB of text, and each row has about 250 columns (this varies
quite a bit from one volume to another).

The read rate I have been seeing is about 3MB/sec, and that is reading the
raw bytes; with the string serializer the rate is even lower, about
2.2MB/sec.  To retrieve each volume, I issue a slice query via Hector that
specifies the row key (the volume) and a list of column keys (pages), with
the consistency level set to ONE.  So I am a bit lost trying to figure out
how to increase the performance.  Using JNA may help, but a blog article
seems to say it only increases performance by 13%, which is not very
significant when the base rate is in single-digit MB/sec.
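
For scale, here is a rough back-of-the-envelope sketch of what those figures imply per volume. It uses only the averages quoted above (about 250 columns per row, 2-3KB per column, 3MB/sec) and assumes decimal units (1MB = 1000KB), which the message does not specify; the function and constant names are purely illustrative.

```python
# Back-of-the-envelope estimate from the averages quoted above.
# Assumption: decimal units (1 MB = 1000 KB); the message does not say which.
COLUMNS_PER_ROW = 250          # average pages (columns) per volume (row)
COLUMN_KB = (2.0, 3.0)         # average page size range, in KB
READ_RATE_MB_S = 3.0           # observed raw-byte read rate


def volumes_per_second(column_kb, rate_mb_s=READ_RATE_MB_S, columns=COLUMNS_PER_ROW):
    """Return how many average-sized volumes the observed rate delivers per second."""
    row_mb = columns * column_kb / 1000.0
    return rate_mb_s / row_mb


for kb in COLUMN_KB:
    row_kb = COLUMNS_PER_ROW * kb
    print(f"{kb:.0f}KB pages: ~{row_kb:.0f}KB per volume, "
          f"~{volumes_per_second(kb):.1f} volumes/sec")
```

Under those assumptions an average volume is roughly 500-750KB, so 3MB/sec corresponds to only about 4 to 6 volumes per second.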

Do you have any suggestions?

Oh, another thing: you mentioned memory mapped files.  Our environment is
virtualized, and the disks are actually on a SAN attached via Fibre Channel,
so I don't know whether that affects performance as well.  I would greatly
appreciate any help.  Thanks.

-- Y.

On Wed, May 16, 2012 at 5:48 AM, aaron morton <aaron@thelastpickle.com> wrote:

> The JVM will not swap out if you have JNA.jar in the path or you have
> disabled swap on the machine (the simplest thing to do).
>
> Cassandra uses memory mapped file access. If you have 16GB of ram, 8 will
> go to the JVM and the rest can be used by the os to cache files. (Plus the
> off heap stuff)
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 16/05/2012, at 11:12 AM, Yiming Sun wrote:
>
> Thanks Tyler... so my understanding is, even if Cassandra doesn't do
> off-heap caching, having enough memory minimizes the chance of swapping
> the Java heap to disk.  Is that correct?
>
> -- Y.
>
> On Tue, May 15, 2012 at 6:26 PM, Tyler Hobbs <tyler@datastax.com> wrote:
>
>> On Tue, May 15, 2012 at 3:19 PM, Yiming Sun <yiming.sun@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I was reading the Apache Cassandra 1.0 Documentation PDF dated May 10,
>>> 2012, and had some questions on what the recommended memory size is.
>>>
>>> Below is the snippet from the PDF.  Bullet 1 suggests having 16-32GB of
>>> RAM, yet Bullet 2 suggests limiting the Java heap size to no more than
>>> 8GB.  My understanding is that Cassandra is implemented purely in Java,
>>> so all memory it sees and uses is the JVM heap.
>>>
>>
>> The main way that additional RAM helps is through the OS page cache,
>> which will store hot portions of SSTables in memory. Additionally,
>> Cassandra can now do off-heap caching.
>>
>>
>>
>>>   So can someone help me understand the discrepancy between 16-32GB of
>>> RAM and 8GB of heap?  Thanks.
>>>
>>> == snippet ==
>>> Memory
>>> The more memory a Cassandra node has, the better read performance. More
>>> RAM allows for larger cache sizes and
>>> reduces disk I/O for reads. More RAM also allows memory tables
>>> (memtables) to hold more recently written data. Larger
>>> memtables lead to a fewer number of SSTables being flushed to disk and
>>> fewer files to scan during a read. The ideal
>>> amount of RAM depends on the anticipated size of your hot data.
>>>
>>> • For dedicated hardware, a minimum of 8GB of RAM is needed.
>>> DataStax recommends 16GB - 32GB.
>>>
>>> • Java heap space should be set to a maximum of 8GB or half of your
>>> total RAM, whichever is lower. (A greater
>>> heap size has more intense garbage collection periods.)
>>>
>>> • For a virtual environment use a minimum of 4GB, such as Amazon EC2
>>> Large instances. For production clusters
>>> with a healthy amount of traffic, 8GB is more common.
>>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
>>
>>
>
>
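
The heap-sizing rule quoted in the snippet above (a heap of at most 8GB or half of total RAM, whichever is lower) and the 16GB split described in Aaron's reply can be sketched as simple arithmetic. This is only an illustration of the quoted rule, not a tuning recommendation; `split_memory` and its parameter names are hypothetical, and how the remainder divides between the OS page cache and off-heap use is not specified anywhere in the thread.

```python
def split_memory(total_ram_gb, heap_cap_gb=8.0):
    """Apply the quoted rule: heap is the lower of 8GB or half of total RAM.

    Returns (heap_gb, remainder_gb). The remainder is what is left for the
    OS page cache and any off-heap use; the thread does not say how it is
    divided between the two.
    """
    heap_gb = min(heap_cap_gb, total_ram_gb / 2.0)
    return heap_gb, total_ram_gb - heap_gb


for ram in (8, 16, 32):
    heap, rest = split_memory(ram)
    print(f"{ram}GB RAM -> {heap:.0f}GB heap, "
          f"{rest:.0f}GB left for page cache/off-heap")
```

With 16GB of RAM this gives the 8GB heap and 8GB remainder mentioned in the reply; at 32GB the heap stays at 8GB and the extra RAM goes to the remainder.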

--f46d0418263851933204c02690ff--