nifi-users mailing list archives

From: Corey Flowers <cflow...@onyxpoint.com>
Subject: Re: Nifi hardware recommendation
Date: Fri, 14 Oct 2016 21:03:04 GMT
We actually use heap sizes from 32 to 64 GB for ours, but our volumes and graphs are both
extremely large. I believe the smaller recommended heap sizes were a limitation of the garbage
collection in Java 7. We also moved to SSD drives, which did help throughput quite a bit. Our
systems were actually requesting the creation and removal of file handles faster than
traditional disks could keep up with (we believe). In addition, unlike with traditional
drives, where we tried to minimize caching, we actually forced more disk caching when we moved
to SSDs. We are still waiting to see the results of that on our volumes, although it does seem
to have helped. Also remember that, depending on how you code them, individual processors can
use system memory outside of the heap, so you need to take that into consideration when
designing the servers.
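
For reference, the heap itself is set in conf/bootstrap.conf; a minimal sketch (the 16g values
here are just an example, not a recommendation):

    # conf/bootstrap.conf -- JVM heap settings (example values only)
    java.arg.2=-Xms16g
    java.arg.3=-Xmx16g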

Sent from my iPhone

> On Oct 14, 2016, at 1:36 PM, Joe Witt <joe.witt@gmail.com> wrote:
> 
> Russ,
> 
> You can definitely find a lot of material on the Internet about Java heap sizes, types of
> garbage collectors, and application usage patterns.  By all means, please do experiment with
> different sizes appropriate for your case.  We're not saying NiFi itself has any problem with
> large heaps.
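> 
> For example, the collector can also be chosen in conf/bootstrap.conf; a minimal sketch (the
> arg number just needs to be unique, and G1 is shown as one option, not a prescription):
> 
>     java.arg.13=-XX:+UseG1GC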
> 
> Thanks
> Joe
> 
>> On Fri, Oct 14, 2016 at 12:44 PM, Russell Bateman <russell.bateman@perfectsearchcorp.com> wrote:
>> Ali,
>> 
>> "not recommended to dedicate more than 8-10 GM to JVM heap space" by whom? Do you
have links/references establishing this? I couldn't find anyone saying this or why.
>> 
>> Russ
>> 
>>> On 10/13/2016 05:47 PM, Ali Nazemian wrote:
>>> Hi,
>>> 
>>> I have another question regarding the hardware recommendation. As far as I found out,
>>> NiFi currently uses on-heap memory, and it will not try to load whole objects into memory.
>>> From the garbage collection perspective, it is not recommended to dedicate more than
>>> 8-10 GB to JVM heap space. In that case, may I say that spending money on extra system
>>> memory is pointless? Probably 16 GB per system is enough with this architecture, unless
>>> some architecture change appears in the future to use off-heap memory as well. However, I
>>> found some articles about best practices, and their memory recommendations do not make
>>> sense to me. Would you please clarify this part for me?
>>> Thank you very much.
>>> 
>>> Best regards,
>>> Ali
>>> 
>>> 
>>> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <alinazemian@gmail.com> wrote:
>>>> Thank you very much. 
>>>> I would be more than happy to provide some benchmark results after the implementation.
>>>> 
>>>> Sincerely yours,
>>>> Ali
>>>> 
>>>>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <joe.witt@gmail.com> wrote:
>>>>> Ali,
>>>>> 
>>>>> I agree with your assumption.  It would be great to test that out and provide some
>>>>> numbers, but intuitively I agree.
>>>>> 
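>>>>> For a quick check, a tool like fio can compare sequential and random behaviour on the
>>>>> actual drives; a rough sketch (directories and sizes are placeholders, not a tuned
>>>>> benchmark):
>>>>> 
>>>>>     # sequential 1 MB writes, roughly the content repository pattern
>>>>>     fio --name=seq --directory=/path/to/content_repo --rw=write --bs=1M --size=4g --direct=1
>>>>>     # small random writes, closer to the flowfile/provenance pattern
>>>>>     fio --name=rand --directory=/path/to/flowfile_repo --rw=randwrite --bs=4k --size=1g --direct=1
>>>>> 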
>>>>> I could envision certain scatter/gather data flows that could challenge that sequential
>>>>> access assumption, but honestly, with how awesome disk caching is in Linux these days, I
>>>>> think practically speaking this is the right way to think about it.
>>>>> 
>>>>> Thanks
>>>>> Joe
>>>>> 
>>>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <alinazemian@gmail.com> wrote:
>>>>>> Dear Joe,
>>>>>> 
>>>>>> Thank you very much. That was a really great explanation. 
>>>>>> I investigated the NiFi architecture, and it seems that most of the read/write
>>>>>> operations for the flowfile repo and provenance repo are random, while for the content
>>>>>> repo most of the read/write operations are sequential. Let's say cost does not matter.
>>>>>> In this case, even choosing SSD for the content repo cannot provide a huge performance
>>>>>> gain over HDD. Am I right? Hence, it would be better to spend the content repo SSD money
>>>>>> on network infrastructure.
>>>>>> 
>>>>>> Best regards,
>>>>>> Ali
>>>>>> 
>>>>>> 
>>>>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <joe.witt@gmail.com> wrote:
>>>>>>> Ali,
>>>>>>> 
>>>>>>> You have a lot of nice resources to work with there.  I'd personally recommend the
>>>>>>> series-of-RAID-1 configuration, provided you keep in mind that this means you can only
>>>>>>> lose a single disk in any one partition.  As long as they're being monitored and would
>>>>>>> be quickly replaced, this works well in practice.  If there could be lapses in
>>>>>>> monitoring or in time to replace, then it is perhaps safer to go with more redundancy
>>>>>>> or an alternative RAID type.
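>>>>>>> 
>>>>>>> If these end up as Linux software RAID, a rough sketch of creating and watching one such
>>>>>>> mirror (device names are placeholders; hardware RAID would have its own tooling):
>>>>>>> 
>>>>>>>     mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
>>>>>>>     cat /proc/mdstat            # quick health check
>>>>>>>     mdadm --detail /dev/md0     # per-array status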
>>>>>>> 
>>>>>>> I'd say put the OS, the app installs with the user and audit DB stuff, and the
>>>>>>> application logs on one physical RAID volume.  Have a dedicated physical volume for the
>>>>>>> flowfile repository.  It will not be able to use all the space, but it certainly could
>>>>>>> benefit from having no other contention.  This could be a great thing to have SSDs for,
>>>>>>> actually.  And for the remaining volumes, split them up for content and provenance as
>>>>>>> you have.  You get to make the overall performance versus retention decision.  Frankly,
>>>>>>> you have a great system to work with and I suspect you're going to see excellent
>>>>>>> results anyway.
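>>>>>>> 
>>>>>>> For reference, the repositories are pointed at those volumes in conf/nifi.properties; a
>>>>>>> rough sketch of that mapping (the mount points are placeholders):
>>>>>>> 
>>>>>>>     # conf/nifi.properties -- example mount points only
>>>>>>>     nifi.flowfile.repository.directory=/ssd1/flowfile_repository
>>>>>>>     nifi.content.repository.directory.default=/data1/content_repository
>>>>>>>     nifi.provenance.repository.directory.default=/data2/provenance_repository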
>>>>>>> 
>>>>>>> Conservatively speaking, expect say 50 MB/s of throughput per volume in the content
>>>>>>> repository, so if you end up with 8 of them you could achieve upwards of 400 MB/s
>>>>>>> sustained.  You'll also then want to make sure you have a good 10G-based network setup
>>>>>>> as well.  Or, you could dial back on the speed tradeoff and simply increase retention
>>>>>>> or disk loss tolerance.  Lots of ways to play the game.
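>>>>>>> 
>>>>>>> (Rough arithmetic: 8 volumes x 50 MB/s = 400 MB/s, which is about 3.2 Gb/s of payload
>>>>>>> before protocol overhead, so a 1G link would already be the bottleneck while 10G leaves
>>>>>>> headroom.)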
>>>>>>> 
>>>>>>> There are no published SSD vs HDD performance benchmarks that I am aware of, though
>>>>>>> this is a good idea.  Having a hybrid of SSDs and HDDs could offer a really solid
>>>>>>> performance/retention/cost tradeoff.  For example, having SSDs for the
>>>>>>> OS/logs/provenance/flowfile with HDDs for the content - that would be quite nice.  At
>>>>>>> that rate, to take full advantage of the system you'd need very strong network
>>>>>>> infrastructure between NiFi and any systems it is interfacing with, and your flows
>>>>>>> would need to be well tuned for GC/memory efficiency.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Joe 
>>>>>>> 
>>>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <alinazemian@gmail.com> wrote:
>>>>>>>> Dear NiFi users/developers,
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I was wondering whether there is any benchmark on the question of whether it is better
>>>>>>>> to dedicate disk control to NiFi or to use RAID for this purpose. For example, which of
>>>>>>>> these scenarios is recommended from a performance point of view?
>>>>>>>> Scenario 1:
>>>>>>>> 24 disks in total
>>>>>>>> 2 disks - RAID 1 for OS and flowfile repo
>>>>>>>> 2 disks - RAID 1 for provenance repo1
>>>>>>>> 2 disks - RAID 1 for provenance repo2
>>>>>>>> 2 disks - RAID 1 for content repo1
>>>>>>>> 2 disks - RAID 1 for content repo2
>>>>>>>> 2 disks - RAID 1 for content repo3
>>>>>>>> 2 disks - RAID 1 for content repo4
>>>>>>>> 2 disks - RAID 1 for content repo5
>>>>>>>> 2 disks - RAID 1 for content repo6
>>>>>>>> 2 disks - RAID 1 for content repo7
>>>>>>>> 2 disks - RAID 1 for content repo8
>>>>>>>> 2 disks - RAID 1 for content repo9
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Scenario 2:
>>>>>>>> 24 disks in total
>>>>>>>> 2 disks - RAID 1 for OS and flowfile repo
>>>>>>>> 4 disks - RAID 10 for provenance repo1
>>>>>>>> 18 disks - RAID 10 for content repo1
>>>>>>>> 
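>>>>>>>> For reference, I assume scenario 1 would map to additional entries in
>>>>>>>> conf/nifi.properties along these lines (repo names and mount points are just examples):
>>>>>>>> 
>>>>>>>>     nifi.content.repository.directory.repo1=/mnt/content1/content_repository
>>>>>>>>     nifi.content.repository.directory.repo2=/mnt/content2/content_repository
>>>>>>>>     # ... and so on up to repo9
>>>>>>>>     nifi.provenance.repository.directory.repo1=/mnt/prov1/provenance_repository
>>>>>>>>     nifi.provenance.repository.directory.repo2=/mnt/prov2/provenance_repository
>>>>>>>> 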
>>>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for NiFi?
>>>>>>>> Thank you very much.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Ali
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> A.Nazemian
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> A.Nazemian
>>> 
>>> 
>>> 
>>> -- 
>>> A.Nazemian
>> 
> 
