oodt-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: OODT 0.3 branch
Date Fri, 14 Dec 2012 06:33:27 GMT
Thanks Cam, for the use cases, and insight.

Cheers,
Chris

On 12/13/12 9:03 PM, "Cameron Goodale" <goodale@apache.org> wrote:

>Chintu,
>
>I see that your test data volume is 262GB, but I am curious about the makeup of the data. On average, what is your file size, and how many files are there?
>
>The reason I ask is that the performance of extraction and ingestion can vary wildly. On the LMMP project I was ingesting 12GB DEMs over NFS, and it was a slow process. It was basically serial with 1CR+1FM, but we didn't have a requirement to push large volumes of data.
>
>On our recent Snow Data System I am processing 160 workflow jobs in parallel, and OODT could handle the load; it turned out the filesystem was our major bottleneck. We used a SAN initially during development, but when we increased the number of parallel jobs the I/O became so bad that we moved to GlusterFS. GlusterFS was faster than the SAN, but we had to be careful about heavy writing, moving, and deleting, since the clustering would try to replicate the data. It turns out Gluster is great for heavy writing OR heavy reading, but not both at the same time. Finally we moved to NAS, and it works great.
>
>My point is that the file system plays a major role in performance when ingesting data. The ultimate speed test would be to write the data directly into the final archive directory and do an ingestion in place (skipping data transfer entirely), but I know that is rarely possible.
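>
>(Side note: if your OODT build ships the in-place transferer, that test is easy to run. The factory class and flag name here are from memory, so verify them against your version:
>
>  ./crawler_launcher ... --clientTransferer \
>      org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
>
>which registers products where they already sit instead of copying bytes.)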
>
>This is an interesting challenge, to see which configuration will yield the best throughput/performance. I look forward to hearing more about your progress on this.
>
>
>Best Regards,
>
>
>
>Cameron
>
>
>On Wed, Dec 12, 2012 at 7:28 PM, Mattmann, Chris A (388J) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Hi Chintu,
>>
>> From: "Mistry, Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <chintu.mistry@nasa.gov>
>> Date: Wednesday, December 12, 2012 12:02 PM
>> To: jpluser <chris.a.mattmann@jpl.nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: Re: OODT 0.3 branch
>>
>> If you are saying that FM can handle multiple connections at one time,
>>
>> Yep, I'm saying that it can.
>>
>> then multiple crawlers pointing to the same FM should increase performance significantly.
>>
>> Well, that really depends, to be honest. It sounds like you guys are potentially hitting an I/O bottleneck in data transfer? What file sizes are you transferring? If you are I/O bound on the data transfer part, the product isn't fully ingested until:
>>
>>
>>   1.  Its entry is added to the catalog
>>   2.  The data transfer finishes
>>
>> Are you checking the FM for status along the way? Also realize that the FM will never be faster than the file system: if it takes the file system X minutes to transfer file F1, Y to transfer F2, and Z to transfer F3, then you still have to wait at least max(X, Y, Z), regardless, for the 3 ingestions to complete.
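>>
>> If you want a quick way to watch that, here's a minimal sketch against the FM client (method and constant names as I remember them from the 0.3 line; double-check yours):
>>
>>   import java.net.URL;
>>   import org.apache.oodt.cas.filemgr.structs.Product;
>>   import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
>>
>>   public class IngestStatusCheck {
>>     public static void main(String[] args) throws Exception {
>>       XmlRpcFileManagerClient fm =
>>           new XmlRpcFileManagerClient(new URL("http://localhost:9000"));
>>       // Look the product up by name and report its transfer status.
>>       // STATUS_TRANSFER means bytes are still moving; STATUS_RECEIVED
>>       // means the ingest (catalog entry + data transfer) is complete.
>>       Product p = fm.getProductByName(args[0]);
>>       System.out.println(args[0] + ": " + p.getTransferStatus());
>>     }
>>   }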
>>
>> But that's not what we saw in our tests.
>>
>> For example, I saw barely a 2-minute difference between the 2FM-6CR and 3FM-6CR runs:
>>
>> 1) 2 hours  6 minutes to process 262G   (1FM  3CR - 3CR to 1FM)
>> 2) 1 hour  58 minutes to process 262G   (1FM  6CR - 6CR to 1FM)
>> 3) 1 hour  39 minutes to process 262G   (2FM  6CR - 3CR to 1FM)
>> 4) 1 hour  39 minutes to process 262G   (2FM  9CR - 4+CR to 1FM)
>> 5) 1 hour  37 minutes to process 262G   (3FM  9CR - 3CR to 1FM)
>> 6) 2 hours            to process 262G   (3FM 20CR - 6+CR to 1FM)
>> 7)         28 minutes to process 262G   (6FM  9CR - 1+CR to 1FM)  => This is my latest test, and this is a good number.
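>>
>> (For scale, taking 262 GB as the total: the 28-minute run works out to roughly 160 MB/s aggregate, while the ~2-hour runs are around 35-40 MB/s, so the 6FM-9CR layout is moving bytes about 4x faster.)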
>>
>> What would be interesting is simply comparing how long it takes to cp the files (which I bet is what's happening) versus mv'ing them by hand. If mv is faster, I'd:
>>
>>
>>   1.  Implement a DataTransfer implementation that simply replaces the calls to FileUtils.copyFile or .moveFile with system calls (see ExecHelper from oodt-commons) to the UNIX equivalents (see the sketch below).
>>   2.  Plug that data transfer into your crawler invocations via the cmd line.
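>>
>> Something like this, as a rough sketch of step 1. To be clear about assumptions: the class names are made up, the interface shape assumes the 0.3-era DataTransfer API (newer versions also require retrieveProduct), and Runtime.exec stands in where ExecHelper from oodt-commons would also work:
>>
>>   import java.io.File;
>>   import java.io.IOException;
>>   import java.net.URI;
>>   import java.net.URISyntaxException;
>>   import java.net.URL;
>>   import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
>>   import org.apache.oodt.cas.filemgr.datatransfer.DataTransferFactory;
>>   import org.apache.oodt.cas.filemgr.structs.Product;
>>   import org.apache.oodt.cas.filemgr.structs.Reference;
>>   import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;
>>
>>   public class SysMoveDataTransferer implements DataTransfer {
>>
>>     public void setFileManagerUrl(URL url) {
>>       // Unused here: a local mv never needs to call back to the FM.
>>     }
>>
>>     public void transferProduct(Product product)
>>         throws DataTransferException, IOException {
>>       for (Object o : product.getProductReferences()) {
>>         Reference r = (Reference) o;
>>         try {
>>           // Orig/data store references are URI strings, e.g. file:/staging/f1
>>           File src = new File(new URI(r.getOrigReference()));
>>           File dest = new File(new URI(r.getDataStoreReference()));
>>           dest.getParentFile().mkdirs();
>>           // Fork a UNIX mv in place of FileUtils.copyFile.
>>           Process mv = Runtime.getRuntime().exec(new String[] {
>>               "mv", src.getAbsolutePath(), dest.getAbsolutePath() });
>>           if (mv.waitFor() != 0) {
>>             throw new DataTransferException("mv failed for " + src);
>>           }
>>         } catch (URISyntaxException e) {
>>           throw new DataTransferException(e.getMessage());
>>         } catch (InterruptedException e) {
>>           throw new DataTransferException(e.getMessage());
>>         }
>>       }
>>     }
>>   }
>>
>>   // In its own file: the factory you point the crawler at.
>>   public class SysMoveDataTransfererFactory implements DataTransferFactory {
>>     public DataTransfer createDataTransfer() {
>>       return new SysMoveDataTransferer();
>>     }
>>   }
>>
>> Then for step 2, pass the factory class name to the crawler the same way you'd pass LocalDataTransferFactory (I believe via --clientTransferer, but check crawler_launcher --help on your version).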
>>
>> HTH!
>>
>> Cheers,
>> Chris
>>
>>
>> From: "Mattmann, Chris A" <chris.a.mattmann@jpl.nasa.gov>
>> Date: Wednesday, December 12, 2012 2:51 PM
>> To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]" <chintu.mistry@nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: Re: OODT 0.3 branch
>>
>> Hey Chintu,
>>
>> From: "Mistry, Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <chintu.mistry@nasa.gov>
>> Date: Tuesday, December 11, 2012 2:41 PM
>> To: jpluser <chris.a.mattmann@jpl.nasa.gov>, "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: Re: OODT 0.3 branch
>>
>> Answers inline below.
>>
>> ---snip
>>
>> Gotcha, so you are using different product types. So each crawler is crawling various product types in one of the staging area dirs, which looks like, e.g.:
>>
>> /STAGING_AREA_BASE
>>   /dir1 - 1st crawler
>>    - file1 of product type 1
>>    - file2 of product type 3
>>
>>  /dir2 - 2nd crawler
>>    - file3 of product type 3
>>
>>  /dir3 - 3rd crawler
>>    - file4 of product type 2
>>
>> Is that what the staging area looks like? - YES
>>
>> And then your FM is ingesting all 3 product types (I just picked 3 arbitrarily; it could have been N) into:
>>
>> ARCHIVE_BASE/{ProductTypeName}/{YYYYMMDD}
>>
>> Correct?  - YES
>>
>> If so, I would imagine that FM1, FM2, and FM3 would actually speed up the ingestion process compared to just using 1 FM with 1, 2, or 3 crawlers all talking to it.
>>
>> Let me ask a few more questions:
>>
>> Do you see, e.g., in the above example, that file4 is ingested before file2? What about file3 before file2? If not, there is something wiggy going on.
>>        - I have not checked that. I guess I can check that. Can FM handle multiple connections at the same time?
>>
>>
>> Yep, FM can handle multiple connections at one time, up to a limit (I think hard-defaulted to ~100-200 by the underlying XML-RPC 2.1 library). We're currently using an old library, but we have a goal to upgrade to the latest version, where I think this number is configurable.
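>>
>> If you want to poke at that limit yourself, here's a throwaway sketch (it assumes the client has an isAlive() method; check your version):
>>
>>   import java.net.URL;
>>   import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
>>
>>   public class ConnectionStress {
>>     public static void main(String[] args) throws Exception {
>>       final URL fmUrl = new URL("http://localhost:9000");
>>       for (int i = 0; i < 50; i++) {  // 50 concurrent connections
>>         new Thread(new Runnable() {
>>           public void run() {
>>             try {
>>               // Each client opens its own XML-RPC connection to the FM.
>>               System.out.println(new XmlRpcFileManagerClient(fmUrl).isAlive());
>>             } catch (Exception e) {
>>               e.printStackTrace();
>>             }
>>           }
>>         }).start();
>>       }
>>     }
>>   }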
>>
>> Cheers,
>> Chris
>>
>>

