oodt-dev mailing list archives

From "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]" <chintu.mis...@nasa.gov>
Subject Re: OODT 0.3 branch
Date Tue, 11 Dec 2012 22:41:32 GMT
Answers inline below.

We will share information on apache.org at some point, but we are not there yet.

--
Chintu Mistry
NASA Goddard Space Flight Center
Bldg L40B, Room S776
Office: 240 684 0477
Mobile: 770 310 1047

From: <Mattmann>, Chris A <chris.a.mattmann@jpl.nasa.gov>
Date: Tuesday, December 11, 2012 5:23 PM
To: "Mistry, Chintu (GSFC-586.0)[COLUMBUS TECHNOLOGIES AND SERVICES INC]" <chintu.mistry@nasa.gov>,
"dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Re: OODT 0.3 branch

Hey Chintu,

Thanks for reaching out! Replies inline below:

From: <Mistry>, "Chintu [COLUMBUS TECHNOLOGIES AND SERVICES INC] (GSFC-586.0)" <chintu.mistry@nasa.gov>
Date: Tuesday, December 11, 2012 1:50 PM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Cc: jpluser <chris.a.mattmann@jpl.nasa.gov>
Subject: OODT 0.3 branch

Hi Chris,

We are trying to measure how fast filemanager+crawler performs.

Here is what we are trying to do:

 *   Total data to process : 262GB
 *   3 file managers and 9 crawlers
 *   Each group of 3 crawlers sends file locations to one file manager, which processes the files
 *   We have our own schema running on postgresql database
 *   Custom H5 Extractor using h5dump utility

Cool this sounds like an awesome test. Would you be willing to share some of the info on the
OODT wiki?

https://cwiki.apache.org/confluence/display/OODT/Home

Questions:
1) I have tried using FileUtils.copyFile vs FileUtils.moveFile, but I don't see any difference
in processing time. Both my LandingZone and Archive Area are located on the same filesystem (GPFS).
It takes roughly 100 minutes to process 262GB of data. Can you shed any light on why we
don't see any performance change?

This may have to do with the way that the JDK (what version are you using?) implements the
actual arraycopy methods, and how the apache commons-io library wraps those methods. Let me
know what JDK version you're using and we can investigate it.

- java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
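
A minimal sketch for timing the transfer step in isolation, so that copy and move can be compared on the same filesystem without metadata extraction or catalog inserts in the way. It is an illustration only: it uses the JDK's `java.nio.file.Files` (Java 7 or later, so newer than the 1.6 JDK reported above) rather than the commons-io `FileUtils` calls discussed in the thread, and the class name, file names, and 64 MB test size are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyVsMove {

    /** Copies src to dst and returns the elapsed time in nanoseconds. */
    static long timeCopy(Path src, Path dst) throws IOException {
        long start = System.nanoTime();
        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
        return System.nanoTime() - start;
    }

    /** Moves src to dst and returns the elapsed time in nanoseconds. */
    static long timeMove(Path src, Path dst) throws IOException {
        long start = System.nanoTime();
        Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a landing zone and archive on one filesystem.
        Path dir = Files.createTempDirectory("copy-vs-move");
        Path src = dir.resolve("granule.h5");
        Files.write(src, new byte[64 * 1024 * 1024]); // 64 MB of zeros as test data

        long copyNanos = timeCopy(src, dir.resolve("copied.h5"));
        long moveNanos = timeMove(src, dir.resolve("moved.h5"));

        System.out.printf("copy: %.1f ms, move: %.1f ms%n",
                copyNanos / 1e6, moveNanos / 1e6);
    }
}
```

If the two timings differ clearly here while the end-to-end ingest time does not, one possible reading is that the transfer is not the dominant cost of the ingest; that is an inference to verify, not a conclusion from the thread.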

2) I also don't see any performance gain between running 2 FMs and 3 FMs.
I thought that I would see some gain due to concurrency. The same goes for multiple
crawlers: I was hoping to see an obvious performance change if I increased the number of crawlers.
What are your thoughts on running things in parallel to increase performance?

How are you situating the additional file managers? Are you having 1 crawler ingest to 3?
Or is there a 1:1 correspondence between each crawler and FM? And, what do you mean by no
performance gain? Do you mean that you don't see 3x speed in terms of e.g. Product ingestion
of met into the catalog? Of file transfer speed?

- All 3 FMs are running on one machine. Each crawler instance is crawling a different directory.
3 crawlers are connected to the first FM, another 3 to the second FM, and the last 3
to the third FM. When I say there is no performance difference between 2 and 3 FMs, I mean they
take exactly the same amount of time to process the same amount of data concurrently. I would
love to see a 3x speedup if I run 3 FMs. I was talking about the whole ingest process from start
to end for one file, which involves extracting metadata, inserting records into the database, and
transferring the file to the archive location.
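
Since the ingest described above has three stages (metadata extraction, database insert, file transfer), a per-stage timer can show which one dominates before adding more FMs or crawlers. The following is a minimal sketch, not OODT code: `IngestStageTimer`, the stage names, and the sleeping placeholders standing in for real work are all hypothetical, and the class is not thread-safe, so it assumes one timer per crawler thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

public class IngestStageTimer {

    // Accumulated nanoseconds per stage name, in first-seen order.
    private final Map<String, Long> totals = new LinkedHashMap<String, Long>();

    /** Runs the given work, adds its elapsed time to the stage total, and returns its result. */
    <T> T time(String stage, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            long elapsed = System.nanoTime() - start;
            Long previous = totals.get(stage);
            totals.put(stage, previous == null ? elapsed : previous + elapsed);
        }
    }

    /** Returns the accumulated nanoseconds per stage. */
    Map<String, Long> totals() {
        return totals;
    }

    public static void main(String[] args) throws Exception {
        IngestStageTimer timer = new IngestStageTimer();

        // Placeholders only: replace each body with the real stage being measured.
        timer.time("extract", () -> { Thread.sleep(20); return null; });
        timer.time("catalog", () -> { Thread.sleep(5); return null; });
        timer.time("transfer", () -> { Thread.sleep(10); return null; });

        for (Map.Entry<String, Long> entry : timer.totals().entrySet()) {
            System.out.printf("%s: %.1f ms%n", entry.getKey(), entry.getValue() / 1e6);
        }
    }
}
```

If one stage accounts for most of the time (for example a shared database or an external extractor process), adding FMs or crawlers that all wait on that stage would not be expected to shorten the run.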

Are the 3 crawlers crawling the same staging area concurrently? Or are they separated out
by buckets? And, which crawler are you using? The MetExtractorProductCrawler or the AutoDetectCrawler?
Also, what is the versioning policy for the FM on a per product basis? Are all products being
ingested of the same ProductType and ultimately of the same versioner and ultimate disk location?

- We are using StdProductCrawler. We don't have a versioning requirement. Products are of different
ProductTypes. We are trying to process one orbit's worth of data. They all get archived at the "ARCHIVE_BASE/{ProductType}/YYYYMMDD"
location.

3) Like I said earlier, we are running the crawler to push data to the file manager. If I run it that
way, then the "data transfer (copy or move)" happens on the crawler side. I cannot find any
way to let the file manager handle the "data transfer" using one of your runtime options. Please let
me know if you guys know how to do that.

If you want the FM to handle the transfer you have to use the low level File Manager Client
and omit the clientTransfer option:

[chipotle:local/filemgr/bin] mattmann% ./filemgr-client
filemgr-client --url <url to xml rpc service> --operation [<operation> [params]]
operations:
--addProductType --typeName <name> --typeDesc <description> --repository <path>
--versionClass <classname of versioning impl>
--ingestProduct --productName <name> --productStructure <Hierarchical|Flat> --productTypeName
<name of product type> --metadataFile <file> [--clientTransfer --dataTransfer
<java class name of data transfer factory>] --refs <ref1>...<refn>
--hasProduct --productName <name>
--getProductTypeByName --productTypeName <name>
--getNumProducts --productTypeName <name>
--getFirstPage --productTypeName <name>
--getNextPage --productTypeName <name> --currentPageNum <number>
--getPrevPage --productTypeName <name> --currentPageNum <number>
--getLastPage --productTypeName <name>
--getCurrentTransfer
--getCurrentTransfers
--getProductPctTransferred --productId <id> --productTypeName <name>
--getFilePctTransferred --origRef <uri>

[chipotle:local/filemgr/bin] mattmann%

That is just a CMD line exposure of the underlying FM client Java API which lets you do server
side transfers on ingest by passing clientTransfer == false to this method:

http://oodt.apache.org/components/maven/xref/org/apache/oodt/cas/filemgr/system/XmlRpcFileManagerClient.html#1168

- Fair enough. I was hoping to see a command-line option for this in the Crawler Launcher. No problem.

We have enough processing power to run multiple FMs and crawlers for scalability. But for some
reason the crawler is not scaling enough.


We'll get it scaling out for ya. Can you please provide answers to the above questions and
we'll go from there? Thanks!

Thanks!

Cheers,
Chris




