mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bernd Mathiske" <be...@mesosphere.io>
Subject Re: Review Request 30774: Fetcher Cache
Date Thu, 19 Mar 2015 06:04:33 GMT


> On March 18, 2015, 11:05 a.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 406
> > <https://reviews.apache.org/r/30774/diff/37/?file=897704#file897704line406>
> >
> >     Do we call fetch even if we don't have anything to fetch? I think it will be
a good idea to have a fast return if there is nothing to be fetched.

There is a check for this in Fetcher::fetch(). No need to even dispatch the call to the process
either if there is nothing to fetch.


> On March 18, 2015, 11:05 a.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 491
> > <https://reviews.apache.org/r/30774/diff/37/?file=897704#file897704line491>
> >
> >     Why not just mock _fetch and do a barrier on it by giving it a promise in test?

"just mock _fetch" is more work and harder to understand.

It would also function, but then you would need to touch test code every time you change _fetch().
Furthermore, it would not be as clear why we wait for this particular call.


> On March 18, 2015, 11:05 a.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 503
> > <https://reviews.apache.org/r/30774/diff/37/?file=897704#file897704line503>
> >
> >     Since this is only called in one place, how about put this in ___fetch, pass
it the future and check if it failed log it there?

How would this be simpler and more readable?

What is wrong with abstracting functions that are called only once? Doing so saves a comment
/ pulls what would have been a comment into code!


> On March 18, 2015, 11:05 a.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 518
> > <https://reviews.apache.org/r/30774/diff/37/?file=897704#file897704line518>
> >
> >     In what scenario should a cache entry not exist?
> >     If it doesn't somehow we won't be able to use it too?

As you can see at the call sites, this method is used in scenarios where fetching succeeded,
where it failed, and incidentally where it left a partial download lying around. I added this
comment:

  // We may or may not have started downloading. The download may or may
  // not have been partial. In any case, clean up whatever is there.
  
If there is no file, that's fine. Then we tried fetching and failed before starting to write
the file. 

In any case, we remove the cache entry and the space amount it had reserved/claimed is released
for later use.


> On March 18, 2015, 11:05 a.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 521
> > <https://reviews.apache.org/r/30774/diff/37/?file=897704#file897704line521>
> >
> >     Feel like this can be in a infintie loop, where if we can expire one item then
forever other fetch items will get stuck?
> >     I wonder if we should have some remedy action, or simply crash too?

This is not a loop, because the cache entry gets removed BEFORE we attempt to delete the file.
See line 500 just above.

However, just in case future changed code were ever to call this method several times on the
same entry, I added a line that sets the entry's size field to zero. This way, accounted cache
space is only released once.


> On March 18, 2015, 11:05 a.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 726
> > <https://reviews.apache.org/r/30774/diff/37/?file=897704#file897704line726>
> >
> >     Why ignore error?

The code that follows this line as of line 712 handles the error case.


- Bernd


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30774/#review76902
-----------------------------------------------------------


On March 17, 2015, 6:59 a.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30774/
> -----------------------------------------------------------
> 
> (Updated March 17, 2015, 6:59 a.m.)
> 
> 
> Review request for mesos, Adam B, Benjamin Hindman, Till Toenshoff, and Timothy Chen.
> 
> 
> Bugs: MESOS-2057, MESOS-2069, MESOS-2070, MESOS-2072, MESOS-2073, and MESOS-2074
>     https://issues.apache.org/jira/browse/MESOS-2057
>     https://issues.apache.org/jira/browse/MESOS-2069
>     https://issues.apache.org/jira/browse/MESOS-2070
>     https://issues.apache.org/jira/browse/MESOS-2072
>     https://issues.apache.org/jira/browse/MESOS-2073
>     https://issues.apache.org/jira/browse/MESOS-2074
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Almost all of the functionality in epic MESOS-336. Downloaded files from CommandInfo::URIs
can now be cached in a cache directory designated by a slave flag. This only happens when
asked for by an extra flag in the URI and is thus backwards-compatible. The cache has a size
limit also given by a new slave flag. Cache-resident files are evicted as necessary to make
space for newly fetched ones. Concurrent attempts to cache the same URI leads to only one
download. The fetcher program remains external for safety reasons, but is now augmented with
more elaborate parameters packed into a JSON object to implement specific fetch actions for
all of the above. Additional testing includes fetching from (mock) HDFS and coverage of the
new features.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 7119b1421ac1506fa118e9f91d07e027dec3d92e 
>   docs/fetcher-cache-internals.md PRE-CREATION 
>   docs/fetcher.md PRE-CREATION 
>   include/mesos/fetcher/fetcher.proto 311af9aebc6a85dadba9dbeffcf7036b70896bcc 
>   include/mesos/mesos.proto ec8efaec13f54a56d82411f6cdbdb8ad8b103748 
>   src/Makefile.am 7a06c7028eca8164b1f5fdea6a7ecd37ee6826bb 
>   src/hdfs/hdfs.hpp 968545d9af896f3e72e156484cc58135405cef6b 
>   src/launcher/fetcher.cpp 796526f59c25898ef6db2b828b0e2bb7b172ba25 
>   src/slave/constants.hpp fd1c1aba0aa62372ab399bee5709ce81b8e92cec 
>   src/slave/containerizer/docker.hpp b7bf54ac65d6c61622e485ac253513eaac2e4f88 
>   src/slave/containerizer/docker.cpp 5f4b4ce49a9523e4743e5c79da4050e6f9e29ed7 
>   src/slave/containerizer/fetcher.hpp 1db0eaf002c8d0eaf4e0391858e61e0912b35829 
>   src/slave/containerizer/fetcher.cpp 9e9e9d0eb6b0801d53dec3baea32a4cd4acdd5e2 
>   src/slave/containerizer/mesos/containerizer.hpp ae61a0fcd19f2ba808624312401f020121baf5d4

>   src/slave/containerizer/mesos/containerizer.cpp fbd1c0a0e5f4f227adb022f0baaa6d2c7e3ad748

>   src/slave/flags.hpp dbaf5f532d0bc65a6d16856b8ffcc2c06a98f1fa 
>   src/slave/slave.cpp 0f99e4efb8fa2b96f120a3e49191158ca0364c06 
>   src/tests/docker_containerizer_tests.cpp 06cd3d89ecbaaac17ae6970604b21fbe29f6e887 
>   src/tests/fetcher_cache_tests.cpp PRE-CREATION 
>   src/tests/fetcher_tests.cpp 4549e6a631e2c17cec3766efaa556593eeac9a1e 
>   src/tests/mesos.hpp 45e35204d1aa876fa0c871acf0f21afcd5ababe8 
>   src/tests/mesos.cpp c8f43d21b214e75eaac2870cbdf4f03fd18707d1 
> 
> Diff: https://reviews.apache.org/r/30774/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> --- longer Description: ---
> 
> -Replaces all other reviews for the fetcher cache except those related to stout: 30006,
30033, 30034, 30036, 30037, 30039, 30124, 30173, 30614, 30616, 30618, 30621, 30626. See descriptions
of those. In dependency order:
> 
> 30033: Removes the fetcher env tests since these won't be needed any more when the fetcher
uses JSON in a single env var as a parameter. They never tested anything that won't be covered
by other tests anyway.
> 
> 30034: Makes the code structure of all fetcher tests the same. Instead of calling the
run method of the fetcher directly, calling through fetch(). Also removes all uses of I/O
redirection, which is not really needed for debugging, and thus the next patch can refactor
fetch() and run(). (The latter comes in two varieties, which complicates matters without much
benefit.)
> 
> 30036: Extends the CommandInfo::URI protobuf with a boolean "caching" field that will
later cause fetcher cache actions. Also introduces the notion of a cache directory to the
fetcher info protobuf. And then propagates these additions throughout the rest of the code
base where applicable. This includes passing the slave ID all the way down to the place where
the cache dir name is constructed.
> 
> 30037: Extends the fetcher info protobuf with "actions" (fetch directly bypassing the
cache, fetch through the cache, retrieve from the cache). Switches the basis for dealing with
uris to "items", which contain the uri, the action, and potentially a cache file name. Refactors
fetch() and run(), so there is only one of each. Introduces about half of the actual cache
logic, including a hashmap of cache file objects for bookkeeping and basic operations on it.

> 
> 30039: Enables fetcher cache actions in the mesos fetcher program.
> 
> 30006: Enables concurrent downloading into the fetcher cache. Reuse of download results
in the cache when multiple fetcher runs occur concurrently. 
> 
> 30614: This is to ensure that all this refactoring of fetcher code has not broken HDFS
fetching. Adds a test that exercises the C++ code paths in Mesos and mesos-fetcher related
to fetching from HDFS. Uses a mock HDFS client written in bash that acts just like a real
"hadoop" command if used in the right limited way.
> 
> 30124: Inserted fetcher cache zap upon slave startup, recovery and shutdown. This implements
recovery in an acceptable, yet most simple way.
> 
> 30173: Created fetcher cache tests. Adds a new test source file containing a test fixture
and tests to find out if the fetcher cache works with a variety of settings.
> 
> 30616: Adds hdfs::du() which calls "hadoop fs -du -h" and returns a string that contains
the file size for the URI passed as argument. This is needed to determine the size of a file
on HDFS before downloading it to the fetcher cache (to ensure there is enough space).
> 
> 30621: Refactored URI type separation in mesos-fetcher. Moved the URI type separation
code (distinguishes http, hdfs, local copying, etc.) from mesos-fetcher to the fetcher process/actor,
since it is going to be reused by download size queries when we introduce fetcher cache management.
Also factored out URI validation, which will be used the same way by mesos-fetcher and the
fetcher process/actor.
> 
> 30626: Fetcher cache eviction. This happens when the cache does not have enough space
to accomodate upcoming downloads to the cache. Necessary provisions included here:
> - mesos-fetcher does not run until evictions have been successful
> - Cache space is reserved while (async) waiting for eviction to succeed. If it fails,
the reservation gets undone.
> - Reservations can be partly from available space, partly from evictions. All math included
:-)
> - To find out how much space is needed, downloading has a prelude in which we query the
download size from the URI. This works for all URI types that mesos-fetcher currently supports,
including http and hdfs.
> - Size-determination requests are now synchronized, too. Only one per URI in play happens.
> - There is cleanup code for all kinds of error situations. At the very end of the fetch
attempt, each list is processed for undoing things like space reservations and eviction disabling.
> - Eviction gets disabled for URIs that are currently in use, i.e. the related cache files
are. We use reference counting for this, since there may be concurrent fetch attempts using
the same cache files.
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message