drill-dev mailing list archives

From paul-rogers <...@git.apache.org>
Subject [GitHub] drill pull request #574: DRILL-4726: Dynamic UDFs support
Date Mon, 26 Sep 2016 21:53:37 GMT
Github user paul-rogers commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
    @@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, Path path, URL[] urls) throws I
             return RunTimeScan.dynamicPackageScan(drillConfig, Sets.newHashSet(urls));
    -    throw new FunctionValidationException(String.format("Marker file %s is missing in
    +    throw new JarValidationException(String.format("Marker file %s is missing in %s",
             CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, path.getName()));
    -  private static String getUdfDir() {
    -    return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), "DRILL_UDF_DIR variable is not set");
    +  /**
    +   * Return list of jars that are missing in local function registry
    +   * but present in remote function registry.
    +   *
    +   * @param remoteFunctionRegistry remote function registry
    +   * @param localFunctionRegistry local function registry
    +   * @return list of missing jars
    +   */
    +  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
    +                                      LocalFunctionRegistry localFunctionRegistry) {
    +    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
    +    List<String> localJars = localFunctionRegistry.getAllJarNames();
    +    List<String> missingJars = Lists.newArrayList();
    +    for (Jar jar : remoteJars) {
    +      if (!localJars.contains(jar.getName())) {
    +        missingJars.add(jar.getName());
    +      }
    +    }
    +    return missingJars;
    +  }
    +  /**
    +   * Creates local udf directory, if it doesn't exist.
    +   * Checks if local is a directory and if current application has write rights on it.
    +   * Attempts to clean up local idf directory in case jars were left after previous drillbit
    +   *
    +   * @return path to local udf directory
    +   */
    +  private Path getLocalUdfDir() {
    +    String confDir = getConfDir();
    --- End diff --
    Unfortunately, this won't work in the case of Drill-on-YARN (DoY). The $DRILL_HOME and $DRILL_CONF_DIR
directories are read-only in that case.
    The new site directory (pointed to by $DRILL_CONF_DIR) will contain a "jars" directory
that holds statically defined UDFs. In Drill-on-YARN, YARN copies the entire site directory
to the local machine but makes it read-only so that YARN can reuse the same "localized"
copy for multiple runs. (That feature is handy for MapReduce, but is not that useful for Drill.
Still, that's how YARN works...)
    One solution: provide a config option that specifies the local UDF location. The Apache
Drill default can be the config dir (assuming there is a way to reference the config dir from
within drill-override.conf; I need to check that). For DoY, we will change the location to
a temp directory provided by YARN.
    Using the YARN temp directory ensures that the local UDF dir starts out empty on each
run. But what about the "stock" Drill case? The $DRILL_CONF_DIR/udf directory will probably
contain jars from a previous run. Is this desired? Does the code handle this case? Do
we clean out UDFs that were dropped while the Drillbit was offline? Do we handle a partially downloaded
jar that was left incomplete when the previous run crashed?
    Or would it be better to clear the UDF directory at the start of each Drill run? If we
do that, can we always write UDFs to a temp directory? Perhaps review the temp directories
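    To make the "clear on start" option concrete, here is a minimal sketch and not a proposal for the actual implementation. The class and method names are hypothetical and do not exist in Drill. It assumes the UDF directory holds only plain files (jars and any partially downloaded leftovers), creates the directory if it is missing, and deletes whatever files a previous run left behind.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Drill. */
final class UdfDirCleanupSketch {

  private UdfDirCleanupSketch() {}

  /**
   * Creates the local UDF directory if needed, checks that it is writable,
   * and removes regular files left over from a previous run.
   * Subdirectories are left untouched (assumption: jars are stored flat).
   *
   * @return number of files deleted
   */
  static int prepareEmptyUdfDir(Path udfDir) throws IOException {
    // Fails with FileAlreadyExistsException if udfDir exists but is not a directory.
    Files.createDirectories(udfDir);
    if (!Files.isWritable(udfDir)) {
      throw new IOException("Local UDF directory is not writable: " + udfDir);
    }

    // Collect entries first so that deletion does not race with the listing.
    List<Path> leftovers = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(udfDir)) {
      for (Path entry : entries) {
        if (Files.isRegularFile(entry)) {
          leftovers.add(entry);
        }
      }
    }
    for (Path leftover : leftovers) {
      Files.delete(leftover);
    }
    return leftovers.size();
  }
}
```

    Clearing unconditionally sidesteps the questions about dropped UDFs and half-written jars, at the cost of downloading every jar again after each restart.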
    Since DoY defines the temp directory at runtime, we need to set the temp directory in drill-config.sh
(which you did in a previous version). As it turns out, Drill already has temp directories
set in the config system (for spill-to-disk), so we need to reconcile the two.
    Perhaps this:
    Define DRILL_TEMP_DIR in drill-config.sh. If it is set in the environment (the DoY case)
or in drill-env.sh (the non-DoY case), use it. Otherwise, default to /tmp.
    Under DoY, we can run multiple Drillbits on the same host (by changing ports, etc.), so
we need a unique path. Define the actual Drillbit temp directory to be
    drillbit-temp-dir = $DRILL_TEMP_DIR/${drill-root}-${cluster-id}
    We need both the root and cluster ID because neither is unique by itself, unfortunately.
    Finally, udfs can reside in ${drillbit-temp-dir}/udf
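    The path rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the environment is passed in as a map so the logic can be exercised without touching the real process environment, and the root and cluster ID are treated as plain names without path separators.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Drill. */
final class UdfDirSketch {

  private UdfDirSketch() {}

  /** Base temp dir: DRILL_TEMP_DIR if set and non-empty, otherwise /tmp. */
  static Path baseTempDir(Map<String, String> env) {
    String value = env.get("DRILL_TEMP_DIR");
    return Paths.get(value == null || value.isEmpty() ? "/tmp" : value);
  }

  /** Per-Drillbit temp dir: $DRILL_TEMP_DIR/${drill-root}-${cluster-id}. */
  static Path drillbitTempDir(Map<String, String> env, String drillRoot, String clusterId) {
    // Both parts are needed because neither is unique on its own.
    return baseTempDir(env).resolve(drillRoot + "-" + clusterId);
  }

  /** Local UDF dir: ${drillbit-temp-dir}/udf. */
  static Path localUdfDir(Map<String, String> env, String drillRoot, String clusterId) {
    return drillbitTempDir(env, drillRoot, clusterId).resolve("udf");
  }
}
```

    A caller would pass System.getenv() as the map, for example UdfDirSketch.localUdfDir(System.getenv(), root, clusterId), where root and clusterId come from the Drill config.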
    This is just one possibility to illustrate the issue. Feel free to create a better solution.

