nifi-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NIFI-2417) Implement Query and Scroll processors for ElasticSearch
Date Wed, 21 Sep 2016 13:31:20 GMT

    [ https://issues.apache.org/jira/browse/NIFI-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509908#comment-15509908 ]

ASF GitHub Bot commented on NIFI-2417:
--------------------------------------

Github user mattyb149 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/733#discussion_r79832571
  
    --- Diff: nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/ScrollElasticsearchHttp.java ---
    @@ -0,0 +1,415 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nifi.processors.elasticsearch;
    +
    +import java.io.IOException;
    +import java.net.MalformedURLException;
    +import java.net.URL;
    +import java.util.ArrayList;
    +import java.util.Collections;
    +import java.util.HashMap;
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Set;
    +import java.util.concurrent.TimeUnit;
    +import java.util.regex.Pattern;
    +import java.util.stream.Collectors;
    +import java.util.stream.Stream;
    +
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.nifi.annotation.behavior.EventDriven;
    +import org.apache.nifi.annotation.behavior.InputRequirement;
    +import org.apache.nifi.annotation.behavior.Stateful;
    +import org.apache.nifi.annotation.behavior.SupportsBatching;
    +import org.apache.nifi.annotation.behavior.WritesAttribute;
    +import org.apache.nifi.annotation.behavior.WritesAttributes;
    +import org.apache.nifi.annotation.documentation.CapabilityDescription;
    +import org.apache.nifi.annotation.documentation.Tags;
    +import org.apache.nifi.annotation.lifecycle.OnScheduled;
    +import org.apache.nifi.components.PropertyDescriptor;
    +import org.apache.nifi.components.state.Scope;
    +import org.apache.nifi.components.state.StateManager;
    +import org.apache.nifi.components.state.StateMap;
    +import org.apache.nifi.flowfile.FlowFile;
    +import org.apache.nifi.logging.ComponentLog;
    +import org.apache.nifi.processor.ProcessContext;
    +import org.apache.nifi.processor.ProcessSession;
    +import org.apache.nifi.processor.Relationship;
    +import org.apache.nifi.processor.exception.ProcessException;
    +import org.apache.nifi.processor.util.StandardValidators;
    +import org.apache.nifi.stream.io.ByteArrayInputStream;
    +import org.codehaus.jackson.JsonNode;
    +
    +import okhttp3.HttpUrl;
    +import okhttp3.OkHttpClient;
    +import okhttp3.Response;
    +import okhttp3.ResponseBody;
    +
    +@InputRequirement(InputRequirement.Requirement.INPUT_FORBIDDEN)
    +@EventDriven
    +@SupportsBatching
    +@Tags({ "elasticsearch", "query", "scroll", "read", "get", "http" })
    +@CapabilityDescription("Scrolls through an Elasticsearch query using the specified connection properties. "
    +        + "This processor is intended to be run on the primary node, and is designed for scrolling through "
    +        + "huge result sets, as in the case of a reindex.  The state must be cleared before another query "
    +        + "can be run.  Each page of results is returned, wrapped in a JSON object like so: { \"hits\" : [ <doc1>, <doc2>, <docn> ] }.  "
    +        + "Note that the full body of each page of documents will be read into memory before being "
    +        + "written to a Flow File for transfer.")
    +@WritesAttributes({
    +        @WritesAttribute(attribute = "es.index", description = "The Elasticsearch index containing the document"),
    +        @WritesAttribute(attribute = "es.type", description = "The Elasticsearch document type") })
    +@Stateful(description = "After each successful scroll page, the latest scroll_id is persisted in scrollId as input for the next scroll call.  "
    +        + "Once the entire query is complete, finishedQuery state will be set to true, and the processor will not execute unless this is cleared.", scopes = { Scope.LOCAL })
    +public class ScrollElasticsearchHttp extends AbstractElasticsearchHttpProcessor {
    +
    +    private static final String FINISHED_QUERY_STATE = "finishedQuery";
    +    private static final String SCROLL_ID_STATE = "scrollId";
    +    private static final String FIELD_INCLUDE_QUERY_PARAM = "_source_include";
    +    private static final String QUERY_QUERY_PARAM = "q";
    +    private static final String SORT_QUERY_PARAM = "sort";
    +    private static final String SCROLL_QUERY_PARAM = "scroll";
    +    private static final String SCROLL_ID_QUERY_PARAM = "scroll_id";
    +    private static final String SIZE_QUERY_PARAM = "size";
    +
    +    public static final Relationship REL_SUCCESS = new Relationship.Builder()
    +            .name("success")
    +            .description(
    +                    "All FlowFiles that are read from Elasticsearch are routed to this relationship.")
    +            .build();
    +
    +    public static final Relationship REL_FAILURE = new Relationship.Builder()
    +            .name("failure")
    +            .description(
    +                    "All FlowFiles that cannot be read from Elasticsearch are routed to this relationship. Note that only incoming "
    +                            + "flow files will be routed to failure.").build();
    +
    +    public static final PropertyDescriptor QUERY = new PropertyDescriptor.Builder()
    +            .name("scroll-es-query").displayName("Query")
    +            .description("The Lucene-style query to run against ElasticSearch").required(true)
    --- End diff --
    
    It might be helpful to add a trivial example query to this description. Also, I couldn't use something like (username:tiger) to find usernames that contain the word tiger; I had to use (username:\*tiger\*). Again though, that might just be my ES setup. I only wanted to make you aware.
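
To make the difference between the two queries concrete, here is a minimal sketch of how each Lucene-style query string could be passed through the `q` URL parameter (`QUERY_QUERY_PARAM` in the diff). The class name, host, and `users` index are hypothetical, and the sketch uses only `java.net.URLEncoder` rather than the processor's OkHttp code. On an analyzed field, `username:tiger` typically matches only a whole token `tiger`, whereas `username:*tiger*` is a wildcard query that matches the substring. That may explain the behavior above, though the actual outcome depends on the index mapping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class LuceneQueryUrlSketch {

    // Builds a URI-search URL of the form <base>/<index>/_search?q=<encoded query>.
    // The base URL and index are supplied by the caller; nothing is sent over the network.
    static String buildSearchUrl(String baseUrl, String index, String luceneQuery) {
        try {
            return baseUrl + "/" + index + "/_search?q="
                    + URLEncoder.encode(luceneQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://localhost:9200"; // hypothetical local Elasticsearch node

        // Term query: matches documents whose username field contains the token "tiger".
        System.out.println(buildSearchUrl(base, "users", "username:tiger"));

        // Wildcard query: matches usernames that contain "tiger" anywhere in a token.
        System.out.println(buildSearchUrl(base, "users", "username:*tiger*"));
    }
}
```

`URLEncoder` percent-encodes the colon but leaves `*` unchanged, so the wildcard survives encoding. Leading wildcards can be slow on large indices, which might be worth mentioning in the property description as well.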


> Implement Query and Scroll processors for ElasticSearch
> -------------------------------------------------------
>
>                 Key: NIFI-2417
>                 URL: https://issues.apache.org/jira/browse/NIFI-2417
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>    Affects Versions: 1.0.0
>            Reporter: Joseph Gresock
>            Assignee: Joseph Gresock
>            Priority: Minor
>             Fix For: 1.1.0
>
>
> FetchElasticsearchHttp allows users to select a single document from Elasticsearch in NiFi, but there is no way to run a query to retrieve multiple documents.
> We should add a QueryElasticsearchHttp processor for running a query and returning a flow file per result, for small result sets.  This should allow both input and non-input execution.
> A separate ScrollElasticsearchHttp processor would also be useful for scrolling through a huge result set.  This should use the state manager to maintain the scroll_id value, and use this as input to the next scroll page.  As a result, this processor should not allow flow file input, but should retrieve one page per run.
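
The one-page-per-run behavior described in the issue can be sketched roughly as follows. This is not the processor's code: an in-memory `Map` stands in for NiFi's state manager, a list of pages stands in for Elasticsearch, and the scroll id is simply the next page index (real scroll ids are opaque strings). Only the state key names `scrollId` and `finishedQuery` are taken from the diff; everything else is an assumption made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScrollStateSketch {

    static final String SCROLL_ID_STATE = "scrollId";
    static final String FINISHED_QUERY_STATE = "finishedQuery";

    // Hypothetical stand-in for Elasticsearch: each inner list is one page of hits.
    private final List<List<String>> pages;
    // Hypothetical stand-in for the processor's persisted state.
    private final Map<String, String> state = new HashMap<>();

    ScrollStateSketch(List<List<String>> pages) {
        this.pages = pages;
    }

    // One "run": returns at most one page and records where the next run should resume.
    List<String> runOnce() {
        if (isFinished()) {
            return Collections.emptyList(); // nothing happens until the state is cleared
        }
        String scrollId = state.get(SCROLL_ID_STATE);
        int pageIndex = (scrollId == null) ? 0 : Integer.parseInt(scrollId);
        if (pageIndex >= pages.size()) {
            // An empty page signals that the query has been fully scrolled.
            state.put(FINISHED_QUERY_STATE, "true");
            return Collections.emptyList();
        }
        state.put(SCROLL_ID_STATE, String.valueOf(pageIndex + 1));
        return pages.get(pageIndex);
    }

    boolean isFinished() {
        return "true".equals(state.get(FINISHED_QUERY_STATE));
    }

    // Clearing the state allows the query to be run again from the first page.
    void clearState() {
        state.clear();
    }
}
```

Because each run reads its starting point from state, a restart between runs would resume at the next page rather than at the beginning, provided the state itself is persisted.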



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
