nifi-users mailing list archives

From Austin Heyne <ahe...@ccri.com>
Subject Re: GetHDFS from Azure Blob
Date Wed, 29 Mar 2017 16:37:47 GMT
For the record,

The way we figured out to fix this was to create a new XML file for each
root-level container that we use (tentatively named fs.xml). The fs.xml looks
like the following:

<configuration>
     <property>
       <name>fs.defaultFS</name>
       <value>wasb://container@accountName.blob.core.windows.net/</value>
     </property>
</configuration>

We then include core-site.xml, hdfs-site.xml and fs.xml in the 'Hadoop
Configuration Resources' property, ensuring that fs.xml comes last. This
overrides the fs.defaultFS value set in core-site.xml.
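
For reference, a hypothetical value for that property might look like the
following (the paths are placeholders for wherever your copies live; the only
requirement is that fs.xml is listed last so its fs.defaultFS wins):

/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/opt/nifi/conf/fs.xml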

Thanks everyone for the help,
Austin

On 03/28/2017 06:11 PM, Austin Heyne wrote:
> Thanks Bryan,
>
> We're only working with one account here but with multiple root-level
> containers, e.g.
>
> wasb://csv@accountName.blob.core.windows.net/
> wasb://xml@accountName.blob.core.windows.net/
> wasb://json@accountName.blob.core.windows.net/
>
> The thing that stands out to me the most is why the defaultFS would
> need to be set at all if we're always providing complete wasb://...
> paths. It almost seems like a bug or oversight.
>
> If anyone has any input on how we could work around this, please let me
> know.
>
> Thanks for your help,
> Austin
>
> On 03/28/2017 04:39 PM, Bryan Bende wrote:
>> Austin,
>>
>> I think you are correct that it's <containername>@<accountname>; I
>> hadn't looked at this config in a long time and was reading too
>> quickly before :)
>>
>> That would line up with the other property
>> fs.azure.account.key.<accountname>.blob.core.windows.net where you
>> specify the key for that account.
>>
>> I have no idea if this will work, but let's say you had three different
>> WASB file systems, presumably each with its own account name and
>> key; you might be able to define these in core-site.xml:
>>
>>      <property>
>>        <name>fs.azure.account.key.ACCOUNT1.blob.core.windows.net</name>
>>        <value>KEY1</value>
>>      </property>
>>
>>      <property>
>>        <name>fs.azure.account.key.ACCOUNT2.blob.core.windows.net</name>
>>        <value>KEY2</value>
>>      </property>
>>
>>      <property>
>>        <name>fs.azure.account.key.ACCOUNT3.blob.core.windows.net</name>
>>        <value>KEY3</value>
>>      </property>
>>
>> Then in your HDFS processor in NiFi you point at this core-site.xml
>> and use a specific directory like
>> wasb://container@ACCOUNT3.blob.core.windows.net/<path> and I'm hoping
>> it would know how to use the key for ACCOUNT3.
>>
>> Not really sure if that helps your situation.
>>
>> -Bryan
>>
>>
>> On Tue, Mar 28, 2017 at 4:14 PM, Austin Heyne <aheyne@ccri.com> wrote:
>>> Bryan,
>>>
>>> So I initially didn't think much of it (assumed it was a typo, etc.), but
>>> you've said that the access URL for wasb that you've been using is
>>> wasb://YOUR_USER@YOUR_HOST/. However, this has never worked for us and I'm
>>> wondering if we have a different configuration somewhere. What we have to
>>> use is
>>> wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>,
>>> which seems to be in line with the Azure blob storage GUI and is what is
>>> outlined here [1a][1b]. Is there some other way this connector is being
>>> set up? It would make much more sense using your access pattern, since
>>> then each container wouldn't need to have its own core-site.xml.
>>>
>>> Thanks,
>>> Austin
>>>
>>> [1a]
>>> https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Accessing_wasb_URLs
>>> [1b]
>>> https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage
>>>
>>>
>>> On 03/28/2017 03:55 PM, Bryan Bende wrote:
>>>> Austin,
>>>>
>>>> I believe the default FS is only used when you write to a path that
>>>> doesn't specify the filesystem. Meaning, if you set the directory of
>>>> PutHDFS to /data then it will use the default FS, but if you specify
>>>> wasb://user@wasb2/data then it will go to /data in a different
>>>> filesystem.
>>>>
>>>> The problem here is that I don't see a way to specify different keys
>>>> for each WASB filesystem in the core-site.xml.
>>>>
>>>> Admittedly I have never tried to setup something like this with many
>>>> different filesystems.
>>>>
>>>> -Bryan
>>>>
>>>>
>>>> On Tue, Mar 28, 2017 at 3:50 PM, Austin Heyne <aheyne@ccri.com> wrote:
>>>>> Hi Andre,
>>>>>
>>>>> Yes, I'm aware of that configuration property; it's what I have been
>>>>> using to set the core-site.xml and hdfs-site.xml. For testing this I
>>>>> didn't modify the core-site located in HADOOP_CONF_DIR but rather copied
>>>>> and modified it and then pointed the processor to the copy. The problem
>>>>> with this is that we'll end up with a large number of core-site.xml
>>>>> copies that will all have to be maintained separately. Ideally we'd be
>>>>> able to specify the defaultFS in the processor config or have the
>>>>> processor behave like the hdfs command line tools. The command line
>>>>> tools don't require the defaultFS to be set to a wasb URL in order to
>>>>> use wasb URLs.
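>>>>>
>>>>> For example (a hypothetical command with placeholder container and
>>>>> account names), a listing like this works from the CLI without touching
>>>>> fs.defaultFS, as long as the account key is configured:
>>>>>
>>>>> hdfs dfs -ls wasb://mycontainer@myaccount.blob.core.windows.net/some/path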
>>>>>
>>>>> The key idea here is long-term maintainability and using Ambari to
>>>>> maintain the configuration. If we need to change any other setting in
>>>>> core-site.xml we'd have to change it in a bunch of different files
>>>>> manually.
>>>>>
>>>>> Thanks,
>>>>> Austin
>>>>>
>>>>>
>>>>> On 03/28/2017 03:34 PM, Andre wrote:
>>>>>
>>>>> Austin,
>>>>>
>>>>> Perhaps that wasn't explicit, but the settings don't need to be
>>>>> system-wide; instead, the defaultFS may be changed just for a particular
>>>>> processor, while other processors may use different configurations.
>>>>>
>>>>> The *HDFS processor documentation mentions it allows you to set
>>>>> particular Hadoop configurations:
>>>>>
>>>>> "A file or comma separated list of files which contains the Hadoop file
>>>>> system configuration. Without this, Hadoop will search the classpath for
>>>>> a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default
>>>>> configuration"
>>>>>
>>>>> Have you tried using this field to point to a file as described by 
>>>>> Bryan?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 29 Mar 2017 05:21, "Austin Heyne" <aheyne@ccri.com> wrote:
>>>>>
>>>>> Thanks Bryan,
>>>>>
>>>>> Working with the configuration you sent, what I needed to change was
>>>>> to set fs.defaultFS to the wasb URL that we're working from.
>>>>> Unfortunately this is a less-than-ideal solution since we'll be pulling
>>>>> files from multiple wasb URLs and ingesting them into an Accumulo
>>>>> datastore. Changing the defaultFS would, I'm pretty certain, mess with
>>>>> our local HDFS/Accumulo install. In addition, we're trying to maintain
>>>>> all of this configuration with Ambari, which from what I can tell only
>>>>> supports one core-site configuration file.
>>>>>
>>>>> Is the only solution here to maintain multiple core-site.xml files, or
>>>>> is there another way we can configure this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Austin
>>>>>
>>>>>
>>>>>
>>>>> On 03/28/2017 01:41 PM, Bryan Bende wrote:
>>>>>> Austin,
>>>>>>
>>>>>> Can you provide the full error message and stack trace for the
>>>>>> IllegalArgumentException from nifi-app.log?
>>>>>>
>>>>>> When you start the processor it creates a FileSystem instance 
>>>>>> based on
>>>>>> the config files provided to the processor, which in turn causes all
>>>>>> of the corresponding classes to load.
>>>>>>
>>>>>> I'm not that familiar with Azure, but if "Azure blob store" is WASB,
>>>>>> then I have successfully done the following...
>>>>>>
>>>>>> In core-site.xml:
>>>>>>
>>>>>> <configuration>
>>>>>>
>>>>>>      <property>
>>>>>>        <name>fs.defaultFS</name>
>>>>>>        <value>wasb://YOUR_USER@YOUR_HOST/</value>
>>>>>>      </property>
>>>>>>
>>>>>>      <property>
>>>>>>        <name>fs.azure.account.key.nifi.blob.core.windows.net</name>
>>>>>>        <value>YOUR_KEY</value>
>>>>>>      </property>
>>>>>>
>>>>>>      <property>
>>>>>>        <name>fs.AbstractFileSystem.wasb.impl</name>
>>>>>>        <value>org.apache.hadoop.fs.azure.Wasb</value>
>>>>>>      </property>
>>>>>>
>>>>>>      <property>
>>>>>>        <name>fs.wasb.impl</name>
>>>>>>        <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
>>>>>>      </property>
>>>>>>
>>>>>>      <property>
>>>>>>        <name>fs.azure.skip.metrics</name>
>>>>>>        <value>true</value>
>>>>>>      </property>
>>>>>>
>>>>>> </configuration>
>>>>>>
>>>>>> In the 'Additional Classpath Resources' property of an HDFS processor,
>>>>>> point to a directory with:
>>>>>>
>>>>>> azure-storage-2.0.0.jar
>>>>>> commons-codec-1.6.jar
>>>>>> commons-lang3-3.3.2.jar
>>>>>> commons-logging-1.1.1.jar
>>>>>> guava-11.0.2.jar
>>>>>> hadoop-azure-2.7.3.jar
>>>>>> httpclient-4.2.5.jar
>>>>>> httpcore-4.2.4.jar
>>>>>> jackson-core-2.2.3.jar
>>>>>> jsr305-1.3.9.jar
>>>>>> slf4j-api-1.7.5.jar
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 28, 2017 at 1:15 PM, Austin Heyne <aheyne@ccri.com> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Thanks for all the help you've given me so far. Today I'm trying to
>>>>>>> pull files from an Azure blob store. I've done some reading on this,
>>>>>>> and from previous tickets [1] and guides [2] it seems the recommended
>>>>>>> approach is to place the required jars, to use the HDFS Azure
>>>>>>> protocol, in 'Additional Classpath Resources' and the Hadoop core-site
>>>>>>> and hdfs-site configs into the 'Hadoop Configuration Resources'. I
>>>>>>> have my local HDFS properly configured to access wasb URLs. I'm able
>>>>>>> to ls, copy to and from, etc. without problem. Using the same HDFS
>>>>>>> config files, and trying both all the jars in my hadoop-client/lib
>>>>>>> directory (HDP) and the jars recommended in [1], I'm still seeing the
>>>>>>> "java.lang.IllegalArgumentException: Wrong FS: " error in my NiFi logs
>>>>>>> and am unable to pull files from Azure blob storage.
>>>>>>>
>>>>>>> Interestingly, it seems the processor is spinning up way too fast;
>>>>>>> the errors appear in the log as soon as I start the processor. I'm not
>>>>>>> sure how it could be loading all of those jars that quickly.
>>>>>>>
>>>>>>> Does anyone have any experience with this or recommendations to try?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Austin
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/NIFI-1922
>>>>>>> [2]
>>>>>>> https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html
>>>>>>>
>>>>>
>

