accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: bulk ingest without mapred
Date Tue, 08 Apr 2014 16:41:43 GMT
Paul,

It might be a good idea to re-read a basic overview of HDFS. You
shouldn't be modifying anything beneath the HDFS data directories.
Those directories on the local filesystem are used by HDFS to create a
distributed filesystem (which is what Accumulo is using).

The paths that you provide to Accumulo for bulk imports all live on
that distributed filesystem, which you should modify using the hadoop
or hdfs executable on the command line, or the FileSystem API with your
hdfs-site.xml configuration file on the classpath.
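
Very roughly (the class name and the /tmp/bulk paths below are just
placeholders, not anything from your ingester), creating the bulk import
directories through the FileSystem API looks something like this; the
fs.getUri() print is a quick sanity check that you're actually talking to
HDFS (hdfs://...) and not the local filesystem (file:///):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkDirSetup {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster's *-site.xml files from the classpath, so the
    // default FileSystem is the distributed one, not the local one.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Should print something like hdfs://namenode:8020, not file:///
    System.out.println("Default filesystem: " + fs.getUri());

    // These are HDFS paths, not paths under dfs.data.dir on a datanode.
    Path bulk = new Path("/tmp/bulk");
    Path failures = new Path("/tmp/bulk_failures");
    fs.mkdirs(bulk);
    fs.mkdirs(failures);

    // The namenode now knows about the directories, so
    // getFileStatus()/exists() will see them.
    System.out.println("/tmp/bulk exists in HDFS: " + fs.exists(bulk));
  }
}

The command line equivalents (hadoop fs -mkdir, hadoop fs -put, and so on)
go through the same namenode, so either route makes the files visible to
calls like the getFileInfo() you pasted below; writing directly under
/data/accu1/hdfs on a datanode does not.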

On Tue, Apr 8, 2014 at 12:36 PM, pdread <paul.read@siginttech.com> wrote:
> My hdfs-site.xml has the data nodes (space?) defined as
>
> <property>
>         <name>dfs.data.dir</name>
>         <value>/data/accu1/hdfs,/data/accu2/hdfs</value>
> </property>
>
> So I created the files/directories under /data/accu1/hdfs/tmp/bulk, and indeed
> they were there.
>
> After more exploring I found the Hadoop code that is causing the problem:
> DFSClient.getFileInfo() is returning null.
>
>  public FileStatus getFileInfo(String src) throws IOException {
>     FileStatus fileStatus;
>
>     checkOpen();
>     try {
>       if (fileStatusCache != null) {
>         fileStatus = fileStatusCache.get(src);
>         if (fileStatus != FileStatusCache.nullFileStatus) {
>           return fileStatus;
>         }
>       }
>       fileStatus = namenodeProtocolProxy == null ?
>           versionBasedGetFileInfo(src) : methodBasedGetFileInfo(src);
>       if (fileStatusCache != null) {
>         fileStatusCache.set(src, fileStatus);
>       }
>
>       return fileStatus;
>     } catch (RemoteException re) {
>       throw re.unwrapRemoteException(AccessControlException.class);
>     }
>  }
>
> So the question now is why this is the case. I noticed that no logging was done to
> the hadoop logs, specifically the namenode and datanode logs. The DFSClient
> code refers to rpc calls, which would suggest it's connecting into the hadoop
> system and not looking at the disk directly. Since I used FileSystem to do
> the file manipulation, is there additional bookkeeping that needs to be done
> to let the "hadoop" system know there are files out there? In other words,
> even though I used hadoop to create the files, does "hadoop" proper know
> about them? If not, then what bookkeeping has to be done to get them into the
> system?
>
> Just a guess here. But since the files are clearly there and clearly available,
> there must be something else at play.
>
> Thanks
>
> Paul
>
>
>
> --
> View this message in context: http://apache-accumulo.1065345.n5.nabble.com/bulk-ingest-without-mapred-tp8904p8914.html
> Sent from the Users mailing list archive at Nabble.com.
