hive-user mailing list archives

From "Mich Talebzadeh"
Subject RE: File search by hashes in Hadoop
Date Thu, 21 Jan 2016 22:09:21 GMT
Thanks Ritesh,

I see there are two options here:


1.    Use UNIX-like commands on HDFS to find the relevant files:

hdfs dfs -ls -R | grep sales

drwxr-xr-x   - hduser supergroup          0 2015-12-27 06:02 sales

-rw-r--r--   2 hduser supergroup          0 2015-12-27 06:02 sales/_SUCCESS


2.    Index-based searching using Apache Lucene.


a.    Download Apache Lucene, for example lucene-5.4.0.gz. Gunzip it, rename the result to lucene-5.4.0.tar
and untar it. It is a two-minute job.

b.    Set LUCENE_HOME to the directory where you untarred the files:

export LUCENE_HOME=/usr/lib/lucene

c.    Make sure that your CLASSPATH has the following jar files:

CLASSPATH=$CLASSPATH:${LUCENE_HOME}/core/lucene-core-5.4.0.jar:${LUCENE_HOME}/demo/lucene-demo-5.4.0.jar:${LUCENE_HOME}/analysis/common/lucene-analyzers-common-5.4.0.jar:${LUCENE_HOME}/queryparser/lucene-queryparser-5.4.0.jar

d.    Create an index for the directory you want to search, in my case $HADOOP_HOME/etc/hadoop.
When you run the Java command below, a directory called index is created in the directory where you
ran the command:

java -cp $CLASSPATH org.apache.lucene.demo.IndexFiles -docs $HADOOP_HOME/etc/hadoop

e.    Then you can search the index. For example, I am looking for the word ‘yarn’:

java -cp $CLASSPATH org.apache.lucene.demo.SearchFiles

Enter query:


Searching for: yarn

9 total matching documents

1. /home/hduser/hadoop-2.6.0/etc/hadoop/keep/mapred-site.xml

2. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-env.cmd

3. /home/hduser/hadoop-2.6.0/etc/hadoop/

4. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml_ok

5. /home/hduser/hadoop-2.6.0/etc/hadoop/yarn-site.xml

6. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml

7. /home/hduser/hadoop-2.6.0/etc/hadoop/mapred-site.xml_pre

8. /home/hduser/hadoop-2.6.0/etc/hadoop/hadoop-policy.xml

9. /home/hduser/hadoop-2.6.0/etc/hadoop/

Press (q)uit or enter number to jump to a page.
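
The long CLASSPATH line used for the Lucene demo is easy to mistype, so it may help to build it from the install directory and version. The following is only a sketch: lucene_classpath is a hypothetical helper name, and it assumes the jar layout of the 5.4.0 distribution shown in the CLASSPATH line above. Note that the demo indexes files on the local file system (here the Hadoop configuration directory), not files stored in HDFS.

```shell
# Build the Lucene demo classpath from an install directory and a version.
# lucene_classpath is a hypothetical helper, not part of Lucene; the jar
# layout is assumed to match the 5.4.0 distribution used in this message.
lucene_classpath() {
    home=$1
    version=$2
    printf '%s:%s:%s:%s\n' \
        "$home/core/lucene-core-$version.jar" \
        "$home/demo/lucene-demo-$version.jar" \
        "$home/analysis/common/lucene-analyzers-common-$version.jar" \
        "$home/queryparser/lucene-queryparser-$version.jar"
}

# Usage (needs Java and an untarred Lucene, so it is left commented out):
#   CP=$(lucene_classpath "$LUCENE_HOME" 5.4.0)
#   java -cp "$CP" org.apache.lucene.demo.IndexFiles -docs "$HADOOP_HOME/etc/hadoop"
#   java -cp "$CP" org.apache.lucene.demo.SearchFiles
```

Keeping the version in one argument means an upgrade only changes the call, not four jar names.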


Pretty useful
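
For option 1, plain grep also matches the owner, group and date columns of the listing. A minimal sketch of a stricter filter follows; match_paths is a hypothetical helper name, and it assumes the eight-column listing format shown above and paths without spaces. It prints only the path column for entries whose file name matches a shell-style wildcard.

```shell
# Filter `hdfs dfs -ls -R` style output read from standard input.
# match_paths is a hypothetical helper, not an HDFS command.
# Assumes 8 whitespace-separated columns and no spaces in paths.
match_paths() {
    pattern=$1
    # Keep only full listing lines and print the last column (the path).
    awk 'NF >= 8 { print $NF }' | while IFS= read -r path; do
        # Compare the wildcard against the file name, not the whole path.
        case ${path##*/} in
            $pattern) printf '%s\n' "$path" ;;
        esac
    done
}

# Usage (needs a running HDFS, so it is left commented out):
#   hdfs dfs -ls -R | match_paths 'sal*'
```

Because the pattern is applied to the last path component only, 'sal*' matches the sales directory itself but not sales/_SUCCESS.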



Dr Mich Talebzadeh








From: Ritesh Kumar Singh
Sent: 21 January 2016 17:10
Subject: Re: File search by hashes in Hadoop


Yes, it's possible to do both:

1. Index-based searching:

2. Wildcard-based / expression-based searching:



Ritesh Kumar Singh,



On Thu, Jan 21, 2016 at 4:15 PM, Mich Talebzadeh wrote:

Hi all,


Apologies for the nature of this question. 


Someone asked me whether it is possible to perform file search by hashes in Hadoop.


I think he may mean wildcard searches in HDFS?


Does anyone have an idea what file search by hash means in Hadoop?






