hadoop-mapreduce-user mailing list archives

From phi...@free.fr
Subject Application to parse and index PDFs
Date Thu, 16 Jan 2014 12:03:12 GMT


I would like to develop an application that indexes place and people names, dates, numbers,
and monetary amounts, among other things, contained in thousands of PDFs. People and place
names would be looked up in gazetteers (i.e., dictionaries), and dates, numbers, and amounts
would be normalized so as to be comparable (e.g., find all PDFs whose contents contain dates >
20010101 and < 20100101).
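
To make the normalization step concrete, here is a minimal Python sketch. The accepted date
formats, the function names (normalize_date, normalize_amount, docs_in_date_range), and the
in-memory index layout are all assumptions made for illustration; they do not come from GATE,
UIMA, OpenNLP, or Hadoop. Amounts are assumed to use "," as a thousands separator and "." as
the decimal point.

```python
import re
from datetime import datetime

# Hypothetical list of date formats; the formats actually found in the PDFs are unknown.
DATE_FORMATS = ("%d %B %Y", "%d/%m/%Y", "%Y-%m-%d")


def normalize_date(text):
    """Return the date as a YYYYMMDD string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


def normalize_amount(text):
    """Return a number or monetary amount as a float, or None if it cannot be parsed.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def docs_in_date_range(index, low, high):
    """Return names of documents containing at least one date strictly between low and high.

    `index` maps a document name to a list of normalized YYYYMMDD strings.
    Fixed-width YYYYMMDD strings compare correctly as plain strings.
    """
    return sorted(
        name
        for name, dates in index.items()
        if any(low < d < high for d in dates)
    )
```

With an index such as {"a.pdf": ["20050301"], "b.pdf": ["19991231"]}, calling
docs_in_date_range(index, "20010101", "20100101") would select only a.pdf.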

The indexed documents would then be sorted according to certain criteria, e.g., document name
or date, so that searches made by users through a search engine (e.g., Solr) yield documents
ordered according to those criteria.
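
The kind of ordering meant here can be sketched in a few lines of Python. The record layout
(a "name" and a normalized "date" field) and the helper sort_records are hypothetical and
only serve the example; a real index would hold many more fields.

```python
from operator import itemgetter

# Hypothetical index records, one per PDF, with a normalized YYYYMMDD date.
records = [
    {"name": "report_b.pdf", "date": "20090512"},
    {"name": "report_a.pdf", "date": "20030120"},
    {"name": "report_c.pdf", "date": "20030120"},
]


def sort_records(records, criteria):
    """Return the records sorted by the given field names, in priority order."""
    return sorted(records, key=itemgetter(*criteria))


by_name = sort_records(records, ["name"])
by_date_then_name = sort_records(records, ["date", "name"])
```

Sorting on ["date", "name"] orders the documents by date first and breaks ties by name.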

The index would then be transferred to a search engine.

Can anyone tell me whether Hadoop would be useful in any part of the development of this
application, e.g., the index storage and sorting part?

Furthermore, are Hadoop crawlers, such as Nutch, good substitutes for parsing and indexing
tools such as GATE, UIMA, or OpenNLP? For instance, can the token recognition tasks described
above be done with such crawlers?

Many thanks.

