hadoop-common-user mailing list archives

From Marco Didonna <m.didonn...@gmail.com>
Subject Distributed indexing with Hadoop
Date Fri, 28 Jan 2011 10:49:56 GMT
Hello everyone,
I am building a Hadoop "app" to quickly index a corpus of documents.
The app accepts one or more XML files that contain the corpus.
Each document is made up of several sections: title, authors,
body... these sections are not fixed and depend on the collection. Here's
a glimpse of what the XML input file looks like:

<document id='1'>
<field name='title'> the divine comedy </field>
<field name='author'>Dante</field>
<field name='body'>halfway along our life's path.......</field>
</document>
<document id='2'>

...

</document>

I would like to discuss some implementation choices:

- what is the best way to "tell" my Hadoop app which sections to expect
between the <document> and </document> tags? (See the first sketch after
this list.)

- is it more appropriate to implement a record reader that passes the
whole content of the document tag to the mapper, or one that passes it
section by section? I was also wondering which parser to use, a DOM-like
one or a SAX-like one... any efficient library to recommend? (See the
second sketch after this list.)

- do you know any library I could use to process text? By text
processing I mean common preprocessing operations like tokenization and
stopword elimination... I was thinking of using Lucene's engine... could
it become a bottleneck? (See the third sketch after this list.)

I am looking forward to reading your opinions.

Thanks,

Marco

