lucene-java-user mailing list archives

From Reyna Melara <reynamel...@gmail.com>
Subject is it possible to index wiki markup files?
Date Wed, 11 Jan 2012 19:13:19 GMT
Hi, my name is Reyna Melara and I'm a PhD student from Mexico. I have a set
of 11,051,447 files with a .txt extension, but the content of each file is
actually in wiki format. I need to index them, but I don't know whether I
first have to convert the content to flat text. I have been reading and
found this:

"At the core of Lucene's logical architecture is the idea of a *document*
containing *fields* of text. This flexibility allows Lucene's API to be
independent of the file format <http://en.wikipedia.org/wiki/File_format>.
Text from PDFs <http://en.wikipedia.org/wiki/Portable_Document_Format>,
HTML <http://en.wikipedia.org/wiki/HTML>,
Microsoft Word <http://en.wikipedia.org/wiki/Microsoft_Word>, and
OpenDocument <http://en.wikipedia.org/wiki/OpenDocument> documents, as well
as many others (except images), can all be indexed as long as their textual
information can be extracted."

So I guess there is no problem if I leave the files just as they are.
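
To make the "convert to flat text" alternative concrete, here is a minimal
sketch in plain Java of stripping a few common wiki constructs before the
text is handed to an indexer. The class name WikiMarkupStripper and its
regex rules are assumptions made for illustration only. They cover just
links, bold/italic quotes, and headings, not templates, tables, or
references, and they are not part of Lucene.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: removes a small subset of wiki markup. */
public class WikiMarkupStripper {
    // [[target|label]] -> label
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    // [[target]] -> target
    private static final Pattern PLAIN_LINK =
            Pattern.compile("\\[\\[([^\\]|]*)\\]\\]");
    // '''bold''' and ''italic'' -> drop runs of two or more apostrophes
    private static final Pattern QUOTES = Pattern.compile("'{2,}");
    // == Heading == -> Heading (matched line by line)
    private static final Pattern HEADING =
            Pattern.compile("(?m)^=+[ \\t]*(.*?)[ \\t]*=+[ \\t]*$");

    public static String strip(String wiki) {
        String s = wiki;
        s = PIPED_LINK.matcher(s).replaceAll("$1");
        s = PLAIN_LINK.matcher(s).replaceAll("$1");
        s = QUOTES.matcher(s).replaceAll("");
        s = HEADING.matcher(s).replaceAll("$1");
        return s;
    }

    public static void main(String[] args) {
        String sample = "== History ==\n"
                + "'''Lucene''' is a [[search engine|search]] library.";
        System.out.println(strip(sample));
    }
}
```

The stripped string could then be stored in a text field of a Lucene
document, while the raw markup would instead be tokenized along with its
brackets and equals signs.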

My question is: would I get the same results and advantages from these
files as they are? Would that work well?

Thanks a lot, and best regards.


-- 
Reyna
