lucene-java-user mailing list archives

From Ivan Brusic <i...@brusic.com>
Subject Re: is it possible to index wiki markup files?
Date Wed, 11 Jan 2012 19:28:28 GMT
Hi Reyna,

I have never used it, but there is a WikipediaTokenizer defined in the
analyzer contrib:

http://lucene.apache.org/java/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.html

You can find a test case for this tokenizer in the source code.
Hopefully others will have better suggestions.
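
If the tokenizer does not fit, another route is to convert the wiki markup to
flat text before handing it to Lucene. Below is a minimal sketch in plain
Java, with no Lucene dependency. The class name WikiMarkupStripper and its
strip method are hypothetical, made up for illustration. The regular
expressions cover only a few common constructs (non-nested templates,
[[links]], quote-based bold/italic, and = headings =), and this is not how
WikipediaTokenizer works internally. Tables, references, and nested templates
would need a real parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rough, regex-based removal of common wiki markup.
final class WikiMarkupStripper {
    // {{template}} without nesting: dropped entirely
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[target|label]]: keep only the label
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    // [[target]]: keep the target text
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\]|]*)\\]\\]");
    // ''italic'' and '''bold''': remove the quote runs, keep the text
    private static final Pattern QUOTES = Pattern.compile("'{2,}");
    // == Heading ==: keep only the heading text
    private static final Pattern HEADING =
            Pattern.compile("(?m)^=+[ \\t]*(.*?)[ \\t]*=+[ \\t]*$");

    private WikiMarkupStripper() {
    }

    static String strip(String wiki) {
        String text = TEMPLATE.matcher(wiki).replaceAll("");
        text = PIPED_LINK.matcher(text).replaceAll("$1");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = QUOTES.matcher(text).replaceAll("");
        text = HEADING.matcher(text).replaceAll("$1");
        return text.trim();
    }

    public static void main(String[] args) {
        String sample = "== History ==\n"
                + "'''Lucene''' is a [[search engine|search library]] "
                + "written in [[Java]].{{citation needed}}";
        // Prints two lines:
        //   History
        //   Lucene is a search library written in Java.
        System.out.println(strip(sample));
    }
}
```

The resulting string could then be added to a Lucene document as an ordinary
text field, the same way as any other flat text.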

Cheers,

Ivan

On Wed, Jan 11, 2012 at 11:13 AM, Reyna Melara <reynamelara@gmail.com> wrote:
> Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a set
> of 11,051,447 files with a txt extension, but the content of each file is in
> fact in wiki format. I want and need them to be indexed, but I don't know
> if I have to convert this content to flat text. I have been reading and I
> have found that:
>
> "At the core of Lucene's logical architecture is the idea of a *document*
>  containing *fields* of text. This flexibility allows Lucene's API to be
> independent of the file format <http://en.wikipedia.org/wiki/File_format>.
> Text from PDFs <http://en.wikipedia.org/wiki/Portable_Document_Format>,
> HTML <http://en.wikipedia.org/wiki/HTML>, Microsoft Word
> <http://en.wikipedia.org/wiki/Microsoft_Word>, and
> OpenDocument <http://en.wikipedia.org/wiki/OpenDocument> documents, as well
> as many others (except images), can all be indexed as long as their textual
> information can be extracted."
>
> So, I guess there's no problem if I leave the files just like they are
> already.
>
> My question would be: Will I get the same results and advantages with
> these files? Will it work well?
>
> Thanks a lot, and best regards.
>
>
> --
> Reyna


