lucene-solr-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject WikipediaTokenizer for Removing Unnecessary Parts
Date Tue, 23 Jul 2013 14:53:32 GMT
Hi;

I have indexed Wikipedia data with the Solr DIH. However, when I look at the
data that is indexed in Solr, I see wiki markup like this as well:

{| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
|- valign="top"
| style="width: 50%"|
:*[[Ubuntu]]
:*[[Fedora]]
:*[[Mandriva]]
:*[[Linux Mint]]
:*[[Debian]]
:*[[OpenSUSE]]
|
*[[Red Hat]]
*[[Mageia]]
*[[Arch Linux]]
*[[PCLinuxOS]]
*[[Slackware]]
|}

I want to remove this markup before indexing. I know that there is a
WikipediaTokenizer in Lucene, but how can I remove the unnecessary parts
(such as links, styles, etc.) with Solr?
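
To make the goal concrete, here is a rough sketch of the kind of cleanup I
mean, written as a plain regex pre-processing step in Java. It is only an
illustration under my own assumptions: the class name WikiMarkupStripper and
the method strip are hypothetical, it does not use WikipediaTokenizer or any
Lucene/Solr API, and it only handles the table, list, and link markup shown in
the sample above, not wiki syntax in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
final class WikiMarkupStripper {
    // [[Target|Label]] -> Label, [[Target]] -> Target
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");
    // Whole lines that only control a table: "{| ...", "|}", "|- ..."
    private static final Pattern TABLE_CONTROL =
            Pattern.compile("^\\s*(\\{\\||\\|\\}|\\|-).*$");
    // Cell attribute prefix such as: | style="width: 50%"|
    private static final Pattern CELL_ATTRS =
            Pattern.compile("^\\s*[|!]\\s*[^|\\[]*=[^|\\[]*\\|");
    // Leading list/indent markers (: * # ;) and cell delimiters (| !)
    private static final Pattern LEADING =
            Pattern.compile("^[\\s:*#;|!]+");

    static String strip(String wikiText) {
        List<String> kept = new ArrayList<>();
        for (String line : wikiText.split("\\R")) {
            if (TABLE_CONTROL.matcher(line).matches()) {
                continue; // drop table start, end, and row-separator lines
            }
            String s = CELL_ATTRS.matcher(line).replaceFirst("");
            s = LEADING.matcher(s).replaceFirst("");
            s = LINK.matcher(s).replaceAll("$1").trim();
            if (!s.isEmpty()) {
                kept.add(s); // keep only lines that still carry text
            }
        }
        return String.join("\n", kept);
    }
}
```

With this sketch, a line such as `:*[[Ubuntu]]` would be reduced to `Ubuntu`,
and the table and style lines would be dropped entirely. What I am looking for
is the equivalent of this inside Solr itself, so that the cleanup happens
during indexing.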
