lucene-java-user mailing list archives

From Sashidhar Guntury <sashidhar.mo...@gmail.com>
Subject problem with wikipedia tokenizer
Date Tue, 19 Mar 2013 19:13:33 GMT
Hi,

I'm using Lucene to query a Wikipedia dump and extract the categories. I
retrieve the relevant documents, and for every document I call the
function below.

static List<String> getCategories(Document document) throws IOException
{
    List<String> categories = new ArrayList<String>();
    String text = document.get("text");
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
    while (tf.incrementToken())
    {
        String tokText = termAtt.toString();
        if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY))
        {
            categories.add(tokText);
        }
    }
    return categories;
}

but it throws the following exception (at the while statement):

    Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
        at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
        at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
        at SearchIndex.getCategories(SearchIndex.java:82)
        at SearchIndex.main(SearchIndex.java:54)

I looked at the zzRefill() function, but I'm not able to understand it. Is
this a known bug or something? I don't know what I'm doing wrong. The
Lucene docs say that the whole WikipediaTokenizer module is experimental
and may be subject to change. I was hoping someone could help me.
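In case it helps anyone diagnose this, here is a variant I would expect to
work if the problem is the TokenStream lifecycle in recent Lucene versions
(reset() must be called before the first incrementToken(), and end()/close()
afterwards). This is only a guess at the fix, not something I've confirmed
against my setup; the class name is just for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;
import org.apache.lucene.document.Document;

// Hypothetical fix: honor the TokenStream contract by calling reset()
// before consuming tokens, and end()/close() when done. Without reset(),
// the tokenizer's input buffer is uninitialized, which would explain the
// NullPointerException inside zzRefill().
class CategoryExtractor
{
    static List<String> getCategories(Document document) throws IOException
    {
        List<String> categories = new ArrayList<String>();
        String text = document.get("text");
        WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
        CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
        TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
        try
        {
            tf.reset(); // the call missing from my original code
            while (tf.incrementToken())
            {
                // keep only tokens tagged as categories, e.g. from [[Category:Foo]]
                if (WikipediaTokenizer.CATEGORY.equals(typeAtt.type()))
                {
                    categories.add(termAtt.toString());
                }
            }
            tf.end();
        }
        finally
        {
            tf.close();
        }
        return categories;
    }
}
```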

thanks
sashidhar
