lucene-dev mailing list archives

From Eugenio Martinez
Subject Re: Potential bug in StandardTokenizerImpl
Date Tue, 27 Nov 2007 10:39:58 GMT

I am the person who raised the question about the acronym/host detection anomaly in the StandardAnalyzer.

Thanks to Shai Erera for bringing the discussion over to the developers' list. I am surprised
by Chris Hostetter's response, as this issue was addressed by Erik Hatcher on November 22,
2005. I am now working through Hatcher's superb book, Lucene in Action, trying to work around
this issue, but I can't believe it hasn't been fixed yet.

As I explained on the users' list, I've found that indexing fails to include certain emails
and words that are present in the log files when I run an IndexWriter over a huge directory
of logs. While trying to isolate this bug, I ran into the acronym interpretation issue. There
may be more hidden anomalies in the StandardAnalyzer's behavior under such a huge load.

At this point I can say the behavior is deterministic: I can reproduce it across subsequent
index and search calls, and it affects the same words and emails every time. Could
it be a side effect of document vectorization, given that the logs are not natural language? As
Lucene computes whether a token conveys relevant information (as the vector space model states),
could Lucene have decided that these tokens are not relevant? All of this assumes it works
as intended, of course...

Does anyone have any ideas about this, or has anyone heard of something similar?

Thanks and regards.

Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

TFN: 981 173 206            FAX: 981 173 223

