lucene-java-user mailing list archives

From Eugenio Martinez <>
Subject I have found a kind of strange behavior in StandardAnalyzer
Date Mon, 26 Nov 2007 15:47:47 GMT
I am using Lucene to index a huge set of log files, about 130GB of plain text on disk (up to
now), and plan to build a system capable of performing searches over terabytes of such data in
a kind of metaindex built from a mesh of little ones, all of them created and maintained with

File sizes vary randomly, from 1KB to several hundred MB of plain text, and
I have run tests with files of about 2GB, obtaining very good performance in both time and search.
Of course, once we got search results from such a system, we became confident that Lucene was
doing its job correctly, i.e., splitting all contents and indexing all tokens.

But last week, with our first beta release in our LAN environment, some problems arose. In
certain situations we've found that the analysis stage "fails", or rather, behaves
anomalously. We have isolated one case that can be reproduced in the Search window of Luke:
a URL domain that ends with a period, as in "", becomes a single token with
the following text: "wwwmydomaines".
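
One possible workaround, sketched here under stated assumptions and not taken from
Lucene itself, would be to strip the trailing period from hostname-like strings before
the text reaches the analyzer. The class name TrailingDotNormalizer, the method
stripTrailingDots and the hostname pattern are all hypothetical, and www.example.com
is only a placeholder domain. The sketch does not touch the analyzer, so it says nothing
about why the token is built the way it is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingDotNormalizer {
    // Hypothetical pattern: one or more dot-terminated labels, a final label of
    // at least two letters, then a period followed by whitespace or end of input.
    private static final Pattern HOST_WITH_TRAILING_DOT = Pattern.compile(
        "\\b((?:[A-Za-z0-9-]+\\.)+[A-Za-z]{2,})\\.(?=\\s|$)");

    // Returns the text with the sentence-ending period removed after each
    // hostname-like string; everything else is left unchanged.
    public static String stripTrailingDots(String text) {
        Matcher m = HOST_WITH_TRAILING_DOT.matcher(text);
        return m.replaceAll("$1");
    }

    public static void main(String[] args) {
        // Placeholder log line, not taken from the real log files.
        System.out.println(stripTrailingDots("GET from www.example.com. at 15:47"));
    }
}
```

The cleaned string would then be passed to the analyzer in place of the raw line. The
pattern is deliberately narrow: an ordinary sentence-ending period after a plain word,
or an abbreviation such as "e.g.", is left alone, but real log data may need a broader or
stricter rule.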

This behavior may extend to email addresses as well, since we aren't able to get search
results for some addresses that are indeed in the contents of the log file, and the same
happens with some words.

Such behavior is not acceptable to anybody, since in natural language it is common to find
such URLs at the end of a sentence. Is this an effect of document vectorization? I ask
because the structure of log content doesn't follow natural-language rules...

Does anyone have any information about this?

We are working on a Log Analyzer now, but I'm sure I'm not the only one in the world with
this issue... Do you know of anyone else?

Thanks for your attention.
Eugenio F. Martínez Pacheco

Fundación Instituto Tecnológico de Galicia - Área TIC

TFN: 981 173 206            FAX: 981 173 223

