lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: More IP/MAC indexing questions
Date Wed, 01 Aug 2007 17:15:24 GMT
First, consider using your own analyzer and/or breaking the IP addresses
up by substituting ' ' for '.' upon input. Otherwise, you'll have endless
issues as time passes......

But on to your question. Please post what you mean by
"a large number". 10,000? 1,000,000,000? we have no clue
from your posts so far...

That said, efficiency is hugely overrated at this stage of your
design. I'd personally use whatever is easiest and run some
tests.

Just index them as single (unbroken) tokens to start and search
your partial address with PrefixQuery. Or index them as
individual tokens and create a SpanFirstQuery. Or...

And measure <G>.

Best
Erick

On 8/1/07, Joe Attardi <jattardi@gmail.com> wrote:
>
> Hi again, everyone. First of all, I want to thank everyone for their
> extremely helpful replies so far.
> Also, I just started reading the book "Lucene in Action" last night. So
> far
> it's an awesome book, so a big thanks to the authors.
>
> Anyhow, on to my question. As I've mentioned in several of my previous
> messages, I am indexing different pieces of information about servers - in
> particular, my question is about indexing the IP address and MAC address.
>
> Using the StandardAnalyzer, an IP is kept as a single token ("
> 192.168.1.100"),
> and a MAC is broken up into one token per octet ("00", "17", "fd", "14",
> "d3", "2a"). Many searches will be for partial IPs or MACs ("192.168",
> "00:17:fd", etc).
>
> Are either of these methods of indexing the addresses (single token vs
> per-octet token) more or less efficient than the other when indexing large
> numbers of these?
>
> --
> Joe Attardi
> jattardi@gmail.com
> http://thinksincode.blogspot.com/
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message