Reply-To: java-user@lucene.apache.org
Message-ID: <20090730103740.W81N9.23516.imail@eastrmwml43>
Date: Thu, 30 Jul 2009 10:37:40 -0400
From: <ohaya@cox.net>
To: java-user@lucene.apache.org
Subject: Re: How to index IP addresses?
Cc: Matthew Hall <mhall@informatics.jax.org>
In-Reply-To: <4A71A808.1090808@informatics.jax.org>

Hi Matthew and Narcis,

I think that I found the (original) problem.

It looks like all those other terms that I was getting, which looked to me like the octets, weren't the octets :)...

When I was doing the doc.add(), there were some other numbers (not IP addresses) in the String that I was passing to doc.add(...).

BTW, I did try Narcis' suggestion of changing to NOT_ANALYZED before I found my problem. That looked like it indexed the entire string that I was passing to doc.add(...) as a single term, so when I searched, I didn't get any results.

So, I think the original ANALYZED is ok.
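
To illustrate the difference, here is a minimal sketch in plain Java. It does not use Lucene at all: analyzedTerms() is a hypothetical stand-in that simply splits on whitespace and lowercases, which is an assumption for illustration and not what StandardAnalyzer really does. The class name, method names and sample string are made up as well. It only shows why other numbers in the string turn up as terms when the field is analyzed, and why a search for just the IP address finds nothing when the whole string is one term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TermSketch {
    // Toy model of an ANALYZED field: one term per whitespace-separated token.
    // Assumption for illustration only; a real Lucene analyzer has its own rules.
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase());
            }
        }
        return terms;
    }

    // Toy model of a NOT_ANALYZED field: the entire value is a single term.
    static List<String> notAnalyzedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        // Hypothetical extracted string: an IP address plus other numbers.
        String info = "host 10.1.2.3 port 8080";

        // The IP address is one term, but "8080" is a term too.
        System.out.println(analyzedTerms(info));

        // The whole string is the only term.
        System.out.println(notAnalyzedTerms(info));

        // An exact-term lookup for the IP address only matches the analyzed form.
        System.out.println(analyzedTerms(info).contains("10.1.2.3"));    // true
        System.out.println(notAnalyzedTerms(info).contains("10.1.2.3")); // false
    }
}
```

In this toy model, the only way to match the non-analyzed form would be to search for the complete original string.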

Sorry about that!!

Jim


---- Matthew Hall <mhall@informatics.jax.org> wrote: 
> I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a 
> term, and then also the octets.
> 
> Are you adding the "contents" field into the index multiple times, 
> possibly with separate analyzers?
> 
> Could you possibly try a test, very simple case?
> 
> Just create an index with a single Lucene document, with that document's 
> contents being "aa.bb.cc.dd" and then take a look at the index via Luke 
> again.
> 
> When you look at the terms section (it's what comes up by default) you 
> SHOULD see only "aa", "bb", "cc", and "dd" as the top (and thus the ONLY) 
> terms in the index.  This could vary depending on your analyzer, as 
> some will show an index containing only a single term "aa.bb.cc.dd".  
> What I would not expect is an index that would contain both.
> 
> Furthermore, by making the field not analyzed you will now have a 
> trickier time searching for it, as you will need to use a keyword 
> analyzer or something similar to search, which, if I'm understanding the 
> spirit of your problem, isn't really something that you want to do.
> 
> So, if you could run that test scenario that I've outlined for you I 
> think you should be able to have a nice test bed to see what effect 
> swapping to different analyzers will have on the data that you are 
> trying to index.  Then, after you have played with that a bit you should 
> be able to re-expand your corpus again, and see if the analyzer you have 
> chosen continues to stand up.
> 
> I... had thought that StandardAnalyzer already kept IP addresses together 
> as a single token, but maybe it's doing something... special and 
> interesting, and thus you are seeing the behavior that you are describing.
> 
> Matt
> 
> ohaya@cox.net wrote:
> > Hi,
> >
> > Oh.  Ok, thanks!  I'll give that a try.
> >
> > Jim
> >
> >
> > ---- "Armasu wrote: 
> >   
> >> Keyword: Field.Index.NOT_ANALYZED
> >>
> >> -----Original Message-----
> >> From: ohaya@cox.net [mailto:ohaya@cox.net] 
> >> Sent: Thursday, July 30, 2009 4:36 PM
> >> To: java-user@lucene.apache.org
> >> Subject: How to index IP addresses?
> >>
> >> Hi,
> >>
> >> I am trying to index information in some proprietary-formatted files.  
> >>
> >> In particular, these files contain some IP addresses in dotted notation, e.g., aa.bb.cc.dd.
> >>
> >> For my initial test, I have a Document implementation, and after I extract what I need into a String named "Info", I do:
> >>
> >> doc.add(new Field("contents", Info, Field.Store.YES, Field.Index.ANALYZED));
> >>
> >> From looking at the resulting index using Luke, it appears that I am getting terms for the full IP address string (e.g., "aa.bb.cc.dd"), but I am also getting terms for each octet of each IP address string, e.g.:
> >>
> >> aa
> >> bb
> >> cc
> >> dd
> >>
> >> I'm still just getting started with Lucene, but from the research that I've done, it seems like Lucene is treating the "." in the dotted notation strings as "noise".  Is that correct?
> >>
> >> If so, is there a way to get it not to do that?
> >>
> >> Thanks,
> >> Jim
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >>     
> >
> >
> >
> >
> >
> >
> >   
> 
> 
> -- 
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> mhall@informatics.jax.org
> (207) 288-6012
> 
> 
> 
> 

