lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthew Hall <mh...@informatics.jax.org>
Subject Re: How to index IP addresses?
Date Thu, 30 Jul 2009 14:02:48 GMT
I'm a little unclear on how you could be getting both "aa.bb.cc.dd" as a 
term, and then also the octets.

Are you adding the "contents" field into the index multiple times, 
possibly with separate analyzers?

Could you possibly try a test, very simple case?

Just create an index with a single lucene document, with that documents 
contents being "aa.bb.cc.dd" and then take a look at the index via Luke 
again.

When you look at the terms section (Its what comes up by default) you 
SHOULD see only "aa", "bb", "cc", and "dd" as the top (and thusly ONLY 
terms in the index).  This could vary depending on your analyzer, as 
some will show an index containing only a single term "aa.bb.cc.dd".  
What I would not expect is an index that would contain both.

Furthermore by making the field not analyzed you will now have a 
trickier time searching for it.  As you will need to use a keyword 
analyzer or something similar to search, which if I'm understanding the 
spirit of your problem isn't really something that you want to do.

So, if you could run that test scenario that I've outlined for you I 
think you should be able to have a nice test bed to see what the results 
of swapping to different analyzers will have on the data that you are 
trying to index.  Then, after you have played with that a bit you should 
be able to re-expand your corpus again, and see if the analyzer you have 
chosen continues to stand up. 

I.. had thought that StandardAnalyzer already kept IP addresses together 
as a single token, but maybe its doing something... special and 
interesting and thusly you are seeing the behavior that you are describing.

Matt

ohaya@cox.net wrote:
> Hi,
>
> Oh.  Ok, thanks!  I'll give that a try.
>
> Jim
>
>
> ---- "Armasu wrote: 
>   
>> Keyword: Field.Index.NOT_ANALYZED
>>
>> -----Original Message-----
>> From: ohaya@cox.net [mailto:ohaya@cox.net] 
>> Sent: Thursday, July 30, 2009 4:36 PM
>> To: java-user@lucene.apache.org
>> Subject: How to index IP addresses?
>>
>> Hi,
>>
>> I am trying to index information in some proprietary-formatted files.  
>>
>> In particular, these files contain some IP addresses in dotted notation, e.g., aa.bb.cc.dd.
>>
>> For my initial test, I have a Document implementation, and after I extract what I
need into a String named "Info", I do:
>>
>> doc.add(new Field("contents", Info, Field.Store.YES, Field.Index.ANALYZED));
>>
>> From looking at the resulting index using Luke, it appears that I am getting terms
for the full IP address string (e.g., "aa.bb.cc.dd"), but I am also getting terms for each
octet of each IP address string, e.g.:
>>
>> aa
>> bb
>> cc
>> dd
>>
>> I'm still just getting started with Lucene, but from the research that I've done,
it seems like Lucene is treating the "." in the dotted notation strings as "noise".  Is that
correct?
>>
>> If so, is there a way to get it not to do that?
>>
>> Thanks,
>> Jim
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>
> Amazon Development Center (Romania) S.R.L. registered office: 37 Lazar Street, floor
5, Iasi, Iasi County, Iasi 700049, Romania. Registered in Romania. Registration number J40/12967/2005.
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message