lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Indexing multiple keywords in one field?
Date Sun, 29 May 2005 23:38:32 GMT

On May 29, 2005, at 8:29 AM, Doug Hughes wrote:

> Hi,
>
> I'm working on a pretty typical web page search system based on  
> lucene.
> Pretty much everything works great.  However, I'm having one  
> problem.  I
> want to have a feature in this system where I can find all pages  
> which link
> to another page.  So, for instance, I might search for all the  
> pages linked
> to http://www.foobar.com/index.html.  The search term does not need  
> to be
> fuzzy in any way.  http://www.foobar.com would not match
> http://www.foobar.com/.  The thing is that any for any given  
> document I
> could have any number of associated links.
>
> I think that each page's links could be treated as an array of  
> keywords.
> However, I don't know the best practice for indexing this data or  
> how to
> find matches for specific links.

One possibility is to extract the links (XPath could do this with a  
"//a" pattern) during a parsing phase, not during an analyzer.  Build  
a list of links and index each one as a separate Field.Keyword()  
field for a single document.

> I tried creating a LinebreakAnalyzer which (I think) tokenized  
> phrases based
> on CRs and LFs. I converted the array of links to a list of links  
> delimited
> by LFs.  When indexing I used the PerFieldAnalyzerWrapper and set  
> the links
> field to use the LinebreakAnalyzer.  My understanding is that the  
> lucene
> index should now have each of the links indexed as separate terms or
> keywords (sorry if my vocabulary is wrong!)

Links are broken per line?  For general HTML parsing you certainly  
cannot assume that, but maybe in your documents you can?  I'd be  
surprised at that though.

> Now, all that seems to work fine.  However, when I search I build I  
> query
> using this code:
>
> QueryParser.parse(link, "links", new LinebreakAnalyzer())
>
> The link is the link I'm searching for, "links" is the field I'm  
> searching.
> I'm using the same analyzer I used to index the links.  The problem  
> is I
> don't get any matches at all when I execute the search.
>
> Does anyone know of any better techniques for this?  Or does anyone  
> see
> anything I'm doing wrong

The first thing to do is ensure what you think was indexed really  
was.  I highly recommend you get Luke - http://www.getopt.org/luke/ -  
and explore the index you've built and see what terms were indexed in  
your links field.  Then experiment using the TermQuery API for the  
_exact_ terms indexed.  Only then move up to QueryParser if you need  
that kind of thing, using Query.toString() to dump the generated  
query instance and see what it is made of.  QueryParser introduces a  
level of complexity that can be confusing because there is query  
expression operators, parsing, and analysis all mixed together - and  
some characters in a URL within a QueryParser expression will need to  
escaped to be interpreted properly (like the ":" in "http://").

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message