lucene-java-user mailing list archives

From Fred Toth <ft...@synernet.com>
Subject Re: Index + Searching
Date Fri, 15 Oct 2004 12:39:26 GMT
Hi,

I built a MetaTagManager class.

In HTMLDocument, there is a line of code that causes the doc
to be parsed:

HTMLParser parser = new HTMLParser(f);

After parsing, all of the parse results are available. I added:

Properties prop = parser.getMetaTags();
//prop.list(System.out);
MetaTagManager.process(doc, prop);

Then, in the "process" method of MetaTagManager, you have
access to all of the meta tags as well as the document. You
can add them as any sort of fields you need. A typical bit might be:

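         // Look up the description meta tag, if the page has one.
         String desc = prop.getProperty(META_DESCRIPTION);
         if (desc != null) {
             System.out.println("DESC: " + desc);  // debug output
             // UnIndexed: stored for retrieval via doc.get(), but not searchable.
             doc.add(Field.UnIndexed(DOC_DESCRIPTION, desc));
         }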
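
If you want more than one tag, the same pattern extends to a small lookup table. The following is only a rough sketch, not my actual class: the class name MetaTagFieldSketch, the method collect, and the tag and field names in the table are all made up for illustration, and I am assuming the parser hands back the meta tag names as lower-case property keys. It uses only java.util so it can be tried without Lucene on the classpath. Unlike the snippet above, it also trims values and skips empty ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper: picks selected meta tags out of the parsed Properties. */
final class MetaTagFieldSketch {

    // Assumed meta tag name (Properties key) -> assumed index field name.
    // Neither column comes from Lucene; adjust both to your own pages and index.
    private static final String[][] TAG_TO_FIELD = {
        {"name", "name"},
        {"description", "description"},
        {"url", "url"},
    };

    private MetaTagFieldSketch() {
    }

    /**
     * Returns field name -> value for every listed tag that is present and
     * non-blank, in table order. Tags not in the table are ignored.
     */
    static Map<String, String> collect(Properties metaTags) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String[] pair : TAG_TO_FIELD) {
            String value = metaTags.getProperty(pair[0]);
            if (value != null && !value.trim().isEmpty()) {
                fields.put(pair[1], value.trim());
            }
        }
        return fields;
    }
}
```

Inside "process" you would then loop over the returned map and call doc.add(...) once per entry. Field.UnIndexed only stores the value, so it comes back from doc.get() but cannot be searched on; if you also need to search on a field, Field.Text or Field.Keyword would be the ones to look at.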


Be warned, though: As has been discussed on this list several times,
IndexHTML has some significant limitations. The most important
one for me has been the fact that it's not Unicode-clean.

However, I'm still using it!

Good luck,

Fred


At 07:06 PM 10/14/2004, you wrote:
>Hi Fred,
>
>Thanks for replying. I see what you mean by creating tags for name and 
>description. I am not sure how to hack the methods, or where. If you 
>can point me in the right direction, that would really be appreciated. I am 
>thinking about editing the shipped HTMLDocument.java and HTMLParser.java 
>files but am not sure what I need to add. Can you please explain a little more?
>
>TIA.
>-H
>
>Fred Toth wrote:
>>Hi,
>>Your best bet may be to use HTML <meta> tags. Create tags
>>for name, description, etc. (title is already parsed). The HTML parser
>>that ships with Lucene will parse these tags into java Properties. You
>>will need to hack a bit, but you can easily pick these up and add them
>>as specific fields to your index.
>>Fred
>>At 02:42 PM 10/13/2004, you wrote:
>>
>>>Hello,
>>>
>>>I am using the IndexHTML class to index around 30,000 files and it is 
>>>working fine. My question is: is there a way to add multiple fields to 
>>>the index so that, when the actual search is performed, I can extract 
>>>the exact match?
>>>E.g.
>>>the fields can be
>>>1) title - abc
>>>2) name - foo inc,
>>>3) description - Lorem ipsum dolor sit
>>>4) URL - www.lorem.ipsum
>>>
>>>and so on,
>>>
>>>From the search, when a match for title 'abc' is found, calling 
>>>doc.get("name") can then return foo inc, and so on.
>>>
>>>Is this already happening in any other indexing class? If not, what do I 
>>>need to add to the IndexHTML class to accomplish this?
>>>
>>>thanks for all the help gang.
>>>-H
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org

