From: "Enis Soztutar (JIRA)"
To: nutch-dev@lucene.apache.org
Date: Tue, 6 Feb 2007 06:04:05 -0800 (PST)
Subject: [jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v1.0.patch

This is a plugin implementation for indexing and scoring top
level domains in Nutch. TLDs are stored in the TLDEntry class, which has domain, status, and boost fields. The TLDs are read from an XML file, and an XSD is provided for validation.

TLDIndexingFilter implements the IndexingFilter interface to index the domain extension (such as "net", "org", "uk", "de") in the tld field.

TLDScoringFilter implements the ScoringFilter interface. This filter multiplies the initial boost (coming from another scoring filter such as OPIC) by the boost of the domain. For example, setting the boost of "edu" domains to 1.1 multiplies the index boost of documents from educational sites by 1.1. Local search engines may also wish to boost the domains hosted in their own country; for example, boosting "de" domains slightly in a German search engine seems reasonable. An alternative use is to lower the boosts of domains such as "biz" or "info", which are known to have lots of spam.

Users can also query the tld field for advanced search.

Implementation notes:
1. OpicScoringFilter is changed to respect ScoringFilter chaining.
2. Some second-level domains, such as co.uk, are not recognized, but edu.uk is recognized.

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS system. TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three groups: infrastructure, generic (such as "com", "edu"), and country-code TLDs (such as "uk", "de", "tr"). Indexing the top level domain, and optionally boosting by it, is needed for improving the search results and enhancing locality.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
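
The boost multiplication described for TLDScoringFilter can be illustrated with a small standalone sketch. This is not the code from tld_plugin_v1.0.patch and does not use the Nutch ScoringFilter API; the class name TldBoostSketch, both method names, and the neutral default of 1.0 for unlisted TLDs are assumptions made for illustration. The sketch looks only at the last label of the host name, so it does not handle second-level domains such as co.uk.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the classes from tld_plugin_v1.0.patch. */
class TldBoostSketch {

    /** Returns the last dot-separated label of a host name, lower-cased. */
    static String tldOf(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        // Drop the trailing dot of a fully qualified name such as "example.org."
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        int lastDot = h.lastIndexOf('.');
        return lastDot < 0 ? h : h.substring(lastDot + 1);
    }

    /** Multiplies the boost from an earlier filter by the boost configured for the host's TLD. */
    static float applyTldBoost(float initialBoost, String host, Map<String, Float> tldBoosts) {
        // Unlisted TLDs get a neutral factor of 1.0 (an assumption of this sketch).
        float factor = tldBoosts.getOrDefault(tldOf(host), 1.0f);
        return initialBoost * factor;
    }
}
```

With a map that assigns 1.1 to "edu", a page on www.example.edu whose earlier filter produced a boost of 2.0 ends up with roughly 2.2, while a host under a TLD that is not in the map keeps its boost unchanged.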