From: Daniel Stephan
Date: Tue, 26 Apr 2005 09:21:05 +0200
To: general@lucene.apache.org
Subject: Indexing of virtual "made up" documents

Hi there,

let's see if somebody is listening on this list :-D

I wonder if the following is possible with Lucene. I would like to add documents to the index which aren't real documents. :-) Meaning: there is no text to parse and tokenize. What I have is a number of features; some are simple words, some are combinations of words. Those features classify an entity in my database. I also have my own parser/analyzer/tokenizer which can take a text (possibly a query) and extract those features from it.

So I want to do something like this (pseudo-code):

  Lucene.index(myEntity.getId(), myEntity.getDescriptors())

and then, when a query is issued:

  List entityIds = Lucene.query(myQuery.convertToLuceneQueryLanguage())

I was looking at the source and couldn't find a way to skip the analyzing stage and hand the features to Lucene myself. One possibility would be to use an analyzer that only treats whitespace as a delimiter and to pass all descriptors as one string. This feels suboptimal, because I already have them as single tokens, and concatenating them only so that Lucene can tokenize them again should not be necessary.

Also, the neighbour information isn't applicable in my scenario. It seems Lucene uses the placement of terms somehow, and I don't have placement information. Would that hurt Lucene? I am not sure how Lucene uses the placement information, but in the described case, where I concatenate all my features into a whitespace-delimited text, I fear that Lucene would use the placement of the features in this made-up text and come to some wrong conclusions (after all, the placement in the "made-up" text is arbitrary).
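To make that workaround concrete, here is roughly how I picture the indexing side. It is only a sketch written from memory against the 1.4 API and not tested; the class name, the index path and the field names ("id" and "features") are just my own choices, not anything Lucene prescribes.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class FeatureIndexer {

    // Index one entity: the id is stored untokenized, and the
    // pre-extracted features are glued together with spaces so that
    // the WhitespaceAnalyzer splits them back into exactly the same
    // tokens my own extractor produced.
    public static void index(IndexWriter writer, String entityId, List features)
            throws Exception {
        StringBuffer text = new StringBuffer();
        for (Iterator it = features.iterator(); it.hasNext();) {
            text.append((String) it.next()).append(' ');
        }
        Document doc = new Document();
        doc.add(Field.Keyword("id", entityId));               // stored, not analyzed
        doc.add(Field.UnStored("features", text.toString())); // analyzed, indexed, not stored
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/feature-index", new WhitespaceAnalyzer(), true);
        // arbitrary example features; multi-word features would have to be
        // joined somehow (underscore?) for the whitespace trick to work at all
        index(writer, "entity-42",
              Arrays.asList(new String[] { "red", "wine_glass", "hand_blown" }));
        writer.optimize();
        writer.close();
    }
}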
A second thing I am not sure about yet is what the converter for the query side would have to look like. After all, the terms in the query have to have the same form as those in the index; otherwise they wouldn't match. Can I inject my own analyzer only for the query part, so that Lucene hands it the phrases and lets it build features from those phrases? (A rough sketch of what I picture is in the P.S. below.)

Any info is appreciated. I could maybe build my own simple index, since the analyzer is already there, but I would prefer to use a professional solution with a good query language and some additional nice-to-have features. May I use Lucene? :-)

Best wishes,
Daniel
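P.S. On the query side I picture something roughly like this, assuming the query parser lets me hand it an analyzer of my choice. Again only written from memory against the 1.4 API and untested; the class name, index path and field names match the indexing sketch above, and ideally I would plug in my own feature-building analyzer instead of the WhitespaceAnalyzer.

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class FeatureSearcher {

    // Parse the user query with the same analyzer that was used for
    // indexing, run it against the "features" field, and collect the
    // stored entity ids from the matching documents.
    public static List search(String userQuery) throws Exception {
        QueryParser parser = new QueryParser("features", new WhitespaceAnalyzer());
        Query query = parser.parse(userQuery);

        IndexSearcher searcher = new IndexSearcher("/tmp/feature-index");
        Hits hits = searcher.search(query);

        List entityIds = new ArrayList();
        for (int i = 0; i < hits.length(); i++) {
            entityIds.add(hits.doc(i).get("id"));
        }
        searcher.close();
        return entityIds;
    }
}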