From: Jörn Kottmann
Date: Fri, 20 Apr 2012 15:53:22 +0200
To: users@opennlp.apache.org
Subject: Re: Post Address Parsing and OpenNLP
Message-ID: <4F916A52.6010408@gmail.com>
References: <4F9164D5.8040109@gmail.com>

That should work; you might want to include the street and square
descriptors if they are part of the name.
The tags need to be separated by whitespace, otherwise the parser
fails to recognize them.

You will need quite a few samples to train it; 10 or 20 will not be enough.

Jörn

On 04/20/2012 03:45 PM, mauro fraboni wrote:
> I was thinking of training with a file made in this way:
>
> viamassarenti> 300,Bologna,BO,IT>END>.
> piazzamaggiore> 3,Trento,TN,IT>END>
> ............
>
>
> via (meaning street) and piazza (meaning square) are two descriptors
> that, in my opinion, could not be classified.
>
> ciao
>
> On Fri, Apr 20, 2012 at 3:29 PM, Jim - FooBar(); wrote:
>
>> On 20/04/12 14:16, mauro fraboni wrote:
>>
>>> I am investigating whether it is possible to use OpenNLP to parse Italian
>>> post addresses.
>>> I do not want to validate the input address using an official address
>>> database; I just need to divide a single address string into its
>>> individual component parts, and I thought of using NameFinder.
>>> My idea was to train Name Finder using some Italian addresses, indicating
>>> in the training data the parts like Street, Town, Province, Post Code, Country.
>>> Do you think that it can work? Does anyone have experience with it?
>>>
>>> Thanks and ciao.
>>>
>>>
>> Hmmm, that sounds like it should work... however, you don't want to
>> separate your entities into Street, Town, Province, Post Code, Country etc.,
>> because then how are you going to join them to get your 'real' entity
>> (address)? I would say keep the whole address as 1 entity and produce some
>> training data that marks the whole thing... of course, if you already have
>> some training data, that is better; otherwise you will spend a bit of time
>> creating your annotated corpus...
>>
>> My logic says that this is the way to go - maybe I'm wrong in some way...
>> Any different opinions, anyone?
>>
>> Jim
>>
>> ps. In your first sentence did you by any chance mean to say "recognise"
>> instead of "parse"?
>>
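
The whitespace requirement from the reply at the top of this message can be illustrated with a small sketch. It assumes the OpenNLP name finder training layout of `<START:type> tokens <END>`, with one sample per line. The helper names (`annotate`, `tags_are_separated`) and the entity labels (`street`, `number`, `town`, `province`, `country`) are hypothetical and only serve the illustration; they are not taken from the thread or from the OpenNLP API.

```python
import re

# Hypothetical helpers; only the <START:type> ... <END> layout is OpenNLP's.

def annotate(parts):
    """Build one training line from (entity_type, text) pairs.

    An entity_type of None marks plain tokens that stay outside any span.
    """
    chunks = []
    for entity_type, text in parts:
        tokens = " ".join(text.split())  # collapse runs of whitespace
        if entity_type is None:
            chunks.append(tokens)
        else:
            # A space on both sides keeps each tag a token of its own.
            chunks.append(f"<START:{entity_type}> {tokens} <END>")
    return " ".join(chunks)


TAG = re.compile(r"<START:[^>\s]+>|<END>")


def tags_are_separated(line):
    """Return True if every tag is delimited by whitespace or a line edge."""
    for match in TAG.finditer(line):
        before = line[match.start() - 1] if match.start() > 0 else " "
        after = line[match.end()] if match.end() < len(line) else " "
        if not (before.isspace() and after.isspace()):
            return False
    return True


# Assumed labels; the descriptor "via" is kept inside the street span.
line = annotate([
    ("street", "via massarenti"),
    ("number", "300"),
    (None, ","),
    ("town", "Bologna"),
    (None, ","),
    ("province", "BO"),
    (None, ","),
    ("country", "IT"),
])
print(line)
print(tags_are_separated(line))
print(tags_are_separated("<START:street>via massarenti<END>"))
```

Running a check like `tags_are_separated` over a training file before training is one way to catch lines where a tag is glued to a neighbouring token. Commas are emitted as separate tokens here on the assumption that the input is tokenized on whitespace.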