From: Jörn Kottmann
Date: Thu, 29 Sep 2011 10:53:27 +0200
To: user@uima.apache.org
Subject: Re: UIMA- Support for HTML, PDF, Doc files

Hello,

UIMA itself is just a framework for building analysis pipelines. To analyze HTML, PDF, or Word documents, you need a component that can extract the text from these formats.

You can use Apache Tika together with our Tika integration in the addons project to extract text from various data formats.

Jörn

On 9/29/11 8:28 AM, abhishek wrote:
> Hi,
> While reading the documentation of UIMA, I found out that UIMA supports HTML files.
>
> However, when I run the org.apache.uima.tools.docanalyzer.DocumentAnalyzer class, it fails to understand the text.
>
> Kindly let me know the correct way to read these types of files.
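
To illustrate the extraction step described in the reply, here is a minimal sketch of turning HTML into plain text before any analysis happens. It is not the Tika integration from the addons project: to keep the example free of external dependencies, it uses the HTML parser that ships with the JDK (`javax.swing.text.html`) as a stand-in, so it only covers HTML and not PDF or Word. The class name `HtmlTextExtractor` and the method `extractText` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: a stand-in for a real extractor such as Apache Tika.
final class HtmlTextExtractor {

    /** Parses an HTML string and returns only its text content, without tags. */
    static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();

        // The callback receives each run of character data the parser finds;
        // tags and attributes are simply ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        // 'true' tells the parser to ignore any charset declared in the markup,
        // since the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        // Prints the text content with the markup removed.
        System.out.println(extractText(html));
    }
}
```

The extracted string is what would then be handed to a UIMA pipeline as the document text, for example via `CAS.setDocumentText`. With Tika on the classpath, the extraction itself would typically be a call to the `Tika` facade's `parseToString` method instead, which also handles PDF and Word input.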