Message-ID: <5164B499.20205@yahoo.com>
Date: Tue, 09 Apr 2013 17:38:49 -0700
From: Vinayak Borkar
To: vxquery-dev@incubator.apache.org
Subject: Re: [Shepherd] Checking In With A Challenge

Hi Dave,

The goal of VXQuery is to provide a parallel XML data processing system based on the XQuery language (http://www.w3.org/TR/xquery/). VXQuery achieves high performance by using lots of commodity machines to execute parts of a query in parallel, much like Hive does with SQL queries.

In contrast with systems like Hive and Pig, VXQuery is built upon a more flexible data-parallel runtime platform called Hyracks (http://hyracks.googlecode.com).

To process large amounts of EDGAR data, one would start by distributing the XML files across the disks of the machines running VXQuery, in, say, a folder on each machine (/data/edgar).

Q1: The following query would then count the total number of XML files across all the machines:

    count(collection('/data/edgar'))

Q2: To answer your question about finding 10-Ks (sorted newest to oldest) for a company, you would write (I am making up the field names, but EDGAR contains equivalent fields in each document):

    for $doc in collection('/data/edgar')
    where $doc/documentType = '10-K' and $doc/company = 'ACME Inc.'
    order by xs:dateTime($doc/filingDate) descending
    return $doc

In the upcoming release of VXQuery, the only trick in the book to speed up queries will be to scan the data on each machine in parallel, filter and aggregate results locally, and then combine the partial results obtained at each machine. One of the projects planned for the future (the next release, hopefully) is to build an index locally on each machine on various fields of the document, so that Q2 can be answered even more quickly by performing an index lookup instead of scanning all the files.

Hope that gives you a glimpse into XML query processing with VXQuery.

Vinayak

On 4/8/13 5:00 PM, Dave Fisher wrote:
> Hi XVQuery Devs,
>
> I'm your volunteer incubator shepherd for this report. I can see from your activity and report that you seem to again have a need for mentors.
>
> I signed up in part because I was intrigued by the comments regarding querying the filings at sec.gov managed by Edgar. I know a bit about the structure and challenges of this data. XBRL, HTML and PDF documents contain the accounting data with some great variation, etc.
>
> Tell me how VXQuery would a quick performant solution to make a query like latest 10K, Total Revenue, All US Companies? Or all 13D Filings of >%5 ownership of US companies in the last year?
>
> Regards,
> Dave
>
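
The scan-locally-then-combine strategy described in the reply can be illustrated with a small Python sketch. This is not VXQuery or Hyracks code: the in-memory `PARTITIONS` list, the helper names, and the use of a thread pool to stand in for separate machines are all assumptions made for illustration, and the field names are the same made-up ones used in Q2.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

# Hypothetical stand-in for the XML files under /data/edgar on each machine:
# one inner list per machine, one dict per document.
PARTITIONS = [
    [
        {"documentType": "10-K", "company": "ACME Inc.", "filingDate": "2012-03-01T00:00:00"},
        {"documentType": "10-Q", "company": "ACME Inc.", "filingDate": "2012-06-01T00:00:00"},
    ],
    [
        {"documentType": "10-K", "company": "ACME Inc.", "filingDate": "2013-03-01T00:00:00"},
        {"documentType": "10-K", "company": "Other Corp.", "filingDate": "2013-02-15T00:00:00"},
    ],
]


def run_on_all(partitions, task):
    """Run `task` once per partition in parallel and collect the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return list(pool.map(task, partitions))


def q1_count(partitions):
    """Q1: count documents locally on each partition, then sum the partial counts."""
    return sum(run_on_all(partitions, len))


def q2_filings(partitions, doc_type, company):
    """Q2: filter locally on each partition, then merge and sort newest first."""
    def local_filter(docs):
        return [d for d in docs
                if d["documentType"] == doc_type and d["company"] == company]

    partials = run_on_all(partitions, local_filter)
    merged = [doc for part in partials for doc in part]
    return sorted(merged,
                  key=lambda d: datetime.fromisoformat(d["filingDate"]),
                  reverse=True)


if __name__ == "__main__":
    print(q1_count(PARTITIONS))
    for doc in q2_filings(PARTITIONS, "10-K", "ACME Inc."):
        print(doc["filingDate"], doc["company"])
```

In this sketch only the small partial results (a count, or the matching documents) cross the partition boundary, which is the point of filtering and aggregating locally before combining. A per-machine index, as mentioned for a future release, would replace the full scan inside `local_filter` with a lookup.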