ctakes-dev mailing list archives

From Kim Ebert <kim.eb...@perfectsearchcorp.com>
Subject Re: CTAKES mirroring on github.
Date Mon, 18 May 2015 19:25:34 GMT
Here are the top ten files based upon size worth considering.

100644 blob efe7111e6ca3c84e9ba6cf7622f0271c03407255 99069136 dictionary lookup/resources/lookup/umls2011ab/umls.backup
100644 blob efe7111e6ca3c84e9ba6cf7622f0271c03407255 99069136 dictionary lookup/src/main/resources/lookup/umls2011ab/umls.backup
100644 blob 7360ce26c5d37cc81dae83be1593468a4917545c 139521759 clearparser-wrapper/resources/dependency/mayo-dep.jar
100644 blob 7360ce26c5d37cc81dae83be1593468a4917545c 139521759 dependency parser/resources/dependency/mayo-dep.jar
100644 blob 7360ce26c5d37cc81dae83be1593468a4917545c 139521759 dependency parser/src/main/resources/dependency/mayo-dep.jar
100644 blob d785a5bcf608372f273600fa524c3c786fbe76f4 238248287 ctakes-3.1.0/ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
100644 blob d785a5bcf608372f273600fa524c3c786fbe76f4 238248287 ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
100644 blob d785a5bcf608372f273600fa524c3c786fbe76f4 238248287 dependency parser/resources/clearparser_models.jar
100644 blob 89bee2d613aba238824bab97b74df87306483192 410610240 dictionary lookup/resources/lookup/umls2011ab/umls.data
100644 blob 89bee2d613aba238824bab97b74df87306483192 410610240 dictionary lookup/src/main/resources/lookup/umls2011ab/umls.data

I came up with the top ten using the following bash command. I'm sure
there is an easier way to do this, but every Google search I do for
finding the largest files in a git repo turns up half-baked data or
scripts that are difficult to follow.

git branch -a --list | sed 's/*//g' | sed 's/ *//g' | sed 's/remotes//g' \
  | sed 's/^\///g' | sed 's/->origin\/trunk//g' \
  | xargs -I xxx bash -c 'git ls-tree -r -t -l --full-name xxx | sort -n -k 4' \
  | sort -n -k 4 | uniq
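
As a rough sketch of a simpler way to get the same kind of list, the helper below reads `git ls-tree -r -l` output on stdin and keeps one line per blob hash, so a file stored under several paths is reported only once. The function name `largest_blobs` is hypothetical, and like the command above it only sees the trees it is fed (for example branch tips), not the full history.

```shell
# Hypothetical helper (not part of git): reads `git ls-tree -r -l` output on
# stdin and prints the N largest distinct blobs as "<size> <hash> <path>".
# Assumes the usual ls-tree layout: "<mode> <type> <hash> <size>" followed by
# a TAB and the path.
largest_blobs() {
  awk -F '\t' '
    {
      split($1, meta, " ")          # meta: mode, type, hash, size
      if (meta[2] != "blob") next   # skip trees, whose size is "-"
      if (seen[meta[3]]++) next     # keep only the first path per blob hash
      print meta[4], meta[3], $2
    }' | sort -k1,1nr | head -n "${1:-10}"
}
```

It could be used as `git ls-tree -r -l --full-name HEAD | largest_blobs 10`, or fed the concatenated output for several branches.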

I'm not sure it is worthwhile to remove all the resources from the
history; removing just the extremely large ones should bring the git
repo down to a reasonable size.
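
One possible way to pick the "extremely large" candidates is to filter on a size threshold. The sketch below reads `git ls-tree -r -l` output on stdin and prints each distinct blob above a byte limit. The function name `blobs_over_limit` is hypothetical, and the default of 100 * 1024 * 1024 bytes is only an assumption about how the 100MB GitHub limit quoted later in this thread is counted.

```shell
# Hypothetical helper (not part of git): reads `git ls-tree -r -l` output on
# stdin and prints "<size> <path>" for each distinct blob larger than the
# given number of bytes. The default limit (104857600) is an assumption.
blobs_over_limit() {
  awk -F '\t' -v limit="${1:-104857600}" '
    {
      split($1, meta, " ")                 # meta: mode, type, hash, size
      if (meta[2] != "blob") next          # trees report "-" instead of a size
      if (meta[4] + 0 <= limit + 0) next   # keep only blobs above the limit
      if (seen[meta[3]]++) next            # report each blob hash once
      print meta[4], $2
    }'
}
```

For example, `git ls-tree -r -l --full-name HEAD | blobs_over_limit` would list candidates in the current tree, and a different limit can be passed as the first argument.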

IMAT Solutions <http://imatsolutions.com>
Kim Ebert
Software Engineer
Office: 208.971.1509
kim.ebert@imatsolutions.com <mailto:greg.hubert@imatsolutions.com>
On 05/18/2015 12:21 PM, Pei Chen wrote:
> One of the visions behind the *-res projects was to separate out the
> resources from code.  In theory, one can filter out all *-res projects
> from their git repo and pull in any version of the resources from
> maven central...  I won't have enough bandwidth at the moment to try
> it out or work on the git piece though...
> --Pei
>
> On Thu, May 14, 2015 at 1:56 PM, Kim Ebert
> <kim.ebert@perfectsearchcorp.com
> <mailto:kim.ebert@perfectsearchcorp.com>> wrote:
>
>     I've done some investigation into using / working with the git
>     repo for cTAKES, and I found that it is huge. It doesn't work
>     well with GitHub either, as I keep running into timeouts.
>
>     I would like to suggest that we remove two cTAKES build
>     files and the ctakes-gui-0.0.1.zip file. This takes the repo from
>     about 8 GB down to 1.8 GB. It is likely that the reason the git
>     mirror is failing is due to the large size of the repo. GitHub
>     will also filter out some of these very large files, as GitHub's
>     max file size is 100MB.
>
>     git filter-branch --tree-filter 'rm -rf ctakes-gui-0.0.1.zip' origin/cTAKES-GUI-0.0.1
>     git filter-branch -f --tree-filter 'rm -rf _cTAKES_build_/cTAKES-2.5*.zip' origin/maven-sandbox
>     git filter-branch -f --tree-filter 'rm -rf _cTAKES_build_/cTAKES-2.5*.zip' origin/SHARPn-cTAKES
>
>     # Clean out unreferenced objects from repo
>     git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 \
>         -c gc.rerereresolved=0 -c gc.rerereunresolved=0 \
>         -c gc.pruneExpire=now gc
>
>
>     It may also be helpful to remove
>     ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
>     from the git repo as well. (238,248,287 bytes)
>
>     Thoughts?
>
>     IMAT Solutions <http://imatsolutions.com>
>     Kim Ebert
>     Software Engineer
>     Office: 208.971.1509 <tel:208.971.1509>
>     kim.ebert@imatsolutions.com <mailto:greg.hubert@imatsolutions.com>
>     On 05/06/2015 01:17 PM, Steven Bethard wrote:
>>     Yes, I ping this issue every couple months, but no luck so far. (They
>>     take a look each time I ask, but haven't yet pushed a working git
>>     mirror for us.)
>>
>>     Steve
>>
>>     On Tue, May 5, 2015 at 12:09 PM, Kim Ebert
>>     <kim.ebert@perfectsearchcorp.com> wrote:
>>>     Ah, looks like the issue is still being looked into.
>>>
>>>     https://issues.apache.org/jira/browse/INFRA-8553
>>>
>>>     On Mon, May 4, 2015 at 4:54 PM, jay vyas <jayunit100.apache@gmail.com>
>>>     wrote:
>>>
>>>>     Thanks kim.
>>>>
>>>>     Can you file an infra issue ?
>>>>
>>>>     they will look into it.
>>>>
>>>>     I filed one originally
>>>>     On May 4, 2015 6:32 PM, "Kim Ebert" <kim.ebert@perfectsearchcorp.com>
>>>>     wrote:
>>>>
>>>>>     It looks like the github hasn't been updated in a while. Any reason?
>>>>>
>>>>>     Thanks,
>>>>>
>>>>>     Kim
>>>>>
>>>>>     On Tue, Feb 17, 2015 at 10:36 AM, Finan, Sean <
>>>>>     Sean.Finan@childrens.harvard.edu> wrote:
>>>>>
>>>>>>     Our request is for a read-only mirror.  However, if it ever becomes i/o, I
>>>>>>     don't know if this will have what you want, but http://git.apache.org/
>>>>>>     Links to documentation (mostly server setup)
>>>>>>     http://www.apache.org/dev/git.html and a wiki (check toward middle and
>>>>>>     bottom for committer info) https://wiki.apache.org/general/GitAtApache
>>>>>>
>>>>>>
>>>>>>
>>>>>>     -----Original Message-----
>>>>>>     From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>>>>>     Sent: Tuesday, February 17, 2015 12:31 PM
>>>>>>     To: dev@ctakes.apache.org <mailto:dev@ctakes.apache.org>
>>>>>>     Subject: Re: CTAKES mirroring on github.
>>>>>>
>>>>>>     Is there any existing resource to help people who want to use git
>>>>>>     understand the right workflow to contribute to ctakes? (i.e. how this
>>>>>>     interacts with svn repos).
>>>>>>     Tim
>>>>>>
>>>>>>
>>>>>>     On 02/17/2015 12:23 PM, jay vyas wrote:
>>>>>>>     Hi CTakes.  Looks like infra finally got  onto the JIRA i made for
>>>>>>>     this a while back.  They are currently working on fixing a couple of
>>>>>>>     minor glitches w/ the mirroring (not showing all commits)... but there
>>>>>>>     now is a mirror for CTakes on github.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_ctakes&d=BQIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=Heup-IbsIg9Q1TPOylpP9FE4GTK-OqdTDRRNQXipowRLRjx0ibQrHEo8uYx6674h&m=4sEI9mOpkTz6K-DjmNU1s8Do1TGA0_10HqJcowKpDxc&s=fNVbyXzpBLSAG6-DIjBZ1vbMp0JGaX90Lcdzg_EFVvM&e=
>>>>>>>
>
>

