Date: Mon, 31 Jul 2017 20:32:00 +0000 (UTC)
From: "Mike Dusenberry (JIRA)"
To: issues@systemml.apache.org
Reply-To: dev@systemml.apache.org
Subject: [jira] [Updated] (SYSTEMML-1814) Improve slide distribution of the image dataset via improved sampling policy

     [ https://issues.apache.org/jira/browse/SYSTEMML-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Dusenberry updated SYSTEMML-1814:
--------------------------------------
    Summary: Improve slide distribution of the image dataset via improved sampling policy  (was: Improve slide distribution of the image dataset via improved filtering)

> Improve slide distribution of the image dataset via improved sampling policy
> ----------------------------------------------------------------------------
>
>                 Key: SYSTEMML-1814
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1814
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>
> Currently, our models are heavily overfitting to the training dataset. However, further evaluation has shown that this is not the usual overfitting caused by an over-expressive model -- in this case we are employing heavy model freezing (to the point of unfreezing only the final softmax classifier of a pretrained ResNet50).
> Therefore, my evaluation has led me to believe that this is likely due to batch effects in the data, and an examination of the original slide distribution in the sample images dataset has shown a severe imbalance. Note, this is the distribution over the slide from which an image originated, and is distinctly different from the class distribution, which is much more reasonably dispersed.
> {code}
>      slide_num  count
> 0          436      1
> 1          116      1
> 2          468      2
> 3           38      3
> 4          195      4
> 5          173      5
> 6           13      7
> 7          481      8
> 8           83      9
> 9          349     11
> 10         490     15
> 11         292     17
> 12         281     22
> 13         387     26
> 14         326     32
> 15         286     32
> 16          88     39
> 17         477     48
> 18         205     57
> 19         135     58
> 20         127     58
> 21          16     61
> 22         245     66
> 23           5     81
> 24         306     83
> 25         284     91
> 26         263    100
> 27          15    120
> 28         345    124
> 29         380    128
> 30          24    137
> 31         382    150
> 32           1    154
> 33         421    164
> 34         163    169
> 35         278    171
> 36         235    197
> 37         332    197
> 38         343    207
> 39          43    237
> 40         249    246
> 41         113    256
> 42         496    262
> 43         482    264
> 44          86    269
> 45         415    269
> 46         472    326
> 47         422    329
> 48         450    340
> 49         108    348
> 50           3    390
> 51         191    402
> 52         272    474
> 53          85    483
> 54          97    484
> 55         210    508
> 56         293    544
> 57          41    595
> 58         452    613
> 59         220    613
> 60         406    651
> 61          67    665
> 62         260    666
> 63         361    673
> 64         269    684
> 65          50    684
> 66         304    753
> 67         101    769
> 68         433    868
> 69           4    898
> 70         499    915
> 71         145    917
> 72         357    918
> 73         365    940
> 74          82    951
> 75         126    965
> 76         185    965
> 77         164   1077
> 78         221   1086
> 79         165   1111
> 80         316   1129
> 81         350   1132
> 82          89   1162
> 83          19   1169
> 84          74   1206
> 85         132   1248
> 86          47   1278
> 87         188   1297
> 88         459   1312
> 89         368   1337
> 90         335   1368
> 91         225   1373
> 92         234   1378
> 93         487   1385
> 94         247   1464
> 95         427   1476
> 96          65   1492
> 97         402   1500
> 98         315   1557
> 99         201   1604
> 100        344   1607
> 101        273   1616
> 102        146   1623
> 103        341   1636
> 104        425   1640
> 105        182   1681
> 106        403   1682
> 107        275   1690
> 108        457   1717
> 109        448   1724
> 110        277   1729
> 111         70   1740
> 112        141   1747
> 113        264   1777
> 114        122   1880
> 115        319   1915
> 116        449   1951
> 117        104   1988
> 118        377   1993
> 119        285   2008
> 120        107   2084
> 121        410   2141
> 122         11   2148
> 123        367   2153
> 124        416   2162
> 125        311   2183
> 126        338   2206
> 127         51   2233
> 128        153   2255
> 129        144   2285
> 130        497   2358
> 131        218   2364
> 132        330   2376
> 133        308   2392
> 134        213   2480
> 135        454   2512
> 136        103   2567
> 137        446   2569
> 138         40   2622
> 139        251   2629
> 140        149   2632
> 141        455   2633
> 142        430   2669
> 143        262   2715
> 144         76   2737
> 145         18   2748
> 146        178   2763
> 147        383   2864
> 148         54   2871
> 149        223   2908
> 150        207   2931
> 151        486   3043
> 152        391   3099
> 153        342   3104
> 154        390   3116
> 155        276   3136
> 156         75   3141
> 157        181   3171
> 158        142   3213
> 159        414   3255
> 160        137   3276
> 161        295   3285
> 162        358   3315
> 163          7   3322
> 164        323   3327
> 165         71   3334
> 166        243   3344
> 167        120   3359
> 168         48   3371
> 169        434   3387
> 170        206   3404
> 171          9   3460
> 172        476   3467
> 173         32   3472
> 174        491   3496
> 175        444   3502
> 176        279   3530
> 177         59   3546
> 178        174   3556
> 179        464   3595
> 180        392   3633
> 181         99   3677
> 182         72   3682
> 183        347   3779
> 184         28   3804
> 185        314   3807
> 186        322   3809
> 187        492   3823
> 188        258   3824
> 189        230   3831
> 190        354   3887
> 191        346   3951
> 192        445   3963
> 193        209   3969
> 194          8   3986
> 195        443   3988
> 196        290   3993
> 197        118   4025
> 198        152   4026
> 199         56   4078
> 200        170   4131
> 201         84   4146
> 202        413   4150
> 203        447   4171
> 204        417   4193
> 205         60   4210
> 206         92   4265
> 207        374   4281
> 208         94   4307
> 209        161   4360
> 210        320   4408
> 211        114   4451
> 212        219   4480
> 213         90   4518
> 214        233   4528
> 215        396   4596
> 216        157   4661
> 217        117   4696
> 218        337   4724
> 219        202   4819
> 220         34   4827
> 221        105   4840
> 222        155   4841
> 223        176   4895
> 224        166   4966
> 225        456   5031
> 226        254   5085
> 227        475   5184
> 228         42   5221
> 229        172   5330
> 230        299   5358
> 231        473   5364
> 232        131   5369
> 233         61   5382
> 234        379   5470
> 235        355   5488
> 236        372   5496
> 237         53   5503
> 238         17   5523
> 239        495   5529
> 240        190   5536
> 241        451   5583
> 242        177   5630
> 243        123   5649
> 244        231   5686
> 245        217   5692
> 246         33   5742
> 247         55   5767
> 248        388   5786
> 249        318   5819
> 250         81   5838
> 251         62   5846
> 252        255   5854
> 253        485   5890
> 254        375   5928
> 255        156   5938
> 256        224   5945
> 257        267   5970
> 258        412   5987
> 259        136   6038
> 260        160   6055
> 261        240   6084
> 262         39   6093
> 263        469   6100
> 264        300   6167
> 265        183   6178
> 266        250   6195
> 267         49   6231
> 268        471   6251
> 269        334   6283
> 270        265   6422
> 271        407   6468
> 272        252   6472
> 273        466   6478
> 274        227   6528
> 275        102   6550
> 276        458   6653
> 277        140   6667
> 278        133   6668
> 279        493   6716
> 280        465   6729
> 281        370   6751
> 282        244   6772
> 283        216   6772
> 284        488   6773
> 285         95   6777
> 286         52   6788
> 287         57   6821
> 288        289   6846
> 289        362   6939
> 290        180   6944
> 291        324   6961
> 292        211   7012
> 293         73   7034
> 294        301   7094
> 295         23   7106
> 296         64   7169
> 297        420   7182
> 298         36   7219
> 299        376   7257
> 300        484   7265
> 301        253   7275
> 302        470   7312
> 303        460   7405
> 304         98   7425
> 305        302   7427
> 306        393   7435
> 307        159   7554
> 308        237   7564
> 309        274   7701
> 310        359   7769
> 311         68   7779
> 312        483   7829
> 313        151   7910
> 314        186   7948
> 315        442   7952
> 316        259   8049
> 317        246   8128
> 318         96   8129
> 319        271   8176
> 320        438   8190
> 321         87   8197
> 322        162   8226
> 323        489   8260
> 324        418   8312
> 325         31   8504
> 326        179   8532
> 327         79   8578
> 328        226   8600
> 329         27   8719
> 330        479   8862
> 331        268   8883
> 332        404   8908
> 333         46   8913
> 334        437   8961
> 335        147   9047
> 336        189   9164
> 337         20   9242
> 338        386   9356
> 339        435   9376
> 340        432   9495
> 341        408   9505
> 342        248   9509
> 343        462   9619
> 344        229   9774
> 345        193   9835
> 346        167   9871
> 347         69   9894
> 348        130   9954
> 349        327  10072
> 350        369  10078
> 351        106  10180
> 352        194  10212
> 353        325  10306
> 354        312  10344
> 355        303  10502
> 356        184  10655
> 357        463  10916
> 358        426  11055
> 359        283  11334
> 360        328  11450
> 361        129  11467
> 362        288  11806
> 363        124  12010
> 364        171  12250
> 365        121  12257
> 366         22  12276
> 367        423  12310
> 368        192  12313
> 369        378  12358
> 370        307  12366
> 371        143  12678
> 372         80  12899
> 373         66  12920
> 374        208  12970
> 375        158  13131
> 376        148  13423
> 377        119  13723
> 378        317  13830
> 379        395  13834
> 380        187  14003
> 381         25  14856
> 382        399  14905
> 383        478  16145
> 384         93  20009
> 385        215  20723
> {code}
> This task will aim to improve the preprocessing
> algorithm to yield a more even slide distribution in the final image dataset, which will hopefully reduce the batch effects and improve the model's evaluation metrics.
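
One simple way to even out a per-slide distribution like the one in the table is to cap the number of images that any single slide may contribute. The following is only a minimal sketch of that idea, not the sampling policy adopted for this issue: the function name `cap_per_slide`, the cap of 1000, and the synthetic image ids are all assumptions made for illustration. Only the three slide counts are taken from the table.

```python
import random
from collections import Counter, defaultdict


def cap_per_slide(samples, max_per_slide, seed=0):
    """Downsample so that no slide contributes more than `max_per_slide` images.

    `samples` is an iterable of (slide_num, image_id) pairs.
    Hypothetical helper for illustration; not part of SystemML.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible

    # Group image ids by the slide they originated from.
    by_slide = defaultdict(list)
    for slide_num, image_id in samples:
        by_slide[slide_num].append(image_id)

    balanced = []
    for slide_num in sorted(by_slide):
        images = by_slide[slide_num]
        # Slides at or under the cap are kept whole; larger ones are
        # randomly subsampled without replacement.
        if len(images) > max_per_slide:
            images = rng.sample(images, max_per_slide)
        balanced.extend((slide_num, image_id) for image_id in images)
    return balanced


# Counts for three slides from the table; image ids are synthetic.
counts = {436: 1, 38: 3, 215: 20723}
samples = [(slide, i) for slide, n in counts.items() for i in range(n)]

balanced = cap_per_slide(samples, max_per_slide=1000)
per_slide = Counter(slide for slide, _ in balanced)
print(dict(per_slide))  # {38: 3, 215: 1000, 436: 1}
```

A cap narrows the gap between the largest and smallest slides but cannot raise the count of slides that have only a handful of images. Weighting each image inversely to its slide's count at training time would be an alternative, also not confirmed by this issue.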