Updated results table:
Previous XLSR (pre-trained, not fine-tuned) model, eval2 confusion matrix:
All of the pred-happiness / true-sadness errors come from this audio:
194733f4-59f2-43ba-9f57-a140f9e847df_……
1.) Stuttering dataset - 30,013 3 s samples labelled as stuttering events.
2.) Updated validation:
    1.) Robust FP test - all validation samples from the other classes checked with a step of 0.1 s
    2.) Closer to the API - all samples denoised AND diarized, step = 1 s
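A minimal sketch of how the stepped FP check could look, assuming the model is scored on a 3 s window slid over each non-stuttering sample (the window length is an assumption taken from the 3 s dataset samples; `sliding_windows`, `count_false_positives` and the `is_stutter` callback are hypothetical names, not the actual validation code):

```python
from typing import Callable, Iterator, Sequence, Tuple


def sliding_windows(
    num_samples: int,
    sample_rate: int,
    window_s: float = 3.0,  # assumed window length (matches the 3 s dataset samples)
    step_s: float = 0.1,    # 0.1 s for the robust FP test, 1.0 s for the API-like setup
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of every full window in the signal."""
    window = int(round(window_s * sample_rate))
    step = int(round(step_s * sample_rate))
    if window <= 0 or step <= 0:
        raise ValueError("window_s and step_s must be positive")
    for start in range(0, num_samples - window + 1, step):
        yield start, start + window


def count_false_positives(
    signal: Sequence[float],
    sample_rate: int,
    is_stutter: Callable[[Sequence[float]], bool],  # hypothetical model wrapper
    window_s: float = 3.0,
    step_s: float = 0.1,
) -> int:
    """Count windows flagged as stuttering in a sample known to contain none."""
    return sum(
        1
        for start, end in sliding_windows(len(signal), sample_rate, window_s, step_s)
        if is_stutter(signal[start:end])
    )
```

With these assumptions, a 5 s sample at 16 kHz yields 21 windows at step 0.1 s and 3 windows at step 1 s, so the 0.1 s setting gives the model many more chances to produce a false positive on the same audio.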
Additional architecture tests are still running with wav2vec2 small (96M parameters, about the same parameter count as the currently used ResNeXt).
The previously used 487th model shows robustness problems:
Best models with no FPs in the robustness test: