Both models demonstrated high overall performance, with comparable weighted average F1 scores (GPT-4o: 0.9288; Gemini: 0.9350). The models generated consistent predictions for 341 of 385 guideline ...
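For readers unfamiliar with the two quantities reported above, the following is a minimal sketch of how a support-weighted average F1 score and a between-model agreement rate are typically computed. It is not the authors' evaluation code: the function names `weighted_f1` and `agreement_rate` and the toy labels are hypothetical, chosen only for illustration. By the same arithmetic, 341 consistent predictions out of 385 items corresponds to an agreement rate of roughly 0.886.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Each class's F1 is weighted by its share of the true labels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def agreement_rate(preds_a, preds_b):
    """Fraction of items on which two models give the same prediction."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    same = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return same / len(preds_a)


# Toy labels for illustration only; these are NOT data from the study.
y_true = ["A", "A", "B", "B", "C"]
model_1 = ["A", "B", "B", "B", "C"]
model_2 = ["A", "A", "B", "C", "C"]

print(round(weighted_f1(y_true, model_1), 4))
print(round(agreement_rate(model_1, model_2), 4))
print(round(341 / 385, 4))  # agreement implied by the counts quoted above
```

Weighting by class support means frequent classes dominate the average, so a weighted F1 can stay high even when a rare class is predicted poorly.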