Both models demonstrated high overall performance, with comparable weighted average F1 scores (GPT-4o: 0.9288; Gemini: 0.9350). The models generated consistent predictions for 341 of 385 guideline ...
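As a rough illustration of the two quantities reported above, the sketch below computes a weighted-average F1 score and a raw agreement rate. It assumes "weighted average" means the support-weighted mean of per-class F1 scores (the convention used by, for example, scikit-learn's `average="weighted"`); the excerpt does not state the exact definition. The function names and the toy labels are hypothetical and are not the study's data or evaluation code. Only the counts 341 and 385 come from the text.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed definition)."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero because n > 0
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


def agreement_rate(n_agree, n_total):
    """Fraction of items on which two models gave the same prediction."""
    return n_agree / n_total


# Hypothetical toy labels, for illustration only (not data from the study).
toy_true = ["yes", "yes", "no", "no", "no"]
toy_pred = ["yes", "no", "no", "no", "no"]
toy_score = weighted_f1(toy_true, toy_pred)

# Counts taken from the text: consistent predictions for 341 of 385 items.
rate = agreement_rate(341, 385)
print(f"toy weighted F1: {toy_score:.4f}")
print(f"inter-model agreement: {rate:.1%}")
```

Under this reading, 341 of 385 corresponds to a raw agreement of roughly 88.6%. Raw agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa would need the per-class predictions, which the excerpt does not give.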