We conducted a two-phase evaluation. First, we assessed LLMs (GPT o4-mini and Gemini 2.5 Pro) on 1,000 synthetic clinical hematology/oncology vignettes with ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results