Input
image · raw
question (MCQ with 4 options) · raw
system suffix · from paper §C
Output
expected response format · paper template + raw GT
ground-truth answer · raw
ground-truth bounding box · raw (pixel) normalized
evaluation metric · from paper
Acc@50IoU · correct iff answer matches GT AND IoU(predicted_bbox, GT_bbox) ≥ 0.5
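
The Acc@50IoU rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation script. It assumes boxes are given as (x_min, y_min, x_max, y_max) with both boxes in the same coordinate frame, and that "answer matches GT" means a case-insensitive match on the option letter after trimming whitespace. The names `iou`, `is_correct`, and `acc_at_50_iou` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# Assumption: a box is (x_min, y_min, x_max, y_max); pixel or normalized
# units both work as long as prediction and ground truth share them.
Box = Sequence[float]
Sample = Tuple[str, str, Box, Box]  # (pred_answer, gt_answer, pred_box, gt_box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_answer: str, gt_answer: str,
               pred_box: Box, gt_box: Box, thr: float = 0.5) -> bool:
    """A sample counts only if the answer matches AND IoU >= thr."""
    # Assumption: answers are compared as trimmed, upper-cased strings.
    same_answer = pred_answer.strip().upper() == gt_answer.strip().upper()
    return same_answer and iou(pred_box, gt_box) >= thr


def acc_at_50_iou(samples: Iterable[Sample]) -> float:
    """Fraction of samples that satisfy both conditions at IoU >= 0.5."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(is_correct(pa, ga, pb, gb) for pa, ga, pb, gb in samples)
    return hits / len(samples)
```

Because the two conditions are combined with AND, a right answer with a poorly localized box is scored the same as a wrong answer, so the metric is never higher than plain answer accuracy.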
Raw record from multihop_test_4500.json · entry as-is