Comprehensive Ontological Relation Evaluation (CORE)

Paper
HuggingFace Dataset
load_dataset("vaikhari-ai/core")
GitHub
Leaderboard

Aman Gupta 1,Anjali Kumari 1,Ankit Raj Gupta 1,Ankita Keshri1,Bhaskar Singh1,Bipasha Paul1,Chitranshi Tiwari 1,Deepanshu Patel1,Harsh Mishra 1,Himesh Jee Amar1,Kajal Kumari1,Mahi Doshi 1,Muskan Chaudhary 1,Nancy Mittal1,Priyanshu Kumar 1,Rahul Kumar1,Rameesa Azma1,Rasi Shil1,Vineet Kumar 1,Warisha Quatil 1,Subhash Bharti2,Achala C3,Ananya R Naik3,Ananyabm3,Anjali Ajith 3,Ankitha Ks3,Ashitha 3,Ayesha Banu3,B.Tanmayi3,Basavasiri H L 3,Bhagyashree Hokrani 3,Bhoomika.P3,Chinmayi Mohan 3,Dhanalakshmi.N3,Divya Kn3,E.Sai Sruthi3,Hanasi Matada Eshwari 3,Harini Nayaka Gm 3,Harshitha R3,Jeevika Ks 3,Keerthana B3,Lisha S Kumar 3,M.R.Meghana3,Manogna Keshav3,Manya. E. A3,Meghana M3,Megharani3,Mohammed Ayesha Tahreem3,Nanditha M 3,Nithyashree.H 3,Punyashree P R3,R Veena3,Rakshitha M3,Ruchitha S3,Sahana.D.S3,Sai Pallavi 3,Sameeksha S3,Sandhya R 3,Sanika 3,Sanjana G Rao 3,Shafna Ms 3,Sharanya S Prasad 3,Shravya.H3,Shruthi Reddy3,Sinchana3,Sinchana S3,Siri N Murthy3,Siri Patel M 3,Sk Nikitha Reddy 3,Snehaganga N S3,Sowndarya B3,Spoorthi U3,Subhangi Dutta3,Syeda Saneen 3,Tammisetty Harini3,Thanushree M R 3,Thanushree S T 3,Varsha Suresh3,Varshitha S 3,A Vijay Aditya4,Aasish4,Aayush Bhat4,Abhijeet Singh4,Abhishek Chauhan4,Abhishek Kumar Maurya4,Abhyudaya4,Adarsh Kumar Gupta 4,Addagalla Lakshmi Sowjanya4,Adi Akhilesh Singh4,Aditi Gupta4,Aditya Prakash 4,Aditya R Jadhav4,Aditya Raj4,Aditya Singh4,Aishwarya Agnihotri4,Ajay Patel4,Ajay Singh4,Akanksha Singh4,Akshita Ravichndran4,Akula Manasa 4,Allu Deekshita4,Aman Kumar4,Aman Kumar Yadav4,Amardeep Jarwal4,Amit Negi 4,Amit Singh 4,Angraj Shah4,Anjali Kumari4,Anjaneya Raj Garg4,Ankit Prakash4,Ankit Sinha4,Anshu Kumar Ram4,Anshu Yadav4,Anupurba Dhara4,Anurag Kamboj4,Anushka Choudhary4,Arushi Gupta4,Aryan4,Aryan Parihar4,Ashutosh Singh4,Atmadeep Bhattacharya4,Avanish Dhapare4,Awaneesh Kumar Pandey4,Ayush Barot 4,Ayush Kumar4,Ayush Mondal4,Ayush Sharma4,Ayush Tripathi4,Banoth Nandineshwar4,Bellala Mukesh4,Bhanu Verma4,Bhavya Singh 
4,Bhupendra Yadav4,Bommadi Mukesh Kumar Reddy4,Brijesh Kumar4,Chaudhary Digvijay Daniel Singh4,Check__Email4,Chelsi Narang4,Chennadi Pavan Sainath Reddy4,Chivukula Sri Eswar Balaji4,Deen Dayal Prajapati4,Deepaprakash K4,Deepjyoti Rabha4,Dhruvi Rajeshbhai Mahyavanshi4,Dipti Gupta4,Divyanshu Yadav4,Durgam Arun Kumar4,Faiz Aman4,Farah Adiba4,Fizaan Khan4,Ganesh Sakkarge4,Ganguly Singh4,Gaurish Maheshwari4,H Poojan4,Happy Kannaujiya4,Harsh4,Harsh Kadiyan4,Harsh Kumar4,Harsh Vardhan4,Harshit Virmani4,Harshita Rajput4,Harshvardhan Goplani4,Hrishabh Deshmukh4,Ishika Saini4,Jagat Jyoti Sarkar4,Jain Aditya Avinash4,Jayesh Sewlani4,Kali Chopra4,Kalyanam Pranay4,Kamal4,Kanukollu Sateesh Kumar 4,Karishma Santani4,Kartikeya Pandey4,Kaushik Kumar4,Kishore Nayak D4,Kolgane Sanskruti Sanjay4,Komal Bhalotia4,Kratika Maheshwari4,Kritarth4,Kshitij Kumar4,Kumar Pundareekaksh4,Kumar Shubham4,Kushagr Kapoor4,Lalit Tolani4,Lalithya C 4,M Balasubramanian4,Madhur Vilas Bahadure4,Malipatel Sravan Kumar Reddy4,Manas Jayaswal4,Manav Gangwar 4,Manish Kumar4,Manisha Bishnoi4,Mannuru Venkateswarlu4,Mayank Agrawal 4,Moksh4,Motilal Bhatra4,Mradul Misra4,Mridul Gupta4,Mrinal Jain4,Naitik Jain4,Nakshatra Shivhare 4,Neelam Meena4,Nida Rahma.A4,Nikhil Deshmukh4,Nikita Gupta4,Nimish Thakur 4,Nitesh Singh4,Nitin Agrawal4,Nitish Kumar4,Ojasvi Tripathi4,Ojaswi Pandey4,Om Abhishek4,Om Shankar4,Omkaran 4,Pakhi Awasthi 4,Parteek Yadav4,Pavani Gupta 4,Pooja Yadav4,Prabhankur4,Prafull Kumar Deepak4,Prakhar Nema4,Pranjali Yadav4,Prashant Nautiyal4,Prathu Tripathi 4,Priti Kumari4,Priyadarshi Annupam4,Priyanshu Tiwari4,Purnima Singh4,Rachit Mittal4,Ragula Eeshareddy4,Rahul Kumar Sonkar4,Rajat Varshney4,Ramgopal Verma4,Raskar Aniket Dattatray4,Ratnesh Kumar Sharma 4,Ravi Kumar4,Rishabh Singh4,Rishabh Yadav4,Rishi Mishra4,Rishi Soni4,Rishit Pal4,Ritesh Soni4,Ritik Rai4,Ritik Raj 4,Ritik Raushan4,Rituraj Barai4,Rohan Sharma4,Rohit Pandey4,Rohit Prasad4,Saarang Kumar4,Sagar Sachan4,Sahil Shekhar4,Saksham 
Goel4,Samarth Jain4,Samir Kumar 4,Sammit Dhar4,Sampat Meena4,Sapavat Sravan4,Saptarshi Chakraborty4,Sarthak Shewale4,Saurabh Kumar4,Shashank Kumar4,Sheetal Nagar4,Shikha Kaloniya 4,Shivangi Gupta4,Shivansh Gupta4,Shivanshu Kumar4,Shreyam Chaurasia 4,Shreyansh Singh4,Shrija Tiwary4,Shubham Kumar4,Shubham Patel4,Shubhendra Taneja4,Siddhant Bhardwaj4,Siddharth Prakash4,Soham Abhay Kadam4,Sonali Singh4,Sonu Sourabh4,Sourashis Das4,Soustab Haldar4,Sparsh Gupta 4,Srajan Seth4,Srishti Jaiswal4,Sudhanshu Ranjan4,Suharsh Sonkar 4,Suman Kumar4,Sunanda Pandey4,Surkanti Harshitha Reddy4,Sushank 4,Swapnil Wakankar 4,Tanay Ahir4,Tanish Jangir 4,Tanishka Nama4,Tanishq Gupta4,Tarani Mishra4,Tejavath Sudhakar4,Tushar Sarda4,Udeechi Srivastav4,Utkarsh Srivastava4,Vaddadi Lakshmi Sri Sai Srinivas4,Vadithya Rajagopal4,Vaibhav Jain4,Vaibhav Saini4,Vanshika4,Vedant Bhoruka4,Vijay Kumar 4,Vikash Kumar4,Vineet Tyagi4,Vipul Bharti4,Vishal4,Vishisht Dubey4,Vishnu Katara4,Vishvender Pachaar4,Vivek Kumar4,Yash Agarwal4,Yash Sachan4,Aadithya Balachandran5,Abhay5,Aditya Sharma5,Akshay5,Barza A K5,Bhishen Kumar Sahu5,Chinmayee Mohapatra5,Dhairya Yadav5,Divyanka Swarna5,Hariprasad Doley5,Karthika P5,Khushi5,Lugai Kamei5,Manasi Anil Lamsoge5,Mojum Kamduk5,Neeraj N Shetty5,Panduru Tanisha5,Rohit B Sharma5,Sai Sudeep Das5,Sara Singh5,Sharon Valui5,Sheersha Roy5,Shivang Jaiswal 5,Shweta Umrankar5,Soumya Jain5,Sumayya Ayesha5,Suvrojit Nath5,Tanisha5,Vanshika Gupta5,Zitaksyor Sonowal5,Mohima Narzary6,Pratiksha Rabha6,Ruba Das6,Shruti Dekaraja6,Yuktashree Hazarika6

Affiliations

1 BHU, 2 Galgotias University, 3 GSSSIETW, 4 IIT BHU, 5 IIT Delhi, 6 Tezpur University

Introduction

Large Language Models (LLMs) have demonstrated strong performance across a broad range of reasoning benchmarks. However, existing evaluations largely emphasize recognition of a limited subset of semantic relations, leaving two fundamental capabilities insufficiently examined: (a) robust reasoning across a comprehensive set of sense-level relations, and (b) reliable identification of cases in which no meaningful semantic relation exists between concepts. Both are essential for dependable reasoning, yet the latter, in particular, remains largely unmeasured.

We introduce Comprehensive Ontological Relation Evaluation (CORE), a large-scale dataset comprising 225K multiple-choice questions (MCQs) spanning 74 disciplines, including STEM, Humanities, and Social Sciences. A portion of CORE is released for fine-tuning and instruction tuning, while the remainder is reserved for evaluation to mitigate contamination and overfitting. From this larger corpus, we release a general-domain evaluation subset, referred to as the CORE benchmark, designed to evaluate LLMs across 24 semantic relation types with equal representation of related and unrelated concept pairs.

The CORE benchmark consists of 203 rigorously validated MCQs, each accompanied by human-authored explanations. A large-scale human baseline constructed from responses by over 1,000 participants achieves 92.6% accuracy, establishing a strong empirical reference. In contrast, 29 state-of-the-art models achieve only 48.25–70.9% overall accuracy: performance is near ceiling on related pairs (89.4–100%) but degrades severely on unrelated pairs (0–41.35%). These results indicate that failures in unrelatedness reasoning substantially constrain overall model reliability. CORE enables systematic measurement of this limitation and supports analysis of semantic reasoning robustness.

Dataset

CORE is a large-scale dataset comprising 225K MCQs across 74 academic disciplines, constructed to support training, evaluation, and diagnostic analysis of relational reasoning. To reduce evaluation contamination, only a subset is released for fine-tuning, while the remainder is reserved exclusively for benchmarking.

From this corpus, we release the CORE benchmark, a general-domain evaluation set targeting 24 semantic relation types, with explicit balance between relational and unrelated concept pairs. The benchmark contains 203 MCQs (102 open-choice, 101 blind-choice) designed to evaluate pairwise relational and analogical reasoning. Each instance includes a gold-standard answer and a human-authored explanation. Relational instances are uniformly distributed across ontological and sense-level categories to prevent skew toward specific relation types.

Question Format

Each question follows the analogical reasoning format: a reference concept pair (A:B) and an incomplete target pair (C:?), with four completion options. Questions employ everyday vocabulary, enabling evaluation of general semantic reasoning.
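
As a minimal sketch of the format described above, one question could be represented as follows. The class name, field names, and the four options (including which one is marked correct) are hypothetical and chosen only for illustration; they are not the released dataset schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalogyQuestion:
    """One A:B :: C:? item with four completion options (hypothetical schema)."""
    relation: str                        # relation type, e.g. "agent-instrument"
    a: str                               # first term of the reference pair
    b: str                               # second term of the reference pair
    c: str                               # first term of the incomplete target pair
    options: tuple[str, str, str, str]   # the four candidate completions
    answer_index: int                    # position of the gold answer in `options`

    def prompt(self) -> str:
        """Render the question stem followed by lettered options."""
        stem = f"{self.a.capitalize()} is to {self.b} as {self.c} is to?"
        lines = [stem] + [f"{label}. {opt}" for label, opt in zip("ABCD", self.options)]
        return "\n".join(lines)

    def is_correct(self, choice: str) -> bool:
        """Check a lettered choice ("A".."D", case-insensitive) against the gold answer."""
        return "ABCD".index(choice.strip().upper()) == self.answer_index


# Illustrative item: the stem comes from the examples on this page,
# but the options and the gold answer below are invented.
example = AnalogyQuestion(
    relation="agent-instrument",
    a="farmer", b="tractor", c="mechanic",
    options=("wrench", "garden", "recipe", "river"),
    answer_index=0,
)
print(example.prompt())
```

Keeping the relation type on each item makes it straightforward to group results by relation later.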

Example Questions

agent-instrument

Farmer is to tractor as mechanic is to?

antonymy-complementary

Awake is to asleep as correct is to?

Human Validation and Baseline

Answers and explanations for the CORE benchmark were initially developed for 250 questions and validated through a three-pass expert review process. The final benchmark comprises the 203 questions for which perfect inter-annotator agreement was achieved (Cohen’s κ = 1.0), which ensures a reliable ground truth. Each question includes a human-authored explanation of why the correct answer instantiates the target relationship.
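
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's κ for two annotators, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. It assumes two parallel lists of answer labels. The function name and toy labels are illustrative, and this is not the authors' validation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels: identical answers on every item give kappa = 1.0.
print(cohens_kappa(["A", "B", "C", "D"], ["A", "B", "C", "D"]))
```

When both annotators choose the same answer on every item, p_o = 1 and κ = 1.0, which is the retention condition described above.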

Subsequently, a human baseline was constructed using responses from over 1,000 participants in India, spanning undergraduate to postdoctoral education levels, who completed the benchmark under blind evaluation conditions. The human baseline achieves 92.6% overall accuracy.

Metrics

We evaluate models using accuracy, Expected Calibration Error (ECE), Overconfidence Error Rate, and Semantic Collapse Rate, capturing both correctness and the alignment between confidence and reasoning quality.
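
A minimal sketch of Expected Calibration Error, using the common equal-width binning formulation. The bin count of 10 and the function name are assumptions; this page does not specify the binning used for the reported numbers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its calibration gap, weighted by its share of all items.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data: four answers at 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

An ECE of 0 means stated confidence matches observed accuracy in every bin; larger values indicate a wider gap.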

Results

We evaluate 29 state-of-the-art LLMs, with a cutoff date of January 22, 2026. The evaluation covers models from major developers, including the GPT series, the Llama family, and Claude models. All models were evaluated using a uniform prompting and evaluation protocol to ensure fair comparison.

Confidence Accuracy Inversion

Overall accuracy ranges from ≈48.25% to 70.9% across the 29 models. However, this masks a dramatic bifurcation between related and unrelated pairs. Despite accuracy differences of 40–80 percentage points between the two, models report nearly identical confidence on both (related: ≈93–95%, unrelated: ≈91–94%). This confidence–accuracy inversion undermines the utility of confidence as a reliability signal, preventing downstream decision systems from appropriately weighting model outputs.
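
The comparison above can be sketched as a per-group summary of accuracy and mean confidence. The record layout `(is_related, is_correct, confidence)` and the toy values are assumptions for illustration only, not data from the evaluation.

```python
def summarize_by_group(records):
    """Accuracy, mean confidence, and their gap for related vs. unrelated pairs.

    `records` is an iterable of (is_related, is_correct, confidence) tuples.
    """
    groups = {"related": [], "unrelated": []}
    for is_related, is_correct, confidence in records:
        groups["related" if is_related else "unrelated"].append((is_correct, confidence))
    summary = {}
    for name, rows in groups.items():
        if not rows:
            continue  # skip a group that has no items
        accuracy = sum(1 for ok, _ in rows if ok) / len(rows)
        mean_conf = sum(conf for _, conf in rows) / len(rows)
        # A large positive gap means the model is far more confident than it is accurate.
        summary[name] = {
            "accuracy": accuracy,
            "mean_confidence": mean_conf,
            "gap": mean_conf - accuracy,
        }
    return summary


# Toy records (invented values).
toy = [(True, True, 0.95), (True, True, 0.95), (False, False, 0.9), (False, True, 0.9)]
print(summarize_by_group(toy))
```

Similar mean confidence in both groups alongside very different accuracy is the pattern described above as confidence–accuracy inversion.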

On the full 225K-question CORE dataset, model accuracy drops to approximately 2% across domains, reflecting further degradation on domain-specific relational reasoning.

Calibration

ECE is 2–4x higher on unrelated pairs than on related pairs. Overconfidence Error Rate (errors with confidence ≥0.75) ranges from 29.1% to 51.75%, meaning roughly one-third to one-half of errors on unrelated pairs occur with high confidence, making them particularly dangerous in deployment contexts.
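
A minimal sketch of the Overconfidence Error Rate, assuming it is the share of incorrect answers given with confidence at or above 0.75. Using errors as the denominator follows the "one-third to one-half of errors" reading above; the exact definition is in the paper, and the function name is illustrative.

```python
def overconfidence_error_rate(confidences, correct, threshold=0.75):
    """Fraction of incorrect answers whose confidence is at or above `threshold`."""
    # Keep only the confidences attached to wrong answers.
    error_confidences = [c for c, ok in zip(confidences, correct) if not ok]
    if not error_confidences:
        return 0.0  # no errors, so no overconfident errors
    high_confidence_errors = sum(1 for c in error_confidences if c >= threshold)
    return high_confidence_errors / len(error_confidences)


# Toy data: three errors, two of them at confidence >= 0.75.
print(overconfidence_error_rate([0.9, 0.8, 0.6, 0.95], [False, False, False, True]))
```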

Model Frontier

Figure: trade-off between accuracy and confidence calibration. Models in the upper left are the most reliable.

Robustness Analysis

Semantic robustness: Models above the line distinguish related from unrelated concepts better.

Semantic Collapse

Semantic Collapse Rate (the proportion of unrelated pairs misclassified as having a relation) averages 37.6% across models, substantially lower than the 75% error rate expected from random guessing among four options. This indicates that models do not fail through random guessing but through systematic generation of false relational structures.
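
A minimal sketch of how this rate and the random-guessing reference could be computed. It assumes each unrelated item has a gold answer marking the absence of a relation, represented here by the hypothetical label `"none"`; the benchmark's actual answer encoding may differ.

```python
NO_RELATION = "none"  # hypothetical label for "no meaningful relation"


def semantic_collapse_rate(gold, predicted, no_relation=NO_RELATION):
    """Share of unrelated items for which the model asserted some relation."""
    # Keep only predictions on items whose gold answer is "no relation".
    unrelated_predictions = [p for g, p in zip(gold, predicted) if g == no_relation]
    if not unrelated_predictions:
        raise ValueError("no unrelated items in the gold labels")
    collapsed = sum(1 for p in unrelated_predictions if p != no_relation)
    return collapsed / len(unrelated_predictions)


def random_guess_error(n_options=4):
    """Expected error rate when picking uniformly among n_options choices."""
    return 1 - 1 / n_options


# Toy labels (invented): four unrelated items, two answered with a spurious relation.
gold = ["none", "none", "none", "none", "wrench"]
pred = ["none", "hammer", "river", "none", "wrench"]
print(semantic_collapse_rate(gold, pred), random_guess_error())
```

With four options, uniform guessing is wrong three times out of four, which gives the 75% reference figure.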

Analysis

Across all 29 models, we observe a universal and consistent asymmetry: near-perfect performance on recognized relations and systematic failure on unrelated pairs. Median accuracy exceeds 97% for standard relations but drops to 11% for unrelated pairs. Notably, models express nearly identical confidence in both cases (≈92–94%), despite accuracy differences of 40–80 percentage points, resulting in severe confidence inversion.

Error analysis shows that failures are not random. Instead, models actively generate spurious relational structures, reflected in a mean Semantic Collapse Rate of 37.6%. Performance across individual relation types is otherwise stable and near-ceiling, making the unrelated category a clear outlier. Difficulty stratification further shows that unrelated pairs remain challenging regardless of human-defined difficulty, suggesting a qualitative limitation rather than increased task complexity.

Discussion

CORE isolates unrelatedness reasoning as a distinct and previously under-evaluated capability of LLMs. Despite strong performance on recognized relations, models consistently fail to identify the absence of semantic relations while maintaining high confidence. This cross-cutting consistency suggests that the failure reflects a shared limitation in how current models and training pipelines handle relation absence, potentially arising from a combination of architectural inductive biases, task formulation, and the availability of appropriate training signals.

Implications

The combination of high confidence and low accuracy on unrelated pairs presents challenges for deploying language models in reasoning-dependent settings. When models confidently construct spurious semantic relationships, downstream systems may treat unsupported inferences as reliable signals.

In healthcare, such behaviour may surface high-confidence associations driven by confounding rather than causation, potentially influencing clinical decision-making. In financial contexts, models may assign undue significance to coincidental correlations, increasing exposure to risk. Legal and scientific applications face similar concerns, where plausible but incorrect relational reasoning may affect legal arguments or research prioritization.

Contribution

CORE provides a benchmark for measuring, diagnosing, and tracking unrelatedness reasoning failures. The results motivate architectural and training strategies that explicitly model relation negation and uncertainty, as well as evaluation protocols that account for confident failure on unrelated inputs.

Work in Progress

Preliminary multilingual experiments indicate even larger performance gaps in non-English languages, motivating extension of CORE to low-resource languages and broader multilingual evaluation.

Preliminary fine-tuning experiments on the full 225K-question CORE dataset show improvements in relational and general reasoning, which we plan to analyze and report in future work.

Limitations

The current benchmark and results are primarily limited to English. The task formulation uses a closed multiple-choice format, which may differ from free-form generation settings. CORE is text-only, and results may not generalize to multimodal reasoning. These represent directions for future research.

Citation

@misc{dwivedi2026corecomprehensiveontologicalrelation,
      title={CORE: Comprehensive Ontological Relation Evaluation for Large Language Models},
      author={Satyam Dwivedi and Sanjukta Ghosh and Shivam Dwivedi and Nishi Kumari and Anil Thakur and Anurag Purushottam and Deepak Alok and Praveen Gatla and Manjuprasad B and Bipasha Patgiri},
      year={2026},
      eprint={2602.06446},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06446},
}