
Comprehensive Ontological Relation Evaluation (CORE)

Paper
HuggingFace Dataset
load_dataset("vaikhari-ai/core")
GitHub
Leaderboard

Satyam Dwivedi1, Nishi Kumari1, Shivam Dwivedi8, Anurag Purushottam8, Sanjukta Ghosh5, Anil Kumar Thakur5, Deepak Alok6, Praveen Gatla2, Manjuprasad B4, Bipasha Patgiri7

1 Vaikhari AI, 2 BHU, 3 Galgotias University, 4 GSSSIETW, 5 IIT BHU, 6 IIT Delhi, 7 Tezpur University, 8 External Advisor

Aman Gupta 2,Anjali Kumari 2,Ankit Raj Gupta 2,Ankita Keshri2,Bhaskar Singh2,Bipasha Paul2,Chitranshi Tiwari 2,Deepanshu Patel2,Harsh Mishra 2,Himesh Jee Amar2,Kajal Kumari2,Mahi Doshi 2,Muskan Chaudhary 2,Nancy Mittal2,Priyanshu Kumar 2,Rahul Kumar2,Rameesa Azma2,Rasi Shil2,Vineet Kumar 2,Warisha Quatil 2,Subhash Bharti3,Achala C4,Ananya R Naik4,Ananyabm4,Anjali Ajith 4,Ankitha Ks4,Ashitha 4,Ayesha Banu4,B.Tanmayi4,Basavasiri H L 4,Bhagyashree Hokrani 4,Bhoomika.P4,Chinmayi Mohan 4,Dhanalakshmi.N4,Divya Kn4,E.Sai Sruthi4,Hanasi Matada Eshwari 4,Harini Nayaka Gm 4,Harshitha R4,Jeevika Ks 4,Keerthana B4,Lisha S Kumar 4,M.R.Meghana4,Manogna Keshav4,Manya. E. A4,Meghana M4,Megharani4,Mohammed Ayesha Tahreem4,Nanditha M 4,Nithyashree.H 4,Punyashree P R4,R Veena4,Rakshitha M4,Ruchitha S4,Sahana.D.S4,Sai Pallavi 4,Sameeksha S4,Sandhya R 4,Sanika 4,Sanjana G Rao 4,Shafna Ms 4,Sharanya S Prasad 4,Shravya.H4,Shruthi Reddy4,Sinchana4,Sinchana S4,Siri N Murthy4,Siri Patel M 4,Sk Nikitha Reddy 4,Snehaganga N S4,Sowndarya B4,Spoorthi U4,Subhangi Dutta4,Syeda Saneen 4,Tammisetty Harini4,Thanushree M R 4,Thanushree S T 4,Varsha Suresh4,Varshitha S 4,A Vijay Aditya5,Aasish5,Aayush Bhat5,Abhijeet Singh5,Abhishek Chauhan5,Abhishek Kumar Maurya5,Abhyudaya5,Adarsh Kumar Gupta 5,Addagalla Lakshmi Sowjanya5,Adi Akhilesh Singh5,Aditi Gupta5,Aditya Prakash 5,Aditya R Jadhav5,Aditya Raj5,Aditya Singh5,Aishwarya Agnihotri5,Ajay Patel5,Ajay Singh5,Akanksha Singh5,Akshita Ravichndran5,Akula Manasa 5,Allu Deekshita5,Aman Kumar5,Aman Kumar Yadav5,Amardeep Jarwal5,Amit Negi 5,Amit Singh 5,Angraj Shah5,Anjali Kumari5,Anjaneya Raj Garg5,Ankit Prakash5,Ankit Sinha5,Anshu Kumar Ram5,Anshu Yadav5,Anupurba Dhara5,Anurag Kamboj5,Anushka Choudhary5,Arushi Gupta5,Aryan5,Aryan Parihar5,Ashutosh Singh5,Atmadeep Bhattacharya5,Avanish Dhapare5,Awaneesh Kumar Pandey5,Ayush Barot 5,Ayush Kumar5,Ayush Mondal5,Ayush Sharma5,Ayush Tripathi5,Banoth Nandineshwar5,Bellala Mukesh5,Bhanu Verma5,Bhavya Singh 5,Bhupendra Yadav5,Bommadi Mukesh Kumar Reddy5,Brijesh Kumar5,Chaudhary Digvijay Daniel Singh5,Check__Email5,Chelsi Narang5,Chennadi Pavan Sainath Reddy5,Chivukula Sri Eswar Balaji5,Deen Dayal Prajapati5,Deepaprakash K5,Deepjyoti Rabha5,Dhruvi Rajeshbhai Mahyavanshi5,Dipti Gupta5,Divyanshu Yadav5,Durgam Arun Kumar5,Faiz Aman5,Farah Adiba5,Fizaan Khan5,Ganesh Sakkarge5,Ganguly Singh5,Gaurish Maheshwari5,H Poojan5,Happy Kannaujiya5,Harsh5,Harsh Kadiyan5,Harsh Kumar5,Harsh Vardhan5,Harshit Virmani5,Harshita Rajput5,Harshvardhan Goplani5,Hrishabh Deshmukh5,Ishika Saini5,Jagat Jyoti Sarkar5,Jain Aditya Avinash5,Jayesh Sewlani5,Kali Chopra5,Kalyanam Pranay5,Kamal5,Kanukollu Sateesh Kumar 5,Karishma Santani5,Kartikeya Pandey5,Kaushik Kumar5,Kishore Nayak D5,Kolgane Sanskruti Sanjay5,Komal Bhalotia5,Kratika Maheshwari5,Kritarth5,Kshitij Kumar5,Kumar Pundareekaksh5,Kumar Shubham5,Kushagr Kapoor5,Lalit Tolani5,Lalithya C 5,M Balasubramanian5,Madhur Vilas Bahadure5,Malipatel Sravan Kumar Reddy5,Manas Jayaswal5,Manav Gangwar 5,Manish Kumar5,Manisha Bishnoi5,Mannuru Venkateswarlu5,Mayank Agrawal 5,Moksh5,Motilal Bhatra5,Mradul Misra5,Mridul Gupta5,Mrinal Jain5,Naitik Jain5,Nakshatra Shivhare 5,Neelam Meena5,Nida Rahma.A5,Nikhil Deshmukh5,Nikita Gupta5,Nimish Thakur 5,Nitesh Singh5,Nitin Agrawal5,Nitish Kumar5,Ojasvi Tripathi5,Ojaswi Pandey5,Om Abhishek5,Om Shankar5,Omkaran 5,Pakhi Awasthi 5,Parteek Yadav5,Pavani Gupta 5,Pooja Yadav5,Prabhankur5,Prafull Kumar Deepak5,Prakhar Nema5,Pranjali Yadav5,Prashant Nautiyal5,Prathu Tripathi 5,Priti 
Kumari5,Priyadarshi Annupam5,Priyanshu Tiwari5,Purnima Singh5,Rachit Mittal5,Ragula Eeshareddy5,Rahul Kumar Sonkar5,Rajat Varshney5,Ramgopal Verma5,Raskar Aniket Dattatray5,Ratnesh Kumar Sharma 5,Ravi Kumar5,Rishabh Singh5,Rishabh Yadav5,Rishi Mishra5,Rishi Soni5,Rishit Pal5,Ritesh Soni5,Ritik Rai5,Ritik Raj 5,Ritik Raushan5,Rituraj Barai5,Rohan Sharma5,Rohit Pandey5,Rohit Prasad5,Saarang Kumar5,Sagar Sachan5,Sahil Shekhar5,Saksham Goel5,Samarth Jain5,Samir Kumar 5,Sammit Dhar5,Sampat Meena5,Sapavat Sravan5,Saptarshi Chakraborty5,Sarthak Shewale5,Saurabh Kumar5,Shashank Kumar5,Sheetal Nagar5,Shikha Kaloniya 5,Shivangi Gupta5,Shivansh Gupta5,Shivanshu Kumar5,Shreyam Chaurasia 5,Shreyansh Singh5,Shrija Tiwary5,Shubham Kumar5,Shubham Patel5,Shubhendra Taneja5,Siddhant Bhardwaj5,Siddharth Prakash5,Soham Abhay Kadam5,Sonali Singh5,Sonu Sourabh5,Sourashis Das5,Soustab Haldar5,Sparsh Gupta 5,Srajan Seth5,Srishti Jaiswal5,Sudhanshu Ranjan5,Suharsh Sonkar 5,Suman Kumar5,Sunanda Pandey5,Surkanti Harshitha Reddy5,Sushank 5,Swapnil Wakankar 5,Tanay Ahir5,Tanish Jangir 5,Tanishka Nama5,Tanishq Gupta5,Tarani Mishra5,Tejavath Sudhakar5,Tushar Sarda5,Udeechi Srivastav5,Utkarsh Srivastava5,Vaddadi Lakshmi Sri Sai Srinivas5,Vadithya Rajagopal5,Vaibhav Jain5,Vaibhav Saini5,Vanshika5,Vedant Bhoruka5,Vijay Kumar 5,Vikash Kumar5,Vineet Tyagi5,Vipul Bharti5,Vishal5,Vishisht Dubey5,Vishnu Katara5,Vishvender Pachaar5,Vivek Kumar5,Yash Agarwal5,Yash Sachan5,Aadithya Balachandran6,Abhay6,Aditya Sharma6,Akshay6,Barza A K6,Bhishen Kumar Sahu6,Chinmayee Mohapatra6,Dhairya Yadav6,Divyanka Swarna6,Hariprasad Doley6,Karthika P6,Khushi6,Lugai Kamei6,Manasi Anil Lamsoge6,Mojum Kamduk6,Neeraj N Shetty6,Panduru Tanisha6,Rohit B Sharma6,Sai Sudeep Das6,Sara Singh6,Sharon Valui6,Sheersha Roy6,Shivang Jaiswal 6,Shweta Umrankar6,Soumya Jain6,Sumayya Ayesha6,Suvrojit Nath6,Tanisha6,Vanshika Gupta6,Zitaksyor Sonowal6,Mohima Narzary7,Pratiksha Rabha7,Ruba Das7,Shruti Dekaraja7,Yuktashree Hazarika7

Affiliations

2 BHU, 3 Galgotias University, 4 GSSSIETW, 5 IIT BHU, 6 IIT Delhi, 7 Tezpur University

Introduction

Large Language Models (LLMs) have demonstrated strong performance across a broad range of reasoning benchmarks. However, existing evaluations largely emphasize recognition of a limited subset of semantic relations, leaving two fundamental capabilities insufficiently examined: (a) robust reasoning across a comprehensive set of sense-level relations, and (b) reliable identification of cases in which no meaningful semantic relation exists between concepts. Both are essential for dependable reasoning, yet the latter, in particular, remains largely unmeasured.

We introduce Comprehensive Ontological Relation Evaluation (CORE), a large-scale dataset comprising 225K multiple-choice questions (MCQs) spanning 74 disciplines, including STEM, Humanities, and Social Sciences. A portion of CORE is released for fine-tuning and instruction tuning, while the remainder is reserved for evaluation to mitigate contamination and overfitting. From this larger corpus, we release a general-domain evaluation subset, referred to as the CORE benchmark, designed to evaluate LLMs across 24 semantic relation types with equal representation of related and unrelated concept pairs.

The CORE benchmark consists of 203 rigorously validated MCQs, each accompanied by human-authored explanations. A large-scale human baseline constructed from responses by over 1,000 participants achieves 92.6% accuracy, establishing a strong empirical reference. In contrast, 29 state-of-the-art models achieve only 48.25–70.9% overall accuracy, despite near-ceiling performance on related pairs (89.4–100%) and severe degradation on unrelated pairs (0–41.35%). These results indicate that failures in unrelatedness reasoning substantially constrain overall model reliability. CORE enables systematic measurement of this limitation and supports analysis of semantic reasoning robustness.

Dataset

CORE is a large-scale dataset comprising 225K MCQs across 74 academic disciplines, constructed to support training, evaluation, and diagnostic analysis of relational reasoning. To reduce evaluation contamination, only a subset is released for fine-tuning, while the remainder is reserved exclusively for benchmarking.

From this corpus, we release the CORE benchmark, a general-domain evaluation set targeting 24 semantic relation types, with explicit balance between relational and unrelated concept pairs.

The benchmark contains 203 MCQs (102 open-choice, 101 blind-choice) designed to evaluate pairwise relational and analogical reasoning. Each instance includes a gold-standard answer and a human-authored explanation. Relational instances are uniformly distributed across ontological and sense-level categories to prevent skew toward specific relation types.

Question Format

Each question consists of a reference concept pair and an incomplete target pair, completed by selecting the most appropriate option from four candidates. All items use everyday, non-domain-specific concepts, enabling evaluation of general semantic reasoning rather than specialized knowledge.
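
For orientation, the sketch below loads the released data with the Hugging Face datasets library, using the dataset ID shown at the top of this page, and prints the fields of a single record. It makes no assumption about specific column names; whatever schema the release exposes is printed as-is, so treat it as an illustrative starting point rather than the official loading recipe.

from datasets import load_dataset

# Load the CORE release from the Hugging Face Hub (dataset ID from this page).
core = load_dataset("vaikhari-ai/core")
print(core)  # lists the available splits and their sizes

# Inspect one record from the first split to see the question fields.
first_split = next(iter(core.values()))
example = first_split[0]
for field, value in example.items():
    print(f"{field}: {value}")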

Example Questions

Two of the 24 example questions:
agent-instrument

Farmer is to tractor as mechanic is to?

antonymy-complementary

Awake is to asleep as correct is to?

Human Validation and Baseline

Answers and explanations were initially created for 250 questions and validated through a three-pass expert review process. The final benchmark includes the 203 questions for which perfect inter-annotator agreement was achieved (Cohen’s κ = 1.0).
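
As a point of reference, Cohen's κ for two annotators can be computed with scikit-learn's cohen_kappa_score; the toy labels below are hypothetical, not CORE annotations, and simply illustrate that items retained in the benchmark are those on which annotators agree exactly (κ = 1.0).

from sklearn.metrics import cohen_kappa_score

# Hypothetical option labels from two annotators on six items (not CORE data).
annotator_a = ["B", "C", "A", "D", "B", "A"]
annotator_b = ["B", "C", "A", "D", "B", "A"]

# Perfect agreement on every item yields Cohen's kappa of 1.0.
print(f"Cohen's kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")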

A human baseline was constructed using responses from over 1,000 participants across India, spanning undergraduate to postdoctoral education levels and diverse demographic backgrounds. Human performance reaches 92.6% accuracy overall, with 90.2% accuracy on related pairs and 95.1% accuracy on unrelated pairs, indicating that recognizing unrelatedness is not inherently difficult for humans.

Metrics

We evaluate models using accuracy, Expected Calibration Error (ECE), Overconfidence Error Rate, and Semantic Collapse Rate, capturing both correctness and the alignment between confidence and reasoning quality.
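
Accuracy and ECE are standard; for concreteness, a minimal sketch of an equal-width-bin ECE is given below. The binning is the conventional one and may differ in detail from the implementation behind the reported numbers, and the Overconfidence Error Rate and Semantic Collapse Rate follow the definitions in the paper and are not reproduced here.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the bin-weighted average of |accuracy - confidence|.
    A conventional sketch; the paper's exact binning may differ."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy usage: uniformly high confidence but only half the answers correct.
print(expected_calibration_error([0.92, 0.94, 0.93, 0.91], [1, 0, 1, 0]))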

Results

We evaluate 29 large language models, including major frontier models available as of January 15, 2025, on the CORE benchmark. All models are assessed using a uniform prompting and evaluation protocol, in which each model selects an answer option for each multiple-choice question and reports associated confidence-related outputs.
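
A schematic of this protocol is sketched below, under the assumption that each model is reachable through some query function returning an option letter and a self-reported confidence. The prompt wording, the item fields, and the query_model helper are illustrative placeholders rather than the exact prompts or parsing used in the paper; the per-item records it collects can be fed to a calibration metric such as the ECE sketch above.

# Schematic evaluation loop for the uniform MCQ protocol (illustrative only).
def evaluate_model(items, query_model):
    """items: dicts with hypothetical 'question', 'options', 'answer' fields.
    query_model: placeholder callable returning (option_letter, confidence_0_100)."""
    records = []
    for item in items:
        options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
        prompt = (
            f"{item['question']}\n{options}\n"
            "Answer with the option letter and your confidence from 0 to 100."
        )
        choice, confidence = query_model(prompt)
        records.append({
            "correct": choice == item["answer"],
            "confidence": confidence / 100.0,
        })
    accuracy = sum(r["correct"] for r in records) / len(records)
    return accuracy, records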

Model Performance and Calibration

Across evaluated models, overall accuracy ranges from 48.25–70.9%, with 45.5–67.3% on the open setting and 51.0–74.5% on the blind setting. Calibration metrics reveal substantial misalignment between confidence and correctness, with ECE ranging from 24.4–51.1% and Overconfidence Error Rate from 29.1–51.75%.

Stratification by relation type reveals a pronounced asymmetry. Accuracy on related pairs remains high (86.5–100%), while performance on unrelated pairs collapses (0–48% blind, 0–34.7% open). Joint analysis of accuracy and calibration indicates that most models exhibit unstable confidence behavior, with only a small subset approaching reliable calibration.

On the full 225K-question CORE dataset, model accuracy drops to approximately 2% across domains, reflecting further degradation on domain-specific relational reasoning.

Model Frontier

Trade-off between accuracy and confidence calibration; models toward the upper left are the most reliable.

Robustness Analysis

Semantic robustness: models above the line distinguish related from unrelated concepts more effectively.

Analysis

Across all 29 models, we observe a consistent asymmetry: near-perfect performance on recognized relations and systematic failure on unrelated pairs. Median accuracy exceeds 97% for standard relations but drops to 11% for unrelated pairs. Notably, models express nearly identical confidence in both cases (≈92–94%) despite accuracy gaps of 40–80 percentage points, resulting in severe confidence inversion.

Calibration degrades sharply on unrelated pairs, with ECE increasing 2–4× relative to related pairs across all models. Even the best-calibrated models exhibit this breakdown. Approximately 40% of errors on unrelated pairs occur with high confidence, rendering confidence signals unreliable precisely where uncertainty awareness is most critical.

Error analysis shows that failures are not random. Instead, models actively generate spurious relational structures, reflected in a mean Semantic Collapse Rate of 37.6%. Performance across individual relation types is otherwise stable and near-ceiling, making the unrelated category a clear outlier. Difficulty stratification further shows that unrelated pairs remain challenging regardless of human-defined difficulty, suggesting a qualitative limitation rather than increased task complexity.

Discussion

CORE isolates unrelatedness reasoning as a distinct and under-evaluated capability of LLMs. Despite strong performance on recognized relations, models consistently fail to identify the absence of semantic relations, while maintaining high confidence. This failure is accompanied by systematic miscalibration and structured generation of false relationships.

The consistency of this behavior across model families, scales, and training sources suggests a shared limitation. Current transformer-based LLMs appear to lack explicit mechanisms for representing relation absence or negation, resulting in confident but unsupported inferences.

Implications

Failures in unrelatedness reasoning pose risks for deployment in reasoning-dependent applications. High-confidence inference of non-existent relations can mislead downstream decision-making, particularly in high-stakes domains.

Across healthcare, finance, law, and scientific discovery, this failure mode manifests not as factual hallucination but as confident construction of false relationships, which is subtler and harder to detect. The combination of low accuracy and high confidence on unrelated inputs creates deployment risks that are not captured by standard benchmarks.

Contribution

CORE provides a benchmark for measuring, diagnosing, and tracking unrelatedness reasoning failures. The results motivate architectural and training strategies that explicitly model relation negation and uncertainty, as well as evaluation protocols that account for confident failure on unrelated inputs.

Future Work

Preliminary fine-tuning experiments on a subset of the larger CORE dataset show improvements in relational and general reasoning, which we plan to report in future work. Early multilingual experiments indicate even larger performance gaps, motivating the extension of CORE to low-resource languages and broader multilingual evaluation.

Limitations

The current benchmark and results are primarily limited to English. The task formulation uses a closed multiple-choice format, which may differ from free-form generation settings. CORE is text-only, and results may not generalize to multimodal reasoning. These represent directions for future research.

Citation

@misc{core2026,
  title={CORE: Comprehensive Ontological Relation Evaluation},
  author={Satyam Dwivedi and Nishi Kumari and Shivam Dwivedi and Anurag Purushottam and Sanjukta Ghosh and Anil Thakur and Deepak Alok and Praveen Gatla and Manjuprasad B and Bipasha Patgiri},
  year={2026},
  publisher={Vaikhari AI},
  url={https://vaikhari.ai/core}
}