Comprehensive Ontological Relation Evaluation (CORE)

from datasets import load_dataset
dataset = load_dataset("vaikhari-ai/core")

Satyam Dwivedi¹, Sanjukta Ghosh², Shivam Dwivedi², Nishi Kumari¹, Anil Kumar Thakur², Anurag Purushottam¹, Praveen Gatla³, Deepak Alok⁴, Manjuprasad B⁵, Bipasha Patgiri⁶

¹Vaikhari AI, ²IIT BHU, ³BHU, ⁴IIT Delhi, ⁵GSSSIETW, ⁶Tezpur University

Aman Gupta ¹,Anjali Kumari ¹,Ankit Raj Gupta ¹,Ankita Keshri¹,Bhaskar Singh¹,Bipasha Paul¹,Chitranshi Tiwari ¹,Deepanshu Patel¹,Harsh Mishra ¹,Himesh Jee Amar¹,Kajal Kumari¹,Mahi Doshi ¹,Muskan Chaudhary ¹,Nancy Mittal¹,Priyanshu Kumar ¹,Rahul Kumar¹,Rameesa Azma¹,Rasi Shil¹,Vineet Kumar ¹,Warisha Quatil ¹,Subhash Bharti²,Achala C³,Ananya R Naik³,Ananyabm³,Anjali Ajith ³,Ankitha Ks³,Ashitha ³,Ayesha Banu³,B.Tanmayi³,Basavasiri H L ³,Bhagyashree Hokrani ³,Bhoomika.P³,Chinmayi Mohan ³,Dhanalakshmi.N³,Divya Kn³,E.Sai Sruthi³,Hanasi Matada Eshwari ³,Harini Nayaka Gm ³,Harshitha R³,Jeevika Ks ³,Keerthana B³,Lisha S Kumar ³,M.R.Meghana³,Manogna Keshav³,Manya. E. A³,Meghana M³,Megharani³,Mohammed Ayesha Tahreem³,Nanditha M ³,Nithyashree.H ³,Punyashree P R³,R Veena³,Rakshitha M³,Ruchitha S³,Sahana.D.S³,Sai Pallavi ³,Sameeksha S³,Sandhya R ³,Sanika ³,Sanjana G Rao ³,Shafna Ms ³,Sharanya S Prasad ³,Shravya.H³,Shruthi Reddy³,Sinchana³,Sinchana S³,Siri N Murthy³,Siri Patel M ³,Sk Nikitha Reddy ³,Snehaganga N S³,Sowndarya B³,Spoorthi U³,Subhangi Dutta³,Syeda Saneen ³,Tammisetty Harini³,Thanushree M R ³,Thanushree S T ³,Varsha Suresh³,Varshitha S ³,A Vijay Aditya⁴,Aasish⁴,Aayush Bhat⁴,Abhijeet Singh⁴,Abhishek Chauhan⁴,Abhishek Kumar Maurya⁴,Abhyudaya⁴,Adarsh Kumar Gupta ⁴,Addagalla Lakshmi Sowjanya⁴,Adi Akhilesh Singh⁴,Aditi Gupta⁴,Aditya Prakash ⁴,Aditya R Jadhav⁴,Aditya Raj⁴,Aditya Singh⁴,Aishwarya Agnihotri⁴,Ajay Patel⁴,Ajay Singh⁴,Akanksha Singh⁴,Akshita Ravichndran⁴,Akula Manasa ⁴,Allu Deekshita⁴,Aman Kumar⁴,Aman Kumar Yadav⁴,Amardeep Jarwal⁴,Amit Negi ⁴,Amit Singh ⁴,Angraj Shah⁴,Anjali Kumari⁴,Anjaneya Raj Garg⁴,Ankit Prakash⁴,Ankit Sinha⁴,Anshu Kumar Ram⁴,Anshu Yadav⁴,Anupurba Dhara⁴,Anurag Kamboj⁴,Anushka Choudhary⁴,Arushi Gupta⁴,Aryan⁴,Aryan Parihar⁴,Ashutosh Singh⁴,Atmadeep Bhattacharya⁴,Avanish Dhapare⁴,Awaneesh Kumar Pandey⁴,Ayush Barot ⁴,Ayush Kumar⁴,Ayush Mondal⁴,Ayush Sharma⁴,Ayush Tripathi⁴,Banoth Nandineshwar⁴,Bellala Mukesh⁴,Bhanu Verma⁴,Bhavya Singh 
⁴,Bhupendra Yadav⁴,Bommadi Mukesh Kumar Reddy⁴,Brijesh Kumar⁴,Chaudhary Digvijay Daniel Singh⁴,Check__Email⁴,Chelsi Narang⁴,Chennadi Pavan Sainath Reddy⁴,Chivukula Sri Eswar Balaji⁴,Deen Dayal Prajapati⁴,Deepaprakash K⁴,Deepjyoti Rabha⁴,Dhruvi Rajeshbhai Mahyavanshi⁴,Dipti Gupta⁴,Divyanshu Yadav⁴,Durgam Arun Kumar⁴,Faiz Aman⁴,Farah Adiba⁴,Fizaan Khan⁴,Ganesh Sakkarge⁴,Ganguly Singh⁴,Gaurish Maheshwari⁴,H Poojan⁴,Happy Kannaujiya⁴,Harsh⁴,Harsh Kadiyan⁴,Harsh Kumar⁴,Harsh Vardhan⁴,Harshit Virmani⁴,Harshita Rajput⁴,Harshvardhan Goplani⁴,Hrishabh Deshmukh⁴,Ishika Saini⁴,Jagat Jyoti Sarkar⁴,Jain Aditya Avinash⁴,Jayesh Sewlani⁴,Kali Chopra⁴,Kalyanam Pranay⁴,Kamal⁴,Kanukollu Sateesh Kumar ⁴,Karishma Santani⁴,Kartikeya Pandey⁴,Kaushik Kumar⁴,Kishore Nayak D⁴,Kolgane Sanskruti Sanjay⁴,Komal Bhalotia⁴,Kratika Maheshwari⁴,Kritarth⁴,Kshitij Kumar⁴,Kumar Pundareekaksh⁴,Kumar Shubham⁴,Kushagr Kapoor⁴,Lalit Tolani⁴,Lalithya C ⁴,M Balasubramanian⁴,Madhur Vilas Bahadure⁴,Malipatel Sravan Kumar Reddy⁴,Manas Jayaswal⁴,Manav Gangwar ⁴,Manish Kumar⁴,Manisha Bishnoi⁴,Mannuru Venkateswarlu⁴,Mayank Agrawal ⁴,Moksh⁴,Motilal Bhatra⁴,Mradul Misra⁴,Mridul Gupta⁴,Mrinal Jain⁴,Naitik Jain⁴,Nakshatra Shivhare ⁴,Neelam Meena⁴,Nida Rahma.A⁴,Nikhil Deshmukh⁴,Nikita Gupta⁴,Nimish Thakur ⁴,Nitesh Singh⁴,Nitin Agrawal⁴,Nitish Kumar⁴,Ojasvi Tripathi⁴,Ojaswi Pandey⁴,Om Abhishek⁴,Om Shankar⁴,Omkaran ⁴,Pakhi Awasthi ⁴,Parteek Yadav⁴,Pavani Gupta ⁴,Pooja Yadav⁴,Prabhankur⁴,Prafull Kumar Deepak⁴,Prakhar Nema⁴,Pranjali Yadav⁴,Prashant Nautiyal⁴,Prathu Tripathi ⁴,Priti Kumari⁴,Priyadarshi Annupam⁴,Priyanshu Tiwari⁴,Purnima Singh⁴,Rachit Mittal⁴,Ragula Eeshareddy⁴,Rahul Kumar Sonkar⁴,Rajat Varshney⁴,Ramgopal Verma⁴,Raskar Aniket Dattatray⁴,Ratnesh Kumar Sharma ⁴,Ravi Kumar⁴,Rishabh Singh⁴,Rishabh Yadav⁴,Rishi Mishra⁴,Rishi Soni⁴,Rishit Pal⁴,Ritesh Soni⁴,Ritik Rai⁴,Ritik Raj ⁴,Ritik Raushan⁴,Rituraj Barai⁴,Rohan Sharma⁴,Rohit Pandey⁴,Rohit Prasad⁴,Saarang Kumar⁴,Sagar Sachan⁴,Sahil Shekhar⁴,Saksham 
Goel⁴,Samarth Jain⁴,Samir Kumar ⁴,Sammit Dhar⁴,Sampat Meena⁴,Sapavat Sravan⁴,Saptarshi Chakraborty⁴,Sarthak Shewale⁴,Saurabh Kumar⁴,Shashank Kumar⁴,Sheetal Nagar⁴,Shikha Kaloniya ⁴,Shivangi Gupta⁴,Shivansh Gupta⁴,Shivanshu Kumar⁴,Shreyam Chaurasia ⁴,Shreyansh Singh⁴,Shrija Tiwary⁴,Shubham Kumar⁴,Shubham Patel⁴,Shubhendra Taneja⁴,Siddhant Bhardwaj⁴,Siddharth Prakash⁴,Soham Abhay Kadam⁴,Sonali Singh⁴,Sonu Sourabh⁴,Sourashis Das⁴,Soustab Haldar⁴,Sparsh Gupta ⁴,Srajan Seth⁴,Srishti Jaiswal⁴,Sudhanshu Ranjan⁴,Suharsh Sonkar ⁴,Suman Kumar⁴,Sunanda Pandey⁴,Surkanti Harshitha Reddy⁴,Sushank ⁴,Swapnil Wakankar ⁴,Tanay Ahir⁴,Tanish Jangir ⁴,Tanishka Nama⁴,Tanishq Gupta⁴,Tarani Mishra⁴,Tejavath Sudhakar⁴,Tushar Sarda⁴,Udeechi Srivastav⁴,Utkarsh Srivastava⁴,Vaddadi Lakshmi Sri Sai Srinivas⁴,Vadithya Rajagopal⁴,Vaibhav Jain⁴,Vaibhav Saini⁴,Vanshika⁴,Vedant Bhoruka⁴,Vijay Kumar ⁴,Vikash Kumar⁴,Vineet Tyagi⁴,Vipul Bharti⁴,Vishal⁴,Vishisht Dubey⁴,Vishnu Katara⁴,Vishvender Pachaar⁴,Vivek Kumar⁴,Yash Agarwal⁴,Yash Sachan⁴,Aadithya Balachandran⁵,Abhay⁵,Aditya Sharma⁵,Akshay⁵,Barza A K⁵,Bhishen Kumar Sahu⁵,Chinmayee Mohapatra⁵,Dhairya Yadav⁵,Divyanka Swarna⁵,Hariprasad Doley⁵,Karthika P⁵,Khushi⁵,Lugai Kamei⁵,Manasi Anil Lamsoge⁵,Mojum Kamduk⁵,Neeraj N Shetty⁵,Panduru Tanisha⁵,Rohit B Sharma⁵,Sai Sudeep Das⁵,Sara Singh⁵,Sharon Valui⁵,Sheersha Roy⁵,Shivang Jaiswal ⁵,Shweta Umrankar⁵,Soumya Jain⁵,Sumayya Ayesha⁵,Suvrojit Nath⁵,Tanisha⁵,Vanshika Gupta⁵,Zitaksyor Sonowal⁵,Mohima Narzary⁶,Pratiksha Rabha⁶,Ruba Das⁶,Shruti Dekaraja⁶,Yuktashree Hazarika⁶

Affiliations

¹ BHU, ² Galgotias University, ³ GSSSIETW, ⁴ IIT BHU, ⁵ IIT Delhi, ⁶ Tezpur University

Introduction

Large Language Models (LLMs) have demonstrated strong performance across a broad range of reasoning benchmarks. However, existing evaluations largely emphasize recognition of a limited subset of semantic relations, leaving two fundamental capabilities insufficiently examined: (a) robust reasoning across a comprehensive set of sense-level relations, and (b) reliable identification of cases in which no meaningful semantic relation exists between concepts. Both are essential for dependable reasoning, yet the latter, in particular, remains largely unmeasured.

We introduce Comprehensive Ontological Relation Evaluation (CORE), a large-scale dataset comprising 225K multiple-choice questions (MCQs) spanning 74 disciplines, including STEM, Humanities, and Social Sciences. A portion of CORE is released for fine-tuning and instruction tuning, while the remainder is reserved for evaluation to mitigate contamination and overfitting. From this larger corpus, we release a general-domain evaluation subset, referred to as the CORE benchmark, designed to evaluate LLMs across 24 semantic relation types with equal representation of related and unrelated concept pairs.

The CORE benchmark consists of 203 rigorously validated MCQs, each accompanied by human-authored explanations. A large-scale human baseline constructed from responses by over 1,000 participants achieves 92.6% accuracy, establishing a strong empirical reference. In contrast, 29 state-of-the-art models achieve only 48.25–70.9% overall accuracy, despite near-ceiling performance on related pairs (89.4–100%) and severe degradation on unrelated pairs (0–41.35%). These results indicate that failures in unrelatedness reasoning substantially constrain overall model reliability. CORE enables systematic measurement of this limitation and supports analysis of semantic reasoning robustness.

Dataset

CORE is a large-scale dataset comprising 225K MCQs across 74 academic disciplines, constructed to support training, evaluation, and diagnostic analysis of relational reasoning. To reduce evaluation contamination, only a subset is released for fine-tuning, while the remainder is reserved exclusively for benchmarking.

From this corpus, we release the CORE benchmark, a general-domain evaluation set targeting 24 semantic relation types, with explicit balance between relational and unrelated concept pairs. The benchmark contains 203 MCQs (102 open-choice, 101 blind-choice) designed to evaluate pairwise relational and analogical reasoning. Each instance includes a gold-standard answer and a human-authored explanation. Relational instances are uniformly distributed across ontological and sense-level categories to prevent skew toward specific relation types.

Question Format

Each question follows the analogical reasoning format: a reference concept pair (A:B) and an incomplete target pair (C:?), with four completion options. Questions employ everyday vocabulary, enabling evaluation of general semantic reasoning.
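
The format described above can be pictured as a small record type. The following is a minimal sketch, not the dataset's actual schema: the class name, field names, and prompt layout are all hypothetical, and the four options shown are made up for illustration (only the question stem and relation label come from the example questions on this page).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalogyQuestion:
    """One analogy-style MCQ. All field names are hypothetical."""
    relation: str                       # relation label, e.g. "agent-instrument"
    source_pair: tuple[str, str]        # reference pair (A, B)
    target: str                         # C in the incomplete pair (C:?)
    options: tuple[str, str, str, str]  # four completion options
    answer_index: int                   # index of the gold option

    def prompt(self) -> str:
        """Render the question stem followed by lettered options."""
        a, b = self.source_pair
        stem = f"{a.capitalize()} is to {b} as {self.target} is to?"
        lines = [f"{'ABCD'[i]}. {opt}" for i, opt in enumerate(self.options)]
        return "\n".join([stem, *lines])


# Illustrative instance: the options and gold index are invented here,
# they are NOT taken from the benchmark.
example = AnalogyQuestion(
    relation="agent-instrument",
    source_pair=("farmer", "tractor"),
    target="mechanic",
    options=("wrench", "garage", "engine", "customer"),
    answer_index=0,
)
print(example.prompt())
```

Keeping the relation label on each record makes it straightforward to report accuracy per relation type later.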

Example Questions

agent-instrument

Farmer is to tractor as mechanic is to?

antonymy-complementary

Awake is to asleep as correct is to?

Human Validation and Baseline

Answers and explanations for the CORE benchmark were initially developed for 250 questions and validated through a three-pass expert review process. The final benchmark comprises the 203 questions for which perfect inter-annotator agreement was achieved (Cohen’s κ = 1.0). Each question includes a human-authored explanation of why the correct answer instantiates the target relationship. Retaining only items with perfect agreement is intended to make the ground truth reliable.

Subsequently, a human baseline was constructed using responses from over 1,000 participants in India, spanning undergraduate to postdoctoral education levels, who completed the benchmark under blind evaluation conditions. The human baseline achieves 92.6% overall accuracy.

Metrics

We evaluate models using accuracy, Expected Calibration Error (ECE), Overconfidence Error Rate, and Semantic Collapse Rate, capturing both correctness and the alignment between confidence and reasoning quality.
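
As a point of reference, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that standard definition. The function name, the equal-width binning, and the choice of 10 bins are assumptions, since the binning scheme used for CORE is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    n_bins=10 with equal-width bins is an assumption, not the CORE setting.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

For instance, four answers given at 0.9 confidence of which only one is correct land in a single bin with accuracy 0.25, so the ECE is |0.25 − 0.9| = 0.65.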

Results

We evaluate 29 state-of-the-art LLMs as of a cutoff date of January 22, 2026. The evaluation covers models from all major developers, including the GPT series, the Llama family, and Claude models. All models were evaluated using a uniform prompting and evaluation protocol to ensure fair comparison.

Confidence Accuracy Inversion

Overall accuracy ranges from ≈48.25% to 70.9% across the 29 models. However, this range masks a sharp split between related and unrelated pairs. Despite accuracy differences of 40–80 percentage points between the two, models report near-identical confidence on both (related: ≈93–95%, unrelated: ≈91–94%). This confidence–accuracy inversion undermines the utility of confidence as a reliability signal, preventing downstream decision systems from appropriately weighting model outputs.
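
One way to surface this pattern from raw model outputs is to compute accuracy and mean confidence separately for related and unrelated pairs and compare the two. The sketch below assumes a hypothetical row layout of (is_related, is_correct, confidence); it is an illustration, not the evaluation code used for CORE.

```python
def subset_summary(rows):
    """Per-subset accuracy, mean confidence, and their gap.

    rows: iterable of (is_related, is_correct, confidence) tuples.
    This row layout is a hypothetical convention for illustration.
    """
    rows = list(rows)
    summary = {}
    for label, flag in (("related", True), ("unrelated", False)):
        subset = [(ok, conf) for rel, ok, conf in rows if rel == flag]
        if not subset:
            continue  # no items of this kind, nothing to report
        accuracy = sum(1 for ok, _ in subset if ok) / len(subset)
        mean_conf = sum(conf for _, conf in subset) / len(subset)
        summary[label] = {
            "accuracy": accuracy,
            "mean_confidence": mean_conf,
            # A positive gap means confidence exceeds accuracy.
            "gap": mean_conf - accuracy,
        }
    return summary
```

A large positive gap on the unrelated subset alongside a small gap on the related subset is the signature described above.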

On the full 225K-question CORE dataset, model accuracy drops to approximately 2% across domains, reflecting further degradation on domain-specific relational reasoning.

Calibration

ECE is 2–4x higher on unrelated pairs than on related pairs. Overconfidence Error Rate (errors with confidence ≥0.75) ranges from 29.1% to 51.75%, meaning that roughly one-third to one-half of errors on unrelated pairs occur with high confidence, which makes them particularly dangerous in deployment contexts.
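
The Overconfidence Error Rate can be sketched as follows. The 0.75 threshold comes from the text above. Using the number of errors as the denominator (rather than all responses) is an assumption, chosen because it matches the reading that a share of errors occurs with high confidence. The function name is hypothetical.

```python
def overconfidence_error_rate(confidences, correct, threshold=0.75):
    """Share of incorrect answers given with confidence >= threshold.

    Assumption: the denominator is the number of errors, not the number
    of all responses.
    """
    error_confidences = [c for c, ok in zip(confidences, correct) if not ok]
    if not error_confidences:
        return 0.0  # no errors, so no overconfident errors
    high = sum(1 for c in error_confidences if c >= threshold)
    return high / len(error_confidences)
```

To reproduce the per-subset figure, the function would be applied to the unrelated pairs only.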

Model Frontier

Figure: trade-off between accuracy and confidence calibration. Models in the upper-left region are the most reliable.

Robustness Analysis

Figure: semantic robustness. Models above the line distinguish related from unrelated concepts better.

Semantic Collapse

Semantic collapse rate (proportion of unrelated pairs misclassified as having relations) averages 37.6% across models, substantially lower than random guessing's expected 75% error. This indicates models do not fail through random guessing but through systematic generation of false relational structures.
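
A minimal sketch of the semantic collapse rate, following the definition above. It assumes each response is recorded as a pair (is_unrelated, predicted_no_relation), which is a hypothetical encoding, and it assumes that on an unrelated item exactly one of the four options expresses the absence of a relation, which is what the 75% random-guessing figure implies.

```python
def semantic_collapse_rate(records):
    """Share of unrelated pairs on which a relation was asserted anyway.

    records: iterable of (is_unrelated, predicted_no_relation) booleans.
    This encoding is hypothetical; related pairs are ignored.
    """
    unrelated = [pred_none for is_unrel, pred_none in records if is_unrel]
    if not unrelated:
        return 0.0
    collapsed = sum(1 for pred_none in unrelated if not pred_none)
    return collapsed / len(unrelated)


# Uniform guessing over four options, one of which is correct,
# picks a wrong (relation-asserting) option with probability 3/4.
RANDOM_GUESS_COLLAPSE = 3 / 4
```

Comparing a model's rate against `RANDOM_GUESS_COLLAPSE` gives the reference point used in the paragraph above.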

Analysis

Across all 29 models, we observe a consistent asymmetry: near-perfect performance on recognized relations and systematic failure on unrelated pairs. Median accuracy exceeds 97% for standard relations but drops to 11% for unrelated pairs. Notably, models express nearly identical confidence in both cases (≈92–94%), despite accuracy differences of 40–80 percentage points, resulting in a severe confidence–accuracy inversion.

Error analysis shows that failures are not random. Instead, models actively generate spurious relational structures, reflected in a mean Semantic Collapse Rate of 37.6%. Performance across individual relation types is otherwise stable and near-ceiling, making the unrelated category a clear outlier. Difficulty stratification further shows that unrelated pairs remain challenging regardless of human-defined difficulty, suggesting a qualitative limitation rather than increased task complexity.

Discussion

CORE isolates unrelatedness reasoning as a distinct and previously under-evaluated capability of LLMs. Despite strong performance on recognized relations, models consistently fail to identify absence of semantic relations while maintaining high confidence. This cross-cutting consistency suggests that the failure reflects a shared limitation in how current models and training pipelines handle relation absence, potentially arising from a combination of architectural inductive biases, task formulation, and the availability of appropriate training signals.

Implications

The combination of high confidence and low accuracy on unrelated pairs presents challenges for deploying language models in reasoning-dependent settings. When models confidently construct spurious semantic relationships, downstream systems may treat unsupported inferences as reliable signals.

In healthcare, such behaviour may surface high-confidence associations driven by confounding rather than causation, potentially influencing clinical decision-making. In financial contexts, models may assign undue significance to coincidental correlations, increasing exposure to risk. Legal and scientific applications face similar concerns, where plausible but incorrect relational reasoning may affect legal arguments or research prioritization.

Contribution

CORE provides a benchmark for measuring, diagnosing, and tracking unrelatedness reasoning failures. The results motivate architectural and training strategies that explicitly model relation negation and uncertainty, as well as evaluation protocols that account for confident failure on unrelated inputs.

Work in Progress

Preliminary multilingual experiments indicate even larger performance gaps in non-English languages, motivating extension of CORE to low-resource languages and broader multilingual evaluation.

Preliminary fine-tuning experiments on the full 225K-question CORE dataset show improvements in relational and general reasoning, which we plan to analyze and report in future work.

Limitations

The current benchmark and results are primarily limited to English. The task formulation uses a closed multiple-choice format, which may differ from free-form generation settings. CORE is text-only, and results may not generalize to multimodal reasoning. These represent directions for future research.

Citation

@misc{dwivedi2026corecomprehensiveontologicalrelation,
      title={CORE: Comprehensive Ontological Relation Evaluation for Large Language Models},
      author={Satyam Dwivedi and Sanjukta Ghosh and Shivam Dwivedi and Nishi Kumari and Anil Thakur and Anurag Purushottam and Deepak Alok and Praveen Gatla and Manjuprasad B and Bipasha Patgiri},
      year={2026},
      eprint={2602.06446},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.06446},
}