Jose Hernandez-Orallo's Home Page

José Hernández-Orallo
Professor, Valencian Research Institute for Artificial Intelligence
Valencian Graduate School and Research Network of AI
Universitat Politècnica de València
Phone: +34963877007 (Ext:73585), Office: DSIC (1F): 236
Address: Camí de Vera 14, E-46022 València, EU. Email: jorallo@upv.es
Senior Research Fellow, Leverhulme Centre for the Future of Intelligence
Research Affiliate, Centre for the Study of Existential Risk
Address: 16 Mill Lane, Cambridge, UK. Email: jh2135@cam.ac.uk
Fellow, European Association for AI, Scientific Council: Renault Group, Advisory board: Concordia, tangentic.ai, Arkose.

Highlights
Research
Teams
Teaching
Transfer
Short bio
In the media

Status

Based in Europe. Intermittently between Valencia and Cambridge.

For strategy matters, you can also contact my team's executive assistant, Joe Castellano.

Be informed about everything (important) on AI Evaluation: join our AI Evaluation Digest!

Still under the effects of my book on the Evaluation of Natural and Artificial Intelligence, Cambridge University Press 2017, Prose Award 2018 presented by the Association of American Publishers.

New project "Data Science Benchmark for LLM Agents" at CFI, Cambridge, with Lorenzo Pacchiardi and Lucy Cheke, funded by Open Philanthropy.

Coalescing efforts after this workshop and the related initiative on "Predictable AI" on March 8th, 2023, supported by the FLI, as a member of their AI x-safety community. See related paper "Predictable Artificial Intelligence".

Working on and enjoying this project "MT4XAI: Machine Teaching for Explainable AI", Norwegian Research Council, with J.A. Telle, C. Ferri and P. Parviainen.

Working with the OECD on their "AI and the Future of Skills" project and the European Commission (JRC) on the characterisation of Foundation/Frontier Models and GPAI.

Recent Highlights

Fixing AI Evaluation ;-) Paper: General Scales Unlock AI Evaluation with Explanatory and Predictive Power, Platform: Contribute!

Participating in Measuring AI in the World SFI-DeepMind Event.

Featured in the Wall Street Journal: Why Do AI Chatbots Have Such a Hard Time Admitting 'I Don't Know'? I don't really know!

Animal AI Environment v3.0 is out: Paper.

Predictaboard: a platform and benchmark to evaluate pairs of LLMs and Assessors predicting their performance.

Supportive of the EU AI Code of Practice. But making suggestions to make it better by providing feedback1 and feedback2.

Identification and analysis of the main AI Evaluation Paradigms.

IJCAI and ECAI 2025 Area Chair.

Open Problems in Machine Unlearning for AI Safety.

Talk at the Centre for Mathematical Sciences, Cambridge: Can Humans Supervise Increasingly Ultracrepidarian AI?

Zhou, L., Schellaert, W., Martinez-Plumed, F., Moros-Daval, Y., Ferri, C., Hernandez-Orallo, J. (2024) "Larger and More Instructable Language Models Become Less Reliable", Nature. Additional Nature coverage. X thread.

Takeaways from the Melting Pot Contest - Outcomes, Insights and Analysis. "Melting Pot Contest: Charting the Future of Generalized Cooperative Intelligence", NeurIPS2024 (D&BT).

Schellaert, W., Martinez-Plumed, F., Hernandez-Orallo, J. (2024) Analysing the Predictability of Language Model Performance.

Massive challenges, massive paper: "Foundational Challenges in Assuring Alignment and Safety of Large Language Models", TMLR. A summary here.

50th anniversary of the European Conference on Artificial Intelligence (ECAI): accepted papers (main track): B. Mehrbakhsh, F. Martinez-Plumed, J. Hernandez-Orallo "Distilling the Effects of Language Model Contamination" and Y. Moros-Daval, F. Martinez-Plumed, J. Hernandez-Orallo "Language Task Difficulty Prediction through LLM-Annotated Meta-Features". Gave the invited Frontiers in AI Talk: "Caveats and Solutions for Characterising General-Purpose AI".

Talk at Indian Space Research Organisation - 10th Sep: "AI Evaluation: Predictive and Explanatory Power".

Visited China in July 2024: Prof Fang Luo at Beijing Normal University in the context of SICSS2024 and ICCPAE2024 and Xing Xie at Microsoft Research Asia in the context of an Accelerating Foundation Models Research Programme project.

Joined the NIST AISI Consortium through VRAIN.

A few recent publications: B. Mehrbakhsh, D. Garigliotti, F. Martinez-Plumed, J. Hernandez-Orallo "Confounders in Instance Variation for the Analysis of Data Contamination", CONDA 2024. R. Fabra-Boluda, C. Ferri, J. Hernandez-Orallo, M.J. Ramirez-Quintana, F. Martínez-Plumed "Cracking black-box models: Revealing hidden machine learning techniques behind their predictions", Intelligent Data Analysis. Y. Zhao, J. Hernandez-Orallo "The impact of sociality regimes on heterogeneous cooperative-competitive multi-agent reinforcement learning: a study with the predator-prey game", Journal of Experimental & Theoretical Artificial Intelligence.

Steering committee for AISafety Workshop 2024 @ IJCAI2024, following the previous AISafety Workshops 2019, 2020, 2021, 2022 and 2023.

Talk about "Predictable AI" at the IX Imperial College AI initiative, 30 January 2024.

Involved in two tutorials: AAAI2024 on "Measurement Layouts for Capability-oriented AI Evaluation" and EACL2024 "Item Response Theory for Natural Language Processing".

Area chair for ECAI2024 and ECML2024. Action Editor for Machine Learning Journal. Consider submitting your best research to any of these venues!

Paper "Team Formation through an Assessor: Choosing MARL Agents in Pursuit-Evasion Games" at CAIS 2024.

I'm talking too much apparently about Predictable AI, Measurement Layouts, AI Validity and AI Evaluation: Cambridge Psychometrics Centre, CSIC's "The many challenges of artificial intelligence", Workshop on Verifiable and Robust AI, UNESCO IFAP "Artificial Intelligence for Information Accessibility" (AI4IA) Conference, Alan Turing Institute workshop on Evaluating Foundation/Frontier Models.

Our RECoG-AI project and our work on AI evaluation featured by Nature ("A test of artificial intelligence"), as part of this Nature Outlook on AI.

New preprints (warning: not all peer-reviewed yet!): "Revealing the structure of language model capabilities", "Evaluating General-Purpose AI with Psychometrics", "An International Consortium for Evaluations of Societal-Scale Risks from Advanced AI" (short version accepted for NeurIPS SOLAR), "Predictable Artificial Intelligence".

Honoured to give the 2023 UPV Inaugural Lecture with title "Artificial and natural intelligence: from diversity to generality": (Slides (in English), Text (in Valencian) and Slides (in English)).

"Rethink reporting of evaluation results in AI: Aggregate metrics and lack of access to results limit understanding" published in Science, 2023. Preprint here.

"Your Prompt is My Command : On Assessing the Human-Centred Generality of Multimodal Models" published in Journal of Artificial Intelligence Research, 2023.

Gave a talk at Bell Labs in Cambridge, titled "Capability-oriented AI Evaluation: From Measurement Layouts to Validity Predictors", July 2023.

"Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning" published in Sustainable Computing: Informatics and Systems, 2023.

Many, many interviews like "this one in El País", because I signed a few letters ("this" and "this").

"Heuristic search of optimal machine teaching curricula" published in Machine Learning, 2023.

Papers accepted at ECAI2023, B Mehrbakhsh, F Martinez-Plumed, J Hernandez-Orallo "Adversarial Benchmark Evaluation Rectified by Controlling for Difficulty" and ECML2023, BAT Havardstun, C Ferri, J Hernandez-Orallo, P Parviainen, JA Telle "XAI with Machine Teaching When Humans Are (Not) Informed About the Irrelevant Features".

Gave a talk at the Responsible Artificial Intelligence in the age of big models: Understanding and Evaluating Big Models for Human Intelligence and Learning. Microsoft Research, April 2023

ChatGPT, PISA and the Future of Education, OECD WIPS Conference, 28 March 2023.

Our work redteaming GPT-4 covered by the Financial Times.

One of the zillion co-authors of BigBench "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models". But we do not measure capabilities (only performance) there despite the title! Now accepted for TMLR 2023.

Gave a talk at the 18th Annual Conference of the Italian Association of Cognitive Sciences on Dec 16th, Rovereto, Italy.

Our RECOG-AI project covered in Communications of the ACM.

Gave a talk at the School of Computing Colloquium (University of Leeds) on Dec 2nd, 2022, with the title "Performance and Explainability Are Not Enough: Predicting AI Validity".

I gave the talk "Don't Trust Your AI System: Model Its Validity Instead", in the Series of talks on "Trustworthy AI" for the "AI for Good Global Summit", United Nations ITU (International Telecommunication Union). 14 Nov 2022.

Steering the SafeAI Workshops, next one at AAAI2023, following the previous editions in 2019, 2020, 2021 and 2022.

G. Jaimovitch, C. Ferri, J. H. Orallo, F. M-Plumed, M.J. Ramirez "Can language models automate data wrangling?" has been accepted for publication in the Machine Learning Journal.

Keynote speaker: "Instructing prior-aligned machines: programs, examples and prompts", The 2nd International Joint Conference on Learning & Reasoning (IJCLR) Cumberland Lodge, Windsor Great Park, United Kingdom, 28-30 September 2022.

Three papers accepted for IJCAI-ECAI2022: "Not a Number: Identifying Instance Features for Capability-Oriented Evaluation" with R Burnell, J Burden, D Rutar, K Voudouris, L Cheke, "Non-Cheating Teaching Revisited: A New Probabilistic Machine Teaching Model" with C Ferri and JA Telle, and "Measuring the occupational impact of AI: tasks, cognitive abilities and AI benchmarks" with S. Tolan, F. M-Plumed, A. Pesole, E. F-Macias and E. Gomez

IJCAI2022 Survey Track co-Chair (with Peter Flach).

Co-organising the Evaluation Beyond Metrics workshop 2022 @ IJCAI2022.

Co-organising the AISafety Workshop 2022 @ IJCAI2022, following the previous AISafety Workshops 2019, 2020 and 2021.

Paper accepted for ECML/PKDD2022 "Heterogeneity Breaks the Game: Evaluating Cooperation-Competition with Multisets of Agents" with Y. Zhao

Three papers accepted for AAAI2022: "When AI Difficulty is Easy: The Explanatory Power of Predicting IRT Difficulty", with F. M.-Plumed, D. C-Falcon and C. Monserrat, "How General-Purpose Is a Language Model? Usefulness and Safety with Human Prompters in the Wild" with P A M Casares, B. S. Loe, J. Burden and S. O'hEigeartaigh, and a Senior Member Track Paper: "Training on the Test Set: Mapping the System-Problem Space in AI", with W. Schellaert and F. M.-Plumed (Blue Sky Idea Runner-Up Award)

Our paper on "Automating Data Science" with T. De Bie, L. De Raedt, H. H. Hoos, P. Smyth and C. K. I. Williams on the cover of the Communications of the ACM!

Less Recent Highlights

Hernandez-Orallo, J.; Loe, B.S.; Cheke, L.; Martinez-Plumed, F., O h'Eigeartaigh, S. "General Intelligence Disentangled: The Generality of Natural and Artificial Intelligence", Nature Sci Rep 2021.

New chapter: "Identifying artificial intelligence capabilities: What and how to test", in "AI and the Future of Skills, Volume 1: Capabilities and Assessments", OECD Publishing, Paris.

Co-organising the SafeAI Workshops at AAAI 2022 following the previous editions in 2019, 2020 and 2021.

Special Issue Editor on Automating Data Science for the Machine Learning Journal

New NeurIPS2021 paper: "Think Big, Teach Small: Do Language Models Distil Occam's Razor?" with G. Jaimovitch, D. C-Falcon and C. Ferri.

Chapter "Teaching and Explanation: Aligning Priors between Machines and Humans" with C. Ferri in Muggleton, S. and Chater, N. (Eds.) (2021) Human-Like Machine Intelligence. Oxford University Press.

ECML/PKDD 2021 paper "Optimal Teaching Curricula with Compositional Simplicity Priors" with Manuel Garcia-Piqueras.

Participated in a panel at the NIST AI Measurement and Evaluation Workshop in June 2021 and the AI Metrology series in September 2021.

Co-organised ECML/PKDD Workshop on Automating Data Science 2021.

New paper accepted for the Journal of Artificial Intelligence Research ("Measuring the occupational impact of AI: tasks, cognitive abilities and AI benchmarks" with S. Tolan, A. Pesole, F. Martinez-Plumed, E. Fernandez-Macias and E. Gomez), 2021.

Co-organised AISafety Workshop 2021 @ IJCAI2021, following the previous AISafety Workshops 2019 and 2020.

New papers accepted for Artificial Intelligence Journal ("Making sense of sensory input" with Richard Evans et al.), Machine Learning Journal ("AUTOMAT[R]IX: learning simple matrix pipelines"), J. of Intelligent Systems ("Missing the missing values: The ugly duckling of fairness in machine learning"), Nature Mach Intell ("Research Community Dynamics behind Popular AI Benchmarks") (see some coverage here) and Telematics and Informatics ("Futures of artificial intelligence through technology readiness levels") 2021.

Invited for the Spanish Senate's Commission on Economic Affairs and Digital Transformation, March 2021.

Paper "Negative Side Effects and AI Agent Indicators: Experiments in SafeLife" at SafeAI Workshop at AAAI 2021.

Participated in the OECD Expert Meeting on Skills and Tests for Assessing AI and Robotics, with this presentation.

New papers accepted for Minds and Machines ("Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too") and Expert Systems with Applications ("Learning alternative ways of performing a task").

Animal AI Olympics Paper: "The Animal-AI Testbed and Competition", Proceedings of Machine Learning Research, 2020.

Co-organised the 1st Workshop on Evaluating Progress in AI (EPAI2020) at ECAI 2020.

Four papers accepted for ECAI 2020: "Tracking AI: The Capability is (Not) Near" with F. Martínez-Plumed and E. Gómez, "AI Paradigms and AI Safety: Mapping Artefacts and Techniques to Safety Issues" with F. Martínez-Plumed, Shahar Avin, Jess Whittlestone and Seán O h'Eigeartaigh, "Finite and Confident Teaching in Expectation: Sampling from Infinite Concept Classes" with J.A. Telle and "Family and Prejudice: A Behavioural Taxonomy of Machine Learning Techniques" with the DMIP team.

Read our paper: CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories, IEEE Transactions on Knowledge and Data Engineering journal, 2020.

Read our paper: "Does AI Qualify for the Job? A Bidirectional Model Mapping Labour and AI Intensities", AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2020.

Co-organised the Animal-AI Olympics, 2019.

Read: Journal of Artificial Intelligence Research: "AI Generality and Spearman's Law of Diminishing Returns", 2019.

Contributing to the AI Safety Landscape.

Gave a talk at the Cambridge Science Festival, 2019.

Measure for measure column: "Unbridled mental power", Nature Physics, vol. 15, 2019.

Read our paper "Item Response Theory in AI: Analysing Machine Learning Classifiers at the Instance Level", Artificial Intelligence Journal, 2019.

RESEARCH

One of the great scientific challenges of this century is to understand what intelligence is and how it can be recreated. My bit is, on one hand, the evaluation and measurement of intelligent systems in general and machine learning in particular and, on the other hand, some more applied research on data science, data mining and inductive programming. However, I'm interested in many other things, and my publication profiles below can give a better account of what my research really looks like:

Google Scholar Profile.
Publication list (with preprints). Not always up-to-date.

Here you also have a selection of some tutorials and presentations:

"Don't Trust Your AI System: Model Its Validity Instead", Series of talks on "Trustworthy AI" for the "AI for Good Global Summit", United Nations ITU (International Telecommunication Union). 14 Nov 2022.
"Instructing prior-aligned machines: programs, examples and prompts", The 2nd International Joint Conference on Learning & Reasoning (IJCLR) Cumberland Lodge, Windsor Great Park, United Kingdom, 28-30 September 2022.
"Estud-IA-ntes": CRUE V Jornadas FOLTE: Hacia una nueva realidad de las tecnologías educativas. IA y enseñanza híbrida: 7 July 2022.
"Capability Oriented Evaluation (of AI)", Bristol AI Summer Day, 30 June 2022, University of Bristol.
"Calibrating expectations about AI: A renewed endeavour towards the measurement of behaviour", Aachen: Kate Hamburger Kolleg Aachen: Cultures of Research, Lecture Series - Philosophy of AI, 18 May 2022
"Instructing machines effectively", 21.04.2022 - 12.45-13.30, Large auditorium, 2nd floor, Hoyteknologisenteret, University of Bergen, Distinguished Lecture
"Benchmarks and competitions: How do they help us evaluate AI?" at OECD AI WIPS : The Second International Conference on AI in Work, Innovation, Productivity and Skills, 22 Feb 2022.
"Evaluating the capabilities and generality in artificial intelligence" SIG-AGI Japanese society, 26th November, 2021
"Dissecting the Grail of General Intelligence: Measuring the Generality of Natural and Artificial Intelligence" Artificial vs. Natural Intelligence: New Perspectives from Metaphysics, Ethics, and Robotics. Turin, October 29, 2021
US National Institute of Standards and Technology, "Measuring capabilities and generality in artificial intelligence", NIST AI Metrology presentation Series, 9 September 2021
"The evaluation of artificial, hybrid and collective intelligence", Summer Course UIMP, Repensando los Fundamentos de la Inteligencia Artificial, 12 July 2021
"The Generality of Natural and Artificial Intelligence: Task Difficulty as the Elephant in the Room" Science of Intelligence 2021, Berlin, 17 June 2021
"Future Intelligence Testing: Turing's Still Playing", Ciclo de Conferencias, Universidad de Salamanca: 14 May 2021
"Assessing AI Capabilities: Too Important to Be Fooled" at OECD International Conference on AI in Work, Innovation, Productivity and Skills (AI-WIPS): 1-5 February 2021, Panel on exploring assessments of AI capabilities: 4 February 2021, AI-WIPS
Identifying Artificial Intelligence Capabilities: What and How to Test, OECD Expert Meeting on Skills and Tests for Assessing AI and Robotics, 5-6 October 2020.
Measuring Intelligence Incrementally: Search, Demonstration and Transmission, DARPA Machine Common Sense PI Meeting, July 21-23, 2020.
AI Paradigms and AI Safety: Mapping Artefacts and Techniques to Safety Issues, European Conference on Artificial Intelligence, Aug 29 - Sep 8, 2020.
On Broken Yardsticks and Measurement Scales, (paper), MetaEval@AAAI2020, New York, Feb 8, 2020.
Surveying Safety-Relevant AI Characteristics, SafeAI2019 AAAI Workshop, Honolulu, 27 January 2019.
AI Extenders: The Ethical and Societal Implications of Humans Cognitively Extended by AI, AAAI / ACM Conference on Artificial Intelligence, Ethics and Society, Honolulu, 27-28 January 2019.
Artificial Intelligence : Present, Future and Beyond, [video], HUMAINT Winter school on AI and its ethical, social, legal and economic impact, JRC Seville, February 4-8th, 2019.
The What and How of AI Evaluation, [video], HUMAINT Winter school on AI and its ethical, social, legal and economic impact, JRC Seville, February 4-8th, 2019.
Artificial Intelligence and Data Science : Looking Ahead, Springer Nature Conference on Artificial Intelligence in Biomedicine, Madrid 7th February 2019.
Diversity Unites Intelligence: Measuring Generality at Varieties of Minds 2018 in Cambridge, 2018
Measuring A(G)I Right: Some Theoretical and Practical Considerations at DeepMind, London, 2018
Natural and Artificial Intelligence: Measures, Maps and Taxonomies at Clare Hall, Cambridge, 2018
The Mythical Human-Level Machine Intelligence at Philosophy and Theory of AI in Leeds, 2017
FHI, Oxford, Aug 2016, or the very similar CSER, Cambridge, Aug 2016, about the measurement of natural and artificial intelligence
Machine learning performance evaluation: tips and pitfalls at PAPIs Connect, 2016
AGI 2015 Tutorial: Evaluation of Intelligent Systems
Invited Talk at ECML Workshop on Learning over Multiple Contexts 2014
Machine intelligence evaluation: from the Turing Test to the present Day (and beyond), for a Turing centenary celebration

Apart from the recent one on the Evaluation of Natural and Artificial Intelligence, I've published several other books on various topics.

I am collaborating on several national strategies for AI, I am on the editorial boards of the Springer journals Machine Learning and Data Mining and Knowledge Discovery, and I have served as area chair or senior PC member of IJCAI, AAAI, ECAI, KDD, ECML and NeurIPS, and as PC member for many others: ICML, CogSci, AGI, ICDM, UAI, ICLR, etc.

TEAMS

Data Mining, Machine Intelligence and Inductive Programming (DMIP), part of the ELP group.

You can see the projects, software and other activities of our group here.
If you're a student or an academic and want to collaborate with us or get involved with our group, please feel free to contact us.
We are part of the Valencian Research Institute for Artificial Intelligence.

Kinds of Intelligence Programme, at the Leverhulme Centre for the Future of Intelligence.

I participate in this programme and others at the CFI. Stay tuned!

TEACHING

More than 15,000 students (as of 2020) on my edX MOOC course on Machine Learning and Data Science (in Spanish).
I currently teach several courses on data science and data mining at the master level.
I also had an undergraduate database course for many years.

TRANSFER

We have had projects, collaborations and visits with several companies in different areas: health, retailing, software development, automotive, ...

Recently, I've been managing two "Cátedras/Aulas de Empresa":

Agreement (cátedra) with BigML on machine learning.
Agreement (aula) with Everis on employability in information technology.

SHORT BIO

José Hernández-Orallo is Professor at the Universitat Politècnica de València, Spain and Senior Research Fellow at the Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK. He received a B.Sc. and an M.Sc. in Computer Science from UPV, partly completed at the École Nationale Supérieure de l'Électronique et de ses Applications (France), and a Ph.D. in Logic and Philosophy of Science with a doctoral extraordinary prize from the University of Valencia. His academic and research activities have spanned several areas of artificial intelligence, machine learning, data science and intelligence measurement, with a focus on a more insightful analysis of the capabilities, generality, progress, impact and risks of artificial intelligence. He has published five books and more than two hundred journal articles and conference papers on these topics. His research in the area of machine intelligence evaluation has been covered by several popular outlets, such as The Economist, New Scientist or Nature. He keeps exploring a more integrated view of the evaluation of natural and artificial intelligence, as vindicated in his book "The Measure of All Minds" (Cambridge University Press, 2017, PROSE Award 2018). He is a member of AAAI, CLAIRE and ELLIS, and a EurAI Fellow.

IN THE MEDIA (and blogs) (not up to date)

Don't take this too seriously:

The Economist, March 3rd 2011
New Scientist Sat 10/Sept/2011, pp42-45.
Alpha Galileo. World's independent source of research news
Eurekalert. Global news service of the AAAS (American Association for the Advancement of Science)
ACM Tech News
Vokrug Sveta (Russia)
BBC News
Nature
Inverse
Financial Times
...

The anYnt project had extraordinary (and sometimes hilarious) media coverage.

Blogs:

Signatory of the DORA declaration: better ways to evaluate research

Member of AI Existential Safety Community

I support ARKOSE

Signatory of Pause Giant AI Experiments

Signatory of AI Risk

(Copyleft) José Hernández Orallo, 2024.