Researchers have introduced HeartBench, a dedicated Chinese-language framework for evaluating anthropomorphic intelligence in large language models: the ability to understand and respond to complex emotional, cultural, and ethical situations, rather than to demonstrate simple reasoning or linguistic accuracy. Built from realistic psychological counseling scenarios vetted by clinical experts, HeartBench probes how well models navigate the subtle human dynamics central to applications such as Artificial Intelligence companionship and digital mental health, where misreading emotions or cultural context can have serious consequences. The work positions anthropomorphic intelligence as a core requirement for more human-centric Artificial Intelligence systems and argues that existing benchmarks underrepresent these demands.
HeartBench is constructed around a theory-driven taxonomy of five primary dimensions and fifteen secondary capabilities, translating abstract traits such as empathy, curiosity, humor, and ethical autonomy into concrete, measurable criteria. The framework uses a rubric-based, “reasoning-before-scoring” protocol: models must generate explicit reasoning that is then evaluated against detailed scoring rubrics, capturing the nuance of their responses (a sketch of this protocol appears below). Using HeartBench, the researchers assessed 13 state-of-the-art large language models and found a substantial performance ceiling: even the leading models achieved only 60% of the expert-defined ideal score. Further analysis on a difficulty-stratified “Hard Set” of scenarios showed a significant performance decline in situations involving subtle emotional cues and complex ethical considerations. The results indicate that current systems often fall back on literal interpretations or generic advice when confronted with ambiguity, and that pattern matching alone is not sufficient for robust socio-emotional intelligence.
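The following Python sketch illustrates the general shape of a rubric-based, reasoning-before-scoring loop. It is a minimal illustration, not the paper’s implementation: the names (`Rubric`, `judge_response`, `score_scenario`), the scoring scale, and the stubbed-out judge are all assumptions made for clarity, and HeartBench’s actual prompts and rubric contents are not reproduced here.

```python
# Minimal sketch of a rubric-based, "reasoning-before-scoring" evaluation loop.
# All names and the scoring scale are hypothetical; HeartBench's actual prompts
# and rubric contents are not reproduced here.
from dataclasses import dataclass


@dataclass
class Rubric:
    capability: str      # e.g. one of the fifteen secondary capabilities
    criteria: list[str]  # expert-written descriptors of an ideal response
    max_score: int       # points available under this rubric


def judge_response(scenario: str, response: str, rubric: Rubric) -> dict:
    """Elicit explicit reasoning *before* a numeric score.

    A real implementation would prompt an LLM judge with the scenario, the
    candidate response, and the rubric criteria; here the judge is stubbed
    out so only the protocol shape is shown.
    """
    reasoning = (
        f"Checked response against {len(rubric.criteria)} criteria "
        f"for '{rubric.capability}'."
    )  # placeholder for the judge's chain of reasoning
    score = 0  # placeholder; a real judge assigns 0..rubric.max_score
    return {"reasoning": reasoning, "score": score, "max": rubric.max_score}


def score_scenario(scenario: str, response: str, rubrics: list[Rubric]) -> float:
    """Aggregate rubric scores into a fraction of the expert-defined ideal."""
    results = [judge_response(scenario, response, r) for r in rubrics]
    earned = sum(r["score"] for r in results)
    ideal = sum(r["max"] for r in results)
    return earned / ideal  # the paper reports roughly 0.60 for top models
```

The key design point the protocol encodes is ordering: the judge commits to written reasoning against each criterion before any number is produced, which is what lets the rubric capture nuance rather than a bare scalar rating.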
More granular analysis highlights specific weaknesses and strengths across models and capabilities. Models frequently interpret humor, jokes, and sarcasm literally; struggle with curiosity, offering generic or shallow follow-up questions; and tend toward over-accommodation and didactic communication styles in emotionally charged exchanges, signaling limited sensitivity to user autonomy and boundaries. Certain models, such as Claude-Sonnet-4.5, demonstrated the most stability across difficulty levels, while Gemini-3-pro-preview dropped sharply on the Hard Set, a decline the authors attribute to reliance on “social specialization” rather than robust reasoning. The framework itself proves highly reliable, with an 87% agreement rate between automated scores and evaluations from psychological experts. Together, these findings establish HeartBench both as a standardized metric for anthropomorphic Artificial Intelligence evaluation in Chinese and as a methodological blueprint for creating higher-quality, human-aligned training data that could be extended to other languages and cultures in future work.
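For concreteness, the short sketch below shows how the two summary statistics mentioned above could be computed: the agreement rate between automated and expert scores, and the relative performance drop on the Hard Set. The numbers in the usage lines are illustrative, not the paper’s data, and the agreement criterion (exact match on a discrete rubric score) is an assumption.

```python
# Sketch of the two summary statistics: judge/expert agreement rate and the
# relative score drop from the full set to the "Hard Set". Values below are
# illustrative; the exact-match agreement criterion is an assumption.
def agreement_rate(auto_scores: list[int], expert_scores: list[int]) -> float:
    """Fraction of items where the automated judge matches the expert label."""
    matches = sum(a == e for a, e in zip(auto_scores, expert_scores))
    return matches / len(auto_scores)


def hard_set_drop(full_score: float, hard_score: float) -> float:
    """Relative performance decline from the full set to the Hard Set."""
    return (full_score - hard_score) / full_score


# Illustrative usage (made-up values):
print(agreement_rate([3, 2, 4, 4], [3, 2, 4, 3]))  # 0.75
print(hard_set_drop(0.60, 0.45))                   # ~0.25, a 25% relative drop
```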
