HeartBench exposes socio-emotional limits in Chinese large language models

Researchers introduced HeartBench, a benchmark that tests how Chinese large language models handle nuanced emotional, cultural, and ethical situations, finding that leading systems reach only 60% of an expert-defined ideal score.

Researchers have introduced HeartBench, a dedicated framework for evaluating anthropomorphic intelligence in Chinese-language large language models. It targets the ability to understand and respond to complex emotional, cultural, and ethical situations rather than simple reasoning or linguistic accuracy. Built from realistic psychological counseling scenarios vetted by clinical experts, HeartBench probes how well models navigate the subtle human dynamics central to applications such as Artificial Intelligence companionship and digital mental health, where misreading emotions or cultural context can have serious consequences. The work positions anthropomorphic intelligence as a core requirement for more human-centric Artificial Intelligence systems and argues that existing benchmarks underrepresent these demands.

HeartBench is built around a theory-driven taxonomy of five primary dimensions and 15 secondary capabilities, which translates abstract traits such as empathy, curiosity, humor, and ethical autonomy into concrete, measurable criteria. The framework uses a rubric-based, “reasoning-before-scoring” protocol, requiring models to generate explicit reasoning that is evaluated against detailed scoring rubrics to capture the nuance of their responses. Researchers assessed 13 state-of-the-art large language models with HeartBench and found a substantial performance ceiling: even the leading models achieved only 60% of the expert-defined ideal score. Further analysis on a difficulty-stratified “Hard Set” of scenarios shows a significant performance decline in situations that involve subtle emotional cues and complex ethical considerations. The results show that current systems often fall back on literal interpretations or generic advice when confronted with ambiguity, indicating that pattern-matching alone is not sufficient for robust socio-emotional intelligence.
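To make the "percentage of an ideal score" idea concrete, here is a minimal sketch of rubric-based scoring. It is an illustration under stated assumptions, not the benchmark's actual implementation: the names `RubricItem`, `Judgment`, and `percent_of_ideal`, the point values, and the rule that a score is rejected when no reasoning accompanies it are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One scoring criterion (hypothetical structure, not from the paper)."""
    description: str
    max_points: int


@dataclass
class Judgment:
    """A grader's verdict for one criterion: reasoning first, then points."""
    reasoning: str
    points: int


def percent_of_ideal(rubric, judgments):
    """Return earned points as a percentage of the rubric's maximum."""
    if len(rubric) != len(judgments):
        raise ValueError("need exactly one judgment per rubric item")
    ideal = sum(item.max_points for item in rubric)
    if ideal <= 0:
        raise ValueError("rubric must have a positive maximum score")
    earned = 0
    for item, judgment in zip(rubric, judgments):
        # Assumed reading of "reasoning-before-scoring": a score without
        # written reasoning is not accepted.
        if not judgment.reasoning.strip():
            raise ValueError("each score must be preceded by reasoning")
        # Clamp to the criterion's valid range.
        earned += max(0, min(judgment.points, item.max_points))
    return 100.0 * earned / ideal


# Illustrative data only; these criteria and scores are invented.
example_rubric = [
    RubricItem("Recognizes the sarcasm in the user's message", 2),
    RubricItem("Asks a specific, non-generic follow-up question", 2),
    RubricItem("Avoids lecturing the user", 1),
]
example_judgments = [
    Judgment("The reply treats the remark as ironic.", 2),
    Judgment("The follow-up is relevant but generic.", 1),
    Judgment("The reply ends with unsolicited advice.", 0),
]
example_score = percent_of_ideal(example_rubric, example_judgments)
```

In this invented example the response earns 3 of 5 possible points, so `example_score` is 60.0.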

More granular analysis highlights specific weaknesses and strengths across models and capabilities. Models frequently interpret humor, jokes, and sarcasm literally, show limited curiosity by offering generic or shallow follow-up questions, and tend toward over-accommodation and didactic communication styles in emotionally charged exchanges, signaling limited sensitivity to user autonomy and boundaries. Certain models, such as 0.5-Sonnet, were the most stable across difficulty levels, while Gemini-3-pro-preview dropped significantly on the Hard Set, which the authors link to reliance on “social specialization” rather than robust reasoning. The framework also demonstrates high reliability, with an 87% agreement rate between automated scores and evaluations from psychological experts. This positions HeartBench as both a standardized metric for anthropomorphic Artificial Intelligence evaluation in Chinese and a methodological blueprint for creating higher quality, human-aligned training data that could be extended to other languages and cultures in future work.
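An agreement rate like the one reported can be read, in its simplest form, as the share of items on which the automated grader and a human expert reach the same verdict. The sketch below assumes that reading. The article does not say how the 87% figure was computed, so the function name `agreement_rate` and the use of exact label matching are assumptions made for illustration.

```python
def agreement_rate(auto_labels, expert_labels):
    """Fraction of items where the automated and expert verdicts match.

    Assumes both lists hold one comparable label per item (for example,
    1 for "criterion met" and 0 for "not met"); this is a hypothetical
    setup, not the paper's documented procedure.
    """
    if len(auto_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")
    if not auto_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(1 for a, e in zip(auto_labels, expert_labels) if a == e)
    return matches / len(auto_labels)


# Invented labels: the two graders agree on 3 of 4 items.
example_agreement = agreement_rate([1, 1, 0, 1], [1, 0, 0, 1])
```

Simple percent agreement does not correct for matches that would occur by chance, which is why chance-corrected statistics such as Cohen's kappa are often reported alongside it.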
