Growing numbers of people are asking generative artificial intelligence (AI) tools such as ChatGPT whether their symptoms indicate serious conditions like cancer or cardiac arrest, raising concerns about safety and accuracy. A new study published in the journal iScience evaluated how well ChatGPT and related large language models handle biomedical information, focusing on disease terms and three types of associations: drug names, genetics and symptoms. The work was led by Ahmed Abdeen Hamed, a research fellow at Binghamton University’s Thomas J. Watson College of Engineering and Applied Science, in collaboration with researchers from AGH University of Krakow in Poland, Howard University and the University of Vermont.
Hamed previously developed a machine-learning algorithm called xFakeSci that can detect up to 94% of bogus scientific papers, and he framed the new study as a step toward verifying the biomedical generative capabilities of large language models. When tested, the AI showed high accuracy in identifying disease terms (88-97%), drug names (90-91%) and genetic information (88-98%), far exceeding Hamed’s initial expectation that it would reach “at most 25% accuracy.” The system reliably labeled cancer and hypertension as diseases, fever as a symptom, Remdesivir as a drug and BRCA as a gene related to breast cancer, an outcome the researchers described as impressive given the conversational design of the model.
Performance dropped significantly when the model was asked to identify symptoms, scoring only 49-61%. The researchers suggest this gap stems from how large language models are trained versus how medical knowledge is formally structured: biomedical experts rely on ontologies to define and organize terms and relationships, while everyday users describe their health concerns in informal, social language. Hamed noted that ChatGPT uses friendly phrasing to communicate with average people and appears to simplify or “minimize the formalities of medical language” for symptoms in response to heavy user traffic.

A more serious flaw emerged in tests involving genetic data from the National Institutes of Health’s GenBank database, which assigns accession numbers such as NM_007294.4 for the Breast Cancer 1 gene (BRCA1). When prompted for these identifiers, the model simply made them up, a hallucination Hamed regards as a major failure that must be addressed. He argues that integrating biomedical ontologies directly into large language models could greatly improve accuracy, eliminate hallucinations and turn such tools into far more reliable resources. His broader goal remains exposing such flaws so data scientists can refine these systems rather than build theories on suspect information.
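One practical response to that failure mode, not described in the study but in keeping with its verification theme, is to check any model-produced accession number against the public record before relying on it. The Python sketch below is our own illustration: it queries NCBI’s E-utilities esearch endpoint for the nucleotide database, and the helper name and the second, deliberately invented accession are hypothetical.

import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def accession_exists(accession: str) -> bool:
    """Return True if NCBI's nucleotide database has a record matching the accession."""
    params = urllib.parse.urlencode({
        "db": "nuccore",    # GenBank nucleotide records
        "term": accession,  # e.g. "NM_007294.4"
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{ESEARCH}?{params}", timeout=10) as resp:
        data = json.load(resp)
    # A fabricated identifier comes back with an empty (or missing) id list.
    return bool(data.get("esearchresult", {}).get("idlist"))

if __name__ == "__main__":
    # NM_007294.4 is the BRCA1 accession cited above; the second string is a
    # made-up identifier included only for contrast.
    for acc in ("NM_007294.4", "NM_0000000.0"):
        print(acc, "->", "found in GenBank" if accession_exists(acc) else "not found")

A check like this belongs downstream of the model, which echoes Hamed’s larger point: the language model supplies fluent text, while authoritative databases and ontologies supply the ground truth.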
