K Health is expanding its virtual primary care platform by upgrading an AI physician that supports licensed clinicians across urgent care, chronic conditions, medical weight loss, and mental health. The organization moved its existing model to Gemma 3 hosted on Google Cloud’s Vertex AI, aiming for a more conversational, empathetic, and professional intake experience while also cutting inference costs. The development strategy rested on the idea that a smaller, well-tuned model can outperform larger ones when trained to internalize decision-making logic rather than merely generate content.
After evaluating Llama and other open models, the team selected Gemma 3 on Vertex AI as the best balance of computational performance and cost. Engineers set up a structured procurement flow for multi-node clusters of 16 H100 GPUs and built reusable scripts to streamline training and inference across the Gemma 3 4B, 12B, and 27B parameter variants, along with MedGemma 27B, a variant tailored for medical use. For direct preference optimization (DPO), they generated 10 synthetic chats per case, scored each on medical accuracy, conversational coherence, and clinical outcomes such as referrals, lab tests, or prescriptions, and then paired the best and worst conversations to teach the model the logic behind effective patient interactions. Gemma 3 4B improved its business score from 0.48 to 0.76 after 10 epochs, Gemma 3 12B reached 0.81 after 20 epochs, and MedGemma 27B scored 0.71 but at a higher inference cost.
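The preference-data step described above can be sketched as follows. This is a minimal illustration, not K Health's actual pipeline: the score weights, field names, and helper functions are all assumptions. The core idea is that, of the 10 synthetic chats per case, the highest-scoring one becomes the "chosen" example and the lowest-scoring one the "rejected" example, the pair format DPO trains on.

```python
# Sketch of building a DPO preference pair from scored synthetic chats.
# All names and weights here are illustrative assumptions.

def composite_score(chat):
    """Blend the three evaluation axes into one score (weights assumed)."""
    return (0.5 * chat["medical_accuracy"]
            + 0.3 * chat["coherence"]
            + 0.2 * chat["clinical_outcome"])

def build_preference_pair(case_prompt, candidate_chats):
    """From N synthetic chats for one case, keep the best conversation as
    'chosen' and the worst as 'rejected'."""
    ranked = sorted(candidate_chats, key=composite_score, reverse=True)
    return {
        "prompt": case_prompt,
        "chosen": ranked[0]["text"],
        "rejected": ranked[-1]["text"],
    }

# Toy candidates standing in for 10 generated chats.
candidates = [
    {"text": "chat A", "medical_accuracy": 0.9, "coherence": 0.8, "clinical_outcome": 0.7},
    {"text": "chat B", "medical_accuracy": 0.4, "coherence": 0.6, "clinical_outcome": 0.5},
    {"text": "chat C", "medical_accuracy": 0.7, "coherence": 0.9, "clinical_outcome": 0.8},
]
pair = build_preference_pair("Patient reports a persistent sore throat.", candidates)
```

A dataset of such pairs can then be handed to a standard DPO trainer; the mix of best and worst conversations is what exposes the decision logic the model is meant to internalize.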
The team ultimately highlighted Gemma 3 4B as the most successful configuration, validating the hypothesis that a smaller, general-purpose model fine-tuned with high-quality decision data can surpass a larger, domain-specific model in this setting. They adopted Axolotl with Accelerate on a custom multi-node virtual machine setup as the training stack, cutting training time by 66%, from 4.5 hours to 1.5 hours. Techniques such as gradient checkpointing and 8-bit precision kept memory use in check and helped prevent overfitting, and the chosen configuration achieved 90-95% accuracy. A self-reflection mechanism let the model check its own outputs for factual consistency and conversational flow, reducing the average number of API calls per chat from 100 to 60. Combined with Gemma’s lower inference costs, these gains produced substantial savings and an intake system that K Health describes as significantly more natural, efficient, and conversational for clinical use.
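The self-reflection mechanism can be sketched as a retry loop: draft a reply, have the model critique it, and regenerate only when the check fails. The `generate` and `critique` callables below are hypothetical stand-ins for real model calls; this is one plausible shape for such a loop, not K Health's implementation.

```python
# Sketch of a self-reflection loop. generate() and critique() are
# assumed stand-ins for the underlying model API calls.

def respond_with_reflection(prompt, generate, critique, max_attempts=3):
    """Draft a reply, ask the model to check it for factual consistency
    and conversational flow, and retry only on failure. Each iteration
    costs two API calls (draft + check); a clean first draft exits early,
    which is how the loop trims total calls per chat."""
    draft, calls = None, 0
    for _ in range(max_attempts):
        draft = generate(prompt)
        calls += 1
        if critique(draft):
            calls += 1
            return draft, calls
        calls += 1
    return draft, calls

# Stubbed calls for illustration: the first draft fails the check,
# the second passes, so the loop stops after four calls.
drafts = iter(["draft with inconsistency", "clean draft"])
result, n_calls = respond_with_reflection(
    "Summarize the patient's symptoms.",
    generate=lambda p: next(drafts),
    critique=lambda d: d == "clean draft",
)
```

Capping attempts and exiting on the first passing draft is what bounds cost: most chats settle in one or two iterations rather than exhausting the budget.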
