A self-adaptive large language model framework is proposed to automate extraction and structuring of earthquake-related information from heterogeneous web sources. The system targets a gap between official seismic alerts and the broader, fragmented stream of earthquake reporting published across news sites, government portals, scientific platforms, and other online sources. Instead of relying on handcrafted parsers, the framework uses transformer-based schema generation to infer selectors from sanitized HTML, then validates outputs and stores successful schemas for reuse on structurally similar pages.
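The sanitization step that precedes schema generation can be illustrated with a minimal, stdlib-only sketch. The class and function names here are hypothetical, not taken from the paper: an `HTMLParser` subclass drops script/style blocks and boilerplate attributes, keeping only structural tags, `class`/`id` attributes, and visible text, which yields the reduced HTML from which selectors would be inferred.

```python
from html.parser import HTMLParser


class Sanitizer(HTMLParser):
    """Strip scripts, styles, and non-structural attributes from raw HTML,
    keeping tags, class/id attributes, and visible text."""

    DROP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip = 0  # depth inside dropped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP:
            self._skip += 1
        elif not self._skip:
            keep = [(k, v) for k, v in attrs if k in ("class", "id") and v]
            attr_s = "".join(f' {k}="{v}"' for k, v in keep)
            self.out.append(f"<{tag}{attr_s}>")

    def handle_endtag(self, tag):
        if tag in self.DROP:
            self._skip = max(0, self._skip - 1)
        elif not self._skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.out.append(data.strip())


def sanitize(html: str) -> str:
    """Return a compact, sanitized version of the input HTML."""
    parser = Sanitizer()
    parser.feed(html)
    return "".join(parser.out)
```

For example, `sanitize('<div class="a"><script>alert(1)</script>Hello</div>')` removes the script block and returns `<div class="a">Hello</div>`; the retained `class`/`id` attributes are exactly what a selector-generation step would need.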
The pipeline combines keyword-driven web search, HTML retrieval and sanitization, schema generation, Python-based content extraction, validation, and structured storage. A repository-guided matching mechanism compares new documents with previously validated schema profiles using embeddings and a similarity threshold, allowing the system to reuse schemas when layouts are related. If reuse fails, the model generates a new schema and tests it through an iterative refinement cycle. The formal model represents webpages as DOM trees, defines field-level utility scores based on length, structure, and semantic adequacy, and selects schemas by maximizing utility under a validity constraint.
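The repository-guided matching step can be sketched as follows. The toy bag-of-words embedding, the `match_schema` helper, and the 0.8 threshold are illustrative stand-ins, not the paper's actual embedding model or tuned threshold: a new page's structural profile is embedded, compared against stored schema profiles by cosine similarity, and the best schema is reused only if it clears the threshold; otherwise the caller falls back to fresh schema generation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector standing in for a transformer embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_schema(page_profile: str, repository: dict, threshold: float = 0.8):
    """Return the name of the stored schema whose profile is most similar
    to the page, or None if no profile clears the reuse threshold
    (which would trigger new schema generation and refinement)."""
    page_vec = embed(page_profile)
    best_name, best_sim = None, 0.0
    for name, profile in repository.items():
        sim = cosine(page_vec, embed(profile))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None
```

With a repository such as `{"news_site": "article header h1 div.content p time"}`, a structurally similar page profile selects `news_site` for reuse, while a dissimilar profile returns `None`, routing the document to the generation-and-refinement cycle described above.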
The evaluation was conducted on a heterogeneous collection of 17 documents drawn from news, scientific, and governmental websites. All models achieved the same success rate of 85.0% but differed substantially in extraction accuracy and judged quality. GPT_OSS achieved the best overall results (extraction accuracy 96.5%, final score 84.26), followed by GEMMA (extraction accuracy 92.4%, final score 80.68). LLAMA showed a markedly lower extraction accuracy (52.4%) and the lowest overall score (63.13), indicating reduced robustness to heterogeneous HTML structures.
Additional metrics reinforced the performance gap. In Table 2, GPT_OSS recorded a precision of 0.965, recall of 1.000, and F1 score of 0.982; GEMMA reached 0.924, 0.941, and 0.933, and LLAMA 0.524, 0.706, and 0.601 on the same metrics. Expert evaluation by five independent specialists likewise favored GPT_OSS, which achieved an average score of 9.15, compared with 8.30 for GEMMA and 6.20 for LLAMA. Across the 17-document evaluation subset, GPT_OSS achieved the highest mean extraction score (0.965) and the best mean date quality (9.11/10), whereas LLAMA produced zero-text outputs for 5 of the 17 documents.
Error analysis showed that the dominant failure modes were Wrong Element and Missing Content, especially on domains with dense boilerplate or deeply nested templates. The domain https://tengrinews.kz recorded 50 total errors and a penalty of −64.5, while https://www.volcanodiscovery.com recorded 28 total errors and a penalty of −44.5. These findings suggest that extraction quality depends heavily on a model's ability to isolate the main content block and to identify publication dates in noisy HTML. The study positions the framework as a reproducible foundation for future work on stronger selector robustness, richer validation criteria, hybrid fallback extraction, and larger annotated benchmarks.
