Self-adaptive framework extracts earthquake data from web pages

A self-adaptive large language model framework is designed to extract and structure earthquake information from heterogeneous web sources by generating, validating, and reusing extraction schemas. In controlled tests, GPT_OSS delivered the strongest extraction quality, while selector errors were concentrated in wrong element selection and missing content.

A self-adaptive large language model framework is proposed to automate extraction and structuring of earthquake-related information from heterogeneous web sources. The system targets a gap between official seismic alerts and the broader, fragmented stream of earthquake reporting published across news sites, government portals, scientific platforms, and other online sources. Instead of relying on handcrafted parsers, the framework uses transformer-based schema generation to infer selectors from sanitized HTML, then validates outputs and stores successful schemas for reuse on structurally similar pages.
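The sanitization step that precedes schema generation can be sketched with Python's standard-library HTML parser. The `Sanitizer` class, the set of stripped tags, and the output format below are illustrative assumptions, not the paper's implementation:

```python
from html.parser import HTMLParser

# Assumed set of subtrees to drop before handing HTML to the model.
STRIP = {"script", "style", "noscript", "iframe", "svg"}

class Sanitizer(HTMLParser):
    """Drops script/style subtrees and comments; keeps tags, attributes, and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # >0 while inside a stripped subtree

    def handle_starttag(self, tag, attrs):
        if tag in STRIP:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            rendered = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in STRIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.out.append(data.strip())

def sanitize(html: str) -> str:
    parser = Sanitizer()
    parser.feed(html)
    return " ".join(parser.out)
```

Keeping `class` and `id` attributes matters here, since they are the raw material from which CSS-style selectors are later inferred.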

The pipeline combines keyword-driven web search, HTML retrieval and sanitization, schema generation, Python-based content extraction, validation, and structured storage. A repository-guided matching mechanism compares new documents with previously validated schema profiles using embeddings and a similarity threshold, allowing the system to reuse schemas when layouts are related. If reuse fails, the model generates a new schema and tests it through an iterative refinement cycle. The formal model represents webpages as DOM trees, defines field-level utility scores based on length, structure, and semantic adequacy, and selects schemas by maximizing utility under a validity constraint.
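The reuse-and-fallback logic can be sketched as follows. The bag-of-words embedding, the 0.85 similarity threshold, and the function names (`find_reusable_schema`, `select_schema`) are illustrative assumptions; the framework itself uses learned embeddings and its own utility definition:

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.85  # assumed reuse threshold

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the pipeline would use a
    # transformer embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_reusable_schema(doc_profile, repository):
    """Return the stored schema whose profile best matches the new
    document, or None if no match clears the similarity threshold."""
    doc_vec = embed(doc_profile)
    best, best_sim = None, 0.0
    for profile, schema in repository:
        sim = cosine(doc_vec, embed(profile))
        if sim > best_sim:
            best, best_sim = schema, sim
    return best if best_sim >= SIM_THRESHOLD else None

def select_schema(candidates, utility, is_valid):
    """argmax of utility over candidate schemas, subject to validity."""
    valid = [s for s in candidates if is_valid(s)]
    return max(valid, key=utility) if valid else None
```

When `find_reusable_schema` returns `None`, the framework falls back to generating fresh candidate schemas and choosing among them with `select_schema`, which mirrors the utility-maximization-under-validity formulation described above.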

The evaluation was conducted on a heterogeneous collection of 17 documents drawn from news, scientific, and governmental websites. All models achieved the same success rate of 85.0% but differed substantially in extraction accuracy and judged quality. GPT_OSS achieved the best overall results (extraction accuracy 96.5%, final score 84.26), followed by GEMMA (extraction accuracy 92.4%, final score 80.68). LLAMA showed markedly lower extraction accuracy (52.4%) and the lowest overall score (63.13), indicating reduced robustness to heterogeneous HTML structures.

Additional metrics reinforced the performance gap. In Table 2, GPT_OSS recorded precision 0.965, recall 1.000, and F1 score 0.982; GEMMA reached 0.924, 0.941, and 0.933; and LLAMA reached 0.524, 0.706, and 0.601. Expert evaluation by five independent specialists also favored GPT_OSS, which averaged 9.15, compared with 8.30 for GEMMA and 6.20 for LLAMA. Across the 17-document evaluation subset, GPT_OSS achieved the highest mean extraction score (0.965) and the best mean date quality (9.11/10), whereas LLAMA produced zero-text output on 5 of the 17 documents.
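As a rough consistency check, the reported F1 values agree with the harmonic mean of the reported precision and recall, up to rounding:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# (precision, recall, F1) triples as reported in Table 2
reported = {
    "GPT_OSS": (0.965, 1.000, 0.982),
    "GEMMA":   (0.924, 0.941, 0.933),
    "LLAMA":   (0.524, 0.706, 0.601),
}

for model, (p, r, f) in reported.items():
    # Recomputed F1 matches the published value within rounding error.
    assert abs(f1(p, r) - f) < 0.002, model
```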

Error analysis showed that the dominant failure modes were Wrong Element and Missing Content, especially on domains with dense boilerplate or complex nested templates. https://tengrinews.kz recorded 50 total errors and a penalty of −64.5, while https://www.volcanodiscovery.com recorded 28 total errors and a penalty of −44.5. The findings suggest that extraction quality depends heavily on the model’s ability to isolate the main content block and identify publication dates in noisy HTML. The study positions the framework as a reproducible foundation for future work on stronger selector robustness, richer validation criteria, hybrid fallback extraction, and larger annotated benchmarks.


Artificial Intelligence diffusion lags frontier gains

Rapid advances in Artificial Intelligence capability are not translating automatically into broad productivity growth or equitable gains. Diffusion remains uneven across firms, sectors, countries, and workers, pushing policymakers to focus on skills, governance, procurement, and measurement.

Study finds widespread weaknesses in autonomous agents

A multi-institution study found that autonomous agents across several sectors are highly exposed to tool-chaining, goal drift, and memory poisoning attacks. The findings suggest agentic systems face broader and deeper security risks than stateless large language models.

Federal safety net unprepared for Artificial Intelligence job losses

Economists are warning that the federal system designed to support displaced workers is not equipped for a wave of job losses tied to Artificial Intelligence. Existing unemployment benefits and retraining programs are widely seen as too limited to manage broad disruption.
