A self-adaptive large language model framework is proposed to automate extraction and structuring of earthquake-related information from heterogeneous web sources. The system targets a gap between official seismic alerts and the broader, fragmented stream of earthquake reporting published across news sites, government portals, scientific platforms, and other online sources. Instead of relying on handcrafted parsers, the framework uses transformer-based schema generation to infer selectors from sanitized HTML, then validates outputs and stores successful schemas for reuse on structurally similar pages.
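The sanitization step that precedes schema generation can be illustrated with a minimal, stdlib-only sketch. The class and function names here are hypothetical, not taken from the paper: an `HTMLParser` subclass drops script/style blocks and boilerplate attributes, keeping only structural tags, `class`/`id` attributes, and visible text, which yields the reduced HTML from which selectors would be inferred.

```python
from html.parser import HTMLParser


class Sanitizer(HTMLParser):
    """Strip scripts, styles, and non-structural attributes from raw HTML,
    keeping tags, class/id attributes, and visible text."""

    DROP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip = 0  # depth inside dropped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP:
            self._skip += 1
        elif not self._skip:
            keep = [(k, v) for k, v in attrs if k in ("class", "id") and v]
            attr_s = "".join(f' {k}="{v}"' for k, v in keep)
            self.out.append(f"<{tag}{attr_s}>")

    def handle_endtag(self, tag):
        if tag in self.DROP:
            self._skip = max(0, self._skip - 1)
        elif not self._skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.out.append(data.strip())


def sanitize(html: str) -> str:
    """Return a compact, sanitized version of the input HTML."""
    parser = Sanitizer()
    parser.feed(html)
    return "".join(parser.out)
```

For example, `sanitize('<div class="a"><script>alert(1)</script>Hello</div>')` removes the script block and returns `<div class="a">Hello</div>`; the retained `class`/`id` attributes are exactly what a selector-generation step would need.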
The pipeline combines keyword-driven web search, HTML retrieval and sanitization, schema generation, Python-based content extraction, validation, and structured storage. A repository-guided matching mechanism compares new documents with previously validated schema profiles using embeddings and a similarity threshold, allowing the system to reuse schemas when layouts are related. If reuse fails, the model generates a new schema and tests it through an iterative refinement cycle. The formal model represents webpages as DOM trees, defines field-level utility scores based on length, structure, and semantic adequacy, and selects schemas by maximizing utility under a validity constraint.
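The repository-guided matching step can be sketched as follows. The toy bag-of-words embedding, the `match_schema` helper, and the 0.8 threshold are illustrative stand-ins, not the paper's actual embedding model or tuned threshold: a new page's structural profile is embedded, compared against stored schema profiles by cosine similarity, and the best schema is reused only if it clears the threshold; otherwise the caller falls back to fresh schema generation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector standing in for a transformer embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_schema(page_profile: str, repository: dict, threshold: float = 0.8):
    """Return the name of the stored schema whose profile is most similar
    to the page, or None if no profile clears the reuse threshold
    (which would trigger new schema generation and refinement)."""
    page_vec = embed(page_profile)
    best_name, best_sim = None, 0.0
    for name, profile in repository.items():
        sim = cosine(page_vec, embed(profile))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None
```

With a repository such as `{"news_site": "article header h1 div.content p time"}`, a structurally similar page profile selects `news_site` for reuse, while a dissimilar profile returns `None`, routing the document to the generation-and-refinement cycle described above.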
The evaluation was conducted on a heterogeneous collection of 17 documents drawn from news, scientific, and governmental websites. All models achieved the same success rate of 85.0% but differed substantially in extraction accuracy and judged quality. GPT_OSS achieved the best overall results (extraction accuracy 96.5%, final score 84.26), followed by GEMMA (extraction accuracy 92.4%, final score 80.68). LLAMA showed a markedly lower extraction accuracy (52.4%) and the lowest overall score (63.13), indicating reduced robustness to heterogeneous HTML structures.
Additional metrics reinforced the performance gap. In Table 2, GPT_OSS recorded a precision of 0.965, recall of 1.000, and F1 score of 0.982; GEMMA reached 0.924, 0.941, and 0.933, and LLAMA 0.524, 0.706, and 0.601 on the same metrics. Expert evaluation by five independent specialists likewise favored GPT_OSS, which achieved an average score of 9.15, compared with 8.30 for GEMMA and 6.20 for LLAMA. Across the 17-document evaluation subset, GPT_OSS achieved the highest mean extraction score (0.965) and the best mean date quality (9.11/10), whereas LLAMA produced zero-text outputs for 5 of the 17 documents.
Error analysis showed that the dominant failure modes were Wrong Element and Missing Content, especially on domains with dense boilerplate or deeply nested templates. The domain https://tengrinews.kz recorded 50 total errors and a penalty of −64.5, while https://www.volcanodiscovery.com recorded 28 total errors and a penalty of −44.5. These findings suggest that extraction quality depends heavily on a model's ability to isolate the main content block and to identify publication dates in noisy HTML. The study positions the framework as a reproducible foundation for future work on stronger selector robustness, richer validation criteria, hybrid fallback extraction, and larger annotated benchmarks.
