Norway’s National Library is building a sovereign Norwegian-language large language model, using 2 PB of Huawei OceanStor Dorado all-flash storage as part of its AI training data pipeline. The effort is driven by the absence of any commercial large language model trained on the Norwegian language and culture.
The library holds 20 PB of unique digitized content, including books, newspapers, web pages, and broadcasts, and it has exclusive rights to train on copyrighted Norwegian newspaper content. The main bottleneck is not compute but data quality, cleaning, and pipeline throughput. The project centers on turning a large national collection into training-ready data that can support a model tailored to Norway’s language and cultural context.
Data flows from a 60 PB preservation archive through an on-premises pipeline that includes Nvidia DGX H200 systems, a CPU cluster, and Huawei flash arrays before reaching Norway’s Sigma2 Olivia national supercomputer, where the training itself takes place. A core technical challenge is bridging the high-latency archival storage with the low-latency storage that the AI pipeline requires.
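One common way to bridge a slow archive and a fast working tier is to prefetch into a bounded staging buffer, so that downstream processing rarely waits on an archive read. The sketch below illustrates that general pattern only; it is not the library's actual implementation. The names `fetch_from_archive`, `stage`, and `run_pipeline`, the simulated latency, and the number of staging slots are all hypothetical.

```python
import queue
import threading
import time


def fetch_from_archive(object_id: str) -> bytes:
    """Hypothetical stand-in for a slow, high-latency archive read."""
    time.sleep(0.01)  # simulated archival latency (assumed value)
    return f"contents of {object_id}".encode()


def stage(object_ids, staging: queue.Queue) -> None:
    """Producer: copy objects from the archive into the bounded fast tier."""
    for oid in object_ids:
        # put() blocks while the staging tier is full, which caps how much
        # fast storage the prefetcher can occupy at any one time.
        staging.put((oid, fetch_from_archive(oid)))
    staging.put(None)  # sentinel: nothing left to stage


def run_pipeline(object_ids, staging_slots: int = 4):
    """Consumer: process staged objects while the producer keeps prefetching."""
    staging = queue.Queue(maxsize=staging_slots)
    worker = threading.Thread(target=stage, args=(object_ids, staging), daemon=True)
    worker.start()

    processed = []
    while (item := staging.get()) is not None:
        oid, data = item
        # Placeholder for real work such as cleaning or tokenizing.
        processed.append((oid, len(data)))
    worker.join()
    return processed


if __name__ == "__main__":
    print(run_pipeline(["book-001", "newspaper-002", "broadcast-003"], staging_slots=2))
```

The bounded queue plays the role of the flash tier here: it is far smaller than the archive behind it, so objects are staged in batches and evicted as soon as they are consumed rather than copied wholesale.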
Additional work focuses on evaluation and governance. The project requires custom large language model evaluation tools for a language environment with two written forms (Bokmål and Nynorsk) and multiple dialects. It also raises governance questions about access to, and use of, a sovereign AI system built on nationally held cultural and media data.
