Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission

Hengchen, Simon; Coeckelbergs, Mathias; van Hooland, Seth; Verborgh, Ruben; Steiner, Thomas

Topic Modelling (TM) has gained momentum in the humanities over the last few years as a way to analyse the topics represented in large volumes of full text. This paper reports an experiment applying TM to a large subset of the digitised archival holdings of the European Commission (EC). Currently, millions of scanned and OCR’ed files are available and hold the potential to significantly change the way historians of the construction and evolution of the European Union can perform their research. However, due to a lack of resources, only minimal metadata are available at the file and document level, seriously undermining the accessibility of this archival collection. The article explores empirically the possibilities and limits of TM for automatically extracting key concepts from a large body of documents spanning multiple decades. By mapping the topics to headings of the EUROVOC thesaurus, the proof of concept described in this paper opens up the possibility of presenting the identified topics through a hierarchical search interface for end-users.

Published in 2016 in Proceedings of the IEEE International Conference on Big Data.

Keywords:

proof
metadata
research

Cite this article in your work

Cite this article easily using its BibTeX entry:

@inproceedings{hengchen_cas_2016,
  author = {Hengchen, Simon and Coeckelbergs, Mathias and van Hooland, Seth and Verborgh, Ruben and Steiner, Thomas},
  title = {Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the {European Commission}},
  booktitle = {Proceedings of the IEEE International Conference on Big Data},
  editor = {Joshi, James and Karypis, George and Liu, Ling and Hu, Xiaohua and Ak, Ronay and Xia, Yinglong and Xu, Weijia and Sato, Aki-Hiro and Rachuri, Sudarsan and Ungar, Lyle and Yu, Philip S. and Govindaraju, Rama and Suzumura, Toyotaro},
  year = 2016,
  month = dec,
  pages = {3245--3249},
  publisher = {IEEE},
  doi = {10.1109/BigData.2016.7840981},
}

Alternatively, pick a reference of your choice below:

ACM: Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh, and Thomas Steiner. 2016. Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. In Proceedings of the IEEE International Conference on Big Data, IEEE, 3245–3249.
APA: Hengchen, S., Coeckelbergs, M., van Hooland, S., Verborgh, R., & Steiner, T. (2016). Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. In J. Joshi, G. Karypis, L. Liu, X. Hu, R. Ak, Y. Xia, W. Xu, A.-H. Sato, S. Rachuri, L. Ungar, P. S. Yu, R. Govindaraju, & T. Suzumura (Eds.), Proceedings of the IEEE International Conference on Big Data (pp. 3245–3249). IEEE.
IEEE: S. Hengchen, M. Coeckelbergs, S. van Hooland, R. Verborgh, and T. Steiner, “Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission,” in Proceedings of the IEEE International Conference on Big Data, 2016, pp. 3245–3249.
LNCS: Hengchen, S., Coeckelbergs, M., van Hooland, S., Verborgh, R., Steiner, T.: Exploring archives with probabilistic models: Topic Modelling for the valorisation of digitised archives of the European Commission. In: Joshi, J., Karypis, G., Liu, L., Hu, X., Ak, R., Xia, Y., Xu, W., Sato, A.-H., Rachuri, S., Ungar, L., Yu, P.S., Govindaraju, R., and Suzumura, T. (eds.) Proceedings of the IEEE International Conference on Big Data. pp. 3245–3249. IEEE (2016).
MLA: Hengchen, Simon, et al. “Exploring Archives with Probabilistic Models: Topic Modelling for the Valorisation of Digitised Archives of the European Commission.” Proceedings of the IEEE International Conference on Big Data, edited by James Joshi et al., IEEE, 2016, pp. 3245–49.
