Assessing the Comparative Effectiveness of Text Models in the Document Classification Problem

Authors

  • Oleg V. Bartenyev (Олег Васильевич Бартеньев)

DOI:

https://doi.org/10.24160/1993-6982-2021-5-117-127

Keywords:

text model, classifier, neural network, data set

Abstract

Various text models used for solving natural language processing problems are considered. Each text model is applied to a document classification task, and the classification results are then used to estimate the comparative effectiveness of the models. A model is scored by the smaller of the two classification accuracy values obtained on the evaluation and training sets. A multilayer perceptron with one hidden layer serves as the classifier: it receives a real-valued vector representing the document as input and outputs a prediction of the document class. Depending on the text model used, the input vector is formed either from the frequency characteristics of the text or from the distributed vector representations of the tokens of a pre-trained text model. The results obtained demonstrate the advantage of models based on the Transformer architecture over the other models considered in the study, such as word2vec, doc2vec, and fastText.
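To make the described setup concrete, the sketch below (an illustration only, not the author's code) feeds TF-IDF document vectors [15] to a one-hidden-layer perceptron built with Keras [10] and scores the model by the minimum of the training and evaluation accuracies. The toy corpus, hidden-layer size, activations, and training parameters are assumptions made for the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from tensorflow import keras

# Toy corpus standing in for the BBC dataset used in the paper.
texts = [
    "stock markets rallied after the earnings report",
    "the striker scored twice in the final",
    "a new smartphone chip was unveiled today",
    "parliament debated the new budget proposal",
]
labels = ["business", "sport", "tech", "politics"]

# Frequency-based text model: each document becomes a real-valued TF-IDF vector.
x = TfidfVectorizer().fit_transform(texts).toarray()

# One-hot encode the class labels.
y = OneHotEncoder().fit_transform(np.array(labels).reshape(-1, 1)).toarray()

# Multilayer perceptron with a single hidden layer: the classifier takes the
# document vector as input and outputs class probabilities.
model = keras.Sequential([
    keras.Input(shape=(x.shape[1],)),
    keras.layers.Dense(64, activation="relu"),              # hidden layer (size assumed)
    keras.layers.Dense(y.shape[1], activation="softmax"),   # one output per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=20, verbose=0)

# The model score is the minimum of the accuracies on the training and
# evaluation sets; with a real corpus the second value would come from a
# held-out evaluation split rather than the training data.
train_acc = model.evaluate(x, y, verbose=0)[1]
eval_acc = train_acc
print("model score:", min(train_acc, eval_acc))

With a pre-trained distributed model (word2vec, fastText, GloVe), the TF-IDF step would be replaced by building the document vector from the model's token vectors before passing it to the same classifier.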

Author Biography

Oleg V. Bartenyev (Олег Васильевич Бартеньев)

Ph.D. (Techn.), Assistant Professor at the Applied Mathematics and Artificial Intelligence Dept., NRU MPEI; e-mail: mdf4@mail.ru

References

1. Radford A. et al. Improving Language Understanding by Generative Pre-Training [Electronic resource] www.s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed 01.03.2021).
2. Devlin J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [Electronic resource] www.arxiv.org/pdf/1810.04805.pdf (accessed 01.03.2021).
3. Vaswani A. et al. Attention Is All You Need [Electronic resource] www.arxiv.org/pdf/1706.03762.pdf (accessed 01.03.2021).
4. Lan Z. et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations [Electronic resource] www.arxiv.org/pdf/1909.11942.pdf (accessed 01.03.2021).
5. Sanh V. et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter [Electronic resource] www.arxiv.org/pdf/1910.01108.pdf (accessed 01.03.2021).
6. Mikolov T. et al. Distributed Representations of Words and Phrases and their Compositionality [Electronic resource] www.arxiv.org/abs/1310.4546 (accessed 01.03.2021).
7. Bojanowski P. et al. Enriching Word Vectors with Subword Information [Electronic resource] www.arxiv.org/pdf/1607.04606.pdf (accessed 01.03.2021).
8. Bengfort B., Bilbro R., Ojeda T. Applied Analysis of Text Data in Python: Machine Learning and Building Natural Language Processing Applications. St. Petersburg: Piter, 2019 (in Russian).
9. Bartenyev O.V. Programming Text Models in Python [Electronic resource] www.100byte.ru/python/text_models/text_models.html (accessed 01.03.2021) (in Russian).
10. Keras: The Python Deep Learning Library [Electronic resource] www.keras.io/ (accessed 01.03.2021).
11. Aparnev A.N., Bartenyev O.V. Analysis of Loss Functions in Training Convolutional Neural Networks with the Adam Optimizer for Image Classification. Bulletin of MPEI. 2020;2:90—105 (in Russian).
12. Bartenyev O.V. Parameters Affecting the Performance of a Neural Network Built with Keras [Electronic resource] www.100byte.ru/python/factors/factors.html#p8 (accessed 01.03.2021) (in Russian).
13. BBC Dataset [Electronic resource] www.mlg.ucd.ie/datasets/bbc.html (accessed 01.03.2021).
14. OneHotEncoder [Electronic resource] www.scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html (accessed 01.03.2021).
15. TfidfVectorizer [Electronic resource] www.scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html (accessed 01.03.2021).
16. CountVectorizer [Electronic resource] www.scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html (accessed 01.03.2021).
17. Blei D.M., Ng A.Y., Jordan M.I. Latent Dirichlet Allocation [Electronic resource] www.jmlr.org/papers/volume3/blei03a/blei03a.pdf (accessed 01.03.2021).
18. LatentDirichletAllocation [Electronic resource] www.scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html (accessed 01.03.2021).
19. Word2vec Embeddings [Electronic resource] www.radimrehurek.com/gensim/models/word2vec.html (accessed 01.03.2021).
20. Le Q., Mikolov T. Distributed Representations of Sentences and Documents [Electronic resource] www.cs.stanford.edu/~quocle/paragraph_vector.pdf (accessed 01.03.2021).
21. Doc2vec Paragraph Embeddings [Electronic resource] www.radimrehurek.com/gensim/models/doc2vec.html (accessed 01.03.2021).
22. FastText Model [Electronic resource] www.radimrehurek.com/gensim/models/fasttext.html (accessed 01.03.2021).
23. Pennington J., Socher R., Manning C.D. GloVe: Global Vectors for Word Representation [Electronic resource] www.nlp.stanford.edu/pubs/glove.pdf (accessed 01.03.2021).
24. Download Pre-trained Word Vectors [Electronic resource] www.nlp.stanford.edu/projects/glove/ (accessed 01.03.2021).
25. Source Code for transformers.models.bert.configuration_bert [Electronic resource] www.huggingface.co/transformers/_modules/transformers/models/bert/configuration_bert.html#BertConfig (accessed 01.03.2021).
26. Liu Y. et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach [Electronic resource] www.arxiv.org/pdf/1907.11692.pdf (accessed 01.03.2021).
27. Rothe S., Narayan S., Severyn A. Leveraging Pre-trained Checkpoints for Sequence Generation Tasks [Electronic resource] www.arxiv.org/pdf/1907.12461.pdf (accessed 01.03.2021).
28. Jiang Z. et al. ConvBERT: Improving BERT with Span-based Dynamic Convolution [Electronic resource] www.arxiv.org/pdf/2008.02496.pdf (accessed 01.03.2021).
29. Lewis M. et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension [Electronic resource] www.arxiv.org/pdf/1910.13461.pdf (accessed 01.03.2021).
30. He P. et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention [Electronic resource] www.arxiv.org/pdf/2006.03654.pdf (accessed 01.03.2021).
31. Clark K. et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [Electronic resource] www.openreview.net/pdf?id=r1xMH1BtvB (accessed 01.03.2021).
32. Dai Z. et al. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [Electronic resource] www.arxiv.org/pdf/2006.03236.pdf (accessed 01.03.2021).
33. Beltagy I., Peters M.E., Cohan A. Longformer: The Long-Document Transformer [Electronic resource] www.arxiv.org/pdf/2004.05150.pdf (accessed 01.03.2021).
34. Yang Z. et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding [Electronic resource] www.arxiv.org/pdf/1906.08237.pdf (accessed 01.03.2021).
35. Song K. et al. MPNet: Masked and Permuted Pre-training for Language Understanding [Electronic resource] www.arxiv.org/pdf/2004.09297.pdf (accessed 01.03.2021).
36. Iandola F.N. et al. SqueezeBERT: What Can Computer Vision Teach NLP about Efficient Neural Networks? [Electronic resource] www.arxiv.org/pdf/2006.11316.pdf (accessed 01.03.2021).
37. Conneau A. et al. Unsupervised Cross-lingual Representation Learning at Scale [Electronic resource] www.arxiv.org/pdf/1911.02116.pdf (accessed 01.03.2021).
38. NLPL Word Embeddings Repository [Electronic resource] http://vectors.nlpl.eu/repository/ (accessed 01.03.2021).
---
For citation: Bartenyev O.V. Assessing the Comparative Effectiveness of Text Models in the Document Classification Problem. Bulletin of MPEI. 2021;5:117—127. (in Russian). DOI: 10.24160/1993-6982-2021-5-117-127.

Published

2021-03-03

Issue

Section

System Analysis, Management and Information Processing (05.13.01)