Automatic Placement of Punctuation Marks by Using Neural Networks

Authors

  • Oleg V. Bartenyev

DOI:

https://doi.org/10.24160/1993-6982-2022-6-146-159

Keywords:

punctuation marks, classifier, transformer, neural network, data set

Abstract

The paper addresses the problem of automatically placing punctuation marks in a text that has already been divided into sentences. The positions of commas, dashes, colons, exclamation marks, and question marks are determined. Two approaches are considered.

In the first approach, the task is reduced to the classification of n-grams. The class of an n-gram is determined by the punctuation mark, if any, that follows its k-th token (n = 5, k = 3). A multilayer perceptron serves as the classifier; its input receives vector representations of n-grams built with the word2vec model.

In the second approach, a neural network with a transformer architecture, which receives an input sequence of tokens (IS) stripped of punctuation marks, is trained to generate a target sequence of tokens (TS) from which punctuation marks can be restored in the original sentence. The TS is obtained from the IS by replacing tokens associated with punctuation marks with the corresponding markers; IS tokens not associated with punctuation marks are transferred to the TS unchanged. To reduce the token dictionary, word forms are replaced by lemmas, and text elements containing characters other than letters of the Russian alphabet are replaced by special tokens; for the same purpose, first names, patronymics, surnames, toponyms, and numerals are also replaced by special tokens. A classifier of word forms is proposed as a tool for identifying parts of speech and named entities. Two types of IS and TS are considered. The IS of the second type is obtained from the IS of the first type by appending part-of-speech designations to the lemmas. The TS types differ in how tokens associated with commas are replaced by markers: in the TS of the first type, a token and the comma that follows it are replaced by a marker; in the TS of the second type, a token and the comma that precedes it are replaced by a marker.
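The two data-preparation schemes described above can be illustrated with a minimal Python sketch. This is not the author's code: the function names, the marker form `<,>`, and the treatment of sentence-final periods are assumptions made for illustration only.

```python
# Punctuation marks considered in the paper (comma, dash, colon,
# exclamation mark, question mark), assumed tokenized separately.
PUNCT = {",", "—", ":", "!", "?"}

def ngram_examples(tokens, n=5, k=3):
    """First approach: emit (n-gram, class) pairs. The class is the
    punctuation mark that follows the k-th token of the n-gram,
    or 'O' if no mark follows it."""
    words = [t for t in tokens if t not in PUNCT]
    # For each word, record the mark (if any) immediately after it.
    follow, prev_word_idx = [], -1
    for t in tokens:
        if t in PUNCT:
            if prev_word_idx >= 0:
                follow[prev_word_idx] = t
        else:
            follow.append("O")
            prev_word_idx += 1
    examples = []
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        label = follow[i + k - 1]  # mark after the k-th token of the gram
        examples.append((gram, label))
    return examples

def target_sequence(tokens):
    """Second approach, TS of the first type: a token followed by a comma
    is replaced by a single marker token (here spelled 'token<,>');
    other tokens are copied unchanged."""
    ts, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] == ",":
            ts.append(tokens[i] + "<,>")
            i += 2
        else:
            ts.append(tokens[i])
            i += 1
    return ts
```

Under this scheme the IS is simply `[t for t in tokens if t not in PUNCT]`, so the TS and IS are token-aligned, which is what allows the marks to be restored in the original sentence.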
The effectiveness of the models is estimated with the precision and F1 metrics, computed for each class and then averaged. F1 equals 0.77 for the classifier and 0.86 for the transformer.
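The macro-averaged evaluation described above can be sketched as follows (an illustrative implementation, not the author's evaluation code): per-class precision and F1 are computed from true/false positive counts and then averaged over the classes with equal weight.

```python
def macro_precision_f1(y_true, y_pred):
    """Macro-averaged precision and F1 over all classes present
    in the reference or predicted labels."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        f1s.append(f1)
    # Equal weight per class, regardless of class frequency.
    return sum(precisions) / len(classes), sum(f1s) / len(classes)
```

Macro averaging is the natural choice here because the "no mark" class vastly outnumbers the punctuation classes; a frequency-weighted average would mask errors on the rare marks.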

Author Biography

Oleg V. Bartenyev

Ph.D. (Techn.), Assistant Professor of the Applied Mathematics and Artificial Intelligence Dept., NRU MPEI, e-mail: mdf4@mail.ru

References

1. Tekstovod [Electronic resource] https://textovod.com/punctuation (accessed 01.02.2022). (in Russian).
2. Tilk O., Alumae T. Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration // Proc. INTERSPEECH Conf. San Francisco, 2016. Pp. 3047—3051.
3. Che X. et al. Punctuation Prediction for Unsegmented Transcript Based on Word Vector // Proc. Intern. Conf. Language Resources and Evaluation. Bern, 2016. Pp. 654—658.
4. Pennington J., Socher R., Manning C.D. GloVe: Global Vectors for Word Representation [Electronic resource] https://nlp.stanford.edu/projects/glove/ (accessed 01.02.2022).
5. Nagy A., Bial B., Ács J. Automatic Punctuation Restoration with BERT Models [Electronic resource] https://arxiv.org/pdf/2101.07343.pdf (accessed 01.02.2022).
6. Yi J. et al. Adversarial Transfer Learning for Punctuation Restoration [Electronic resource] https://arxiv.org/pdf/2004.00248.pdf (accessed 01.02.2022).
7. Conditional Random Fields (CRF): A Brief Overview [Official site] http://nlpx.net/archives/439 (accessed 01.02.2022). (in Russian).
8. Keras [Official site] https://keras.io/ (accessed 01.02.2022).
9. Vaswani A. et al. Attention Is All You Need // Proc. 31st Conf. Neural Information Processing Systems. Long Beach, 2017. Pp. 1—15.
10. PyTorch [Official site] https://pytorch.org/ (accessed 01.02.2022).
11. OpenCorpora (Open Corpus) [Official site] http://opencorpora.org/ (accessed 01.02.2022). (in Russian).
12. BERT in DeepPavlov [Official site] http://docs.deeppavlov.ai/en/master/features/models/bert.html (accessed 01.02.2022).
13. Morphological Dictionary of the Russian Language [Official site] https://morfologija.ru/ (accessed 01.02.2022). (in Russian).
14. Great Academic Dictionary of the Russian Language [Electronic resource] https://www.livelib.ru/pubseries/1658152-bolshoj-akademicheskij-slovar-russkogo-yazyka (accessed 01.02.2022). (in Russian).
15. Morphological Analyzer pymorphy2 [Official site] https://pymorphy2.readthedocs.io/en/stable/ (accessed 01.02.2022). (in Russian).
16. Mikolov T. et al. Distributed Representations of Words and Phrases and Their Compositionality [Electronic resource] www.arxiv.org/abs/1310.4546 (accessed 01.02.2022).
17. LogSoftmax [Electronic resource] https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html (accessed 01.02.2022).
18. KLDivLoss [Electronic resource] https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html (accessed 01.02.2022).
19. The Annotated Transformer [Electronic resource] http://nlp.seas.harvard.edu/2018/04/03/attention.html (accessed 01.02.2022).
20. Ba J.L., Kiros J.R., Hinton G.E. Layer Normalization [Electronic resource] https://arxiv.org/abs/1607.06450 (accessed 01.02.2022).
21. Srivastava N. et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting // J. Machine Learning Research. 2014. V. 15. Pp. 1929—1958.
---
For citation: Bartenyev O.V. Automatic Placement of Punctuation Marks by Using Neural Networks. Bulletin of MPEI. 2022;6:146—159. (in Russian). DOI: 10.24160/1993-6982-2022-6-146-159

Published

2022-01-19

Issue

Section

Mathematical and Software Support of Computer Systems, Complexes and Computer Networks (Technical Sciences) (2.3.5)