Rechevye Tekhnologii (Speech Technologies), Issue No. 1/2019

Gabdrakhmanova N. T.
Clustering Documents Using Neural Networks


Abstract. This paper considers the automation of document clustering, document classification, and dynamic document classification (the situational task). A clustering method is proposed that uses the local clustering coefficient for graphs. The clustering algorithm is based on structural analysis of the graph. Representing a text as a graph makes it possible to define a discrete analogue of Ricci curvature on a metric space, as is done in the work of Ollivier. To solve the document classification problem with neural networks, regularizers based on the concepts introduced are proposed.

Keywords: lexeme, cluster, local clustering coefficient, neural network, local curvature coefficient.
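
As an illustration of the local clustering coefficient named in the abstract, here is a minimal sketch in Python. It assumes the standard definition for an undirected, unweighted graph: for a vertex with k neighbours, the share of the k(k-1)/2 possible neighbour pairs that are themselves connected. The abstract does not say which variant the article uses, and the function name `local_clustering` and the toy lexeme graph are hypothetical.

```python
from itertools import combinations


def local_clustering(adj):
    """Local clustering coefficient of every vertex.

    adj maps each vertex to the set of its neighbours
    (undirected graph, no self-loops). Assumed standard definition,
    not necessarily the one used in the article.
    """
    coeffs = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            # Fewer than two neighbours: no pairs to check, use 0 by convention.
            coeffs[v] = 0.0
            continue
        # Count neighbour pairs that are connected to each other.
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        coeffs[v] = 2.0 * links / (k * (k - 1))
    return coeffs


# Hypothetical toy graph whose vertices are lexemes.
graph = {
    "neural": {"network", "cluster", "graph"},
    "network": {"neural", "cluster"},
    "cluster": {"neural", "network"},
    "graph": {"neural"},
}
coeffs = local_clustering(graph)
```

In this toy graph only one of the three neighbour pairs of "neural" is connected, so its coefficient is 1/3, while "network" and "cluster" sit in a closed triangle and get 1.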

CLUSTERING DOCUMENTS USING NEURAL NETWORKS

Gabdrakhmanova N. T., Candidate of Technical Sciences, Associate Professor, Department of Higher Mathematics, Peoples' Friendship University of Russia (RUDN), Moscow

Abstract. A new algorithm for clustering documents based on neural networks, weighted graphs, and adjacency matrices is proposed. Neural networks derive their power from parallel processing and the ability to learn on their own. Constructing a weighted graph for a document requires first formalizing the object being modeled. The proposed clustering algorithm is as follows. Suppose we have N documents, which serve as the training set for the neural network. Let each document already be divided into lexemes. A lexeme is a unit of the vocabulary of a language: the totality of the forms of a single word. For each document a weighted graph is constructed according to the following rule: the vertices of the graph are lexemes; two vertices are connected by an edge if the corresponding lexemes occur in the same sentence; the weight of the edge is the relative frequency of the lexemes in the text. In clustering tasks, we call the connective words in the text, such as "so" and "however", "noise". To smooth out this noise we use filtering: we set some threshold h and remove all edges with weight less than h. Based on the constructed weighted graph, we write the adjacency matrix Ai, where i is the document number. With every adjacency matrix Ai we associate the class of the document Yi. We obtain the tuples (Ai, Yi), i = 1, 2, ..., N, for training the neural network. Once trained, the neural network can be used to cluster documents: the adjacency matrix of a document is fed to the input, and the document class number is produced at the output. As future work, it is proposed to develop this clustering approach using the methods of modern geometry.

Keywords: lexeme, cluster, weighted graph, adjacency matrix, neural network.
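
The graph construction described in the abstract can be sketched roughly as follows in Python. This is only an illustration under stated assumptions. The abstract does not define "relative frequency" precisely, so the edge weight is assumed here to be the number of sentences in which a pair of lexemes co-occurs, divided by the total count over all pairs. The function name `build_adjacency`, the sample sentences, and the threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(sentences, h):
    """Build a filtered weighted adjacency matrix for one document.

    sentences: list of sentences, each a list of lexemes.
    h: threshold; edges with weight below h are dropped.
    Returns (vocab, A), where A is a symmetric list-of-lists matrix.
    """
    # Count, for each unordered lexeme pair, the sentences containing both.
    pair_counts = Counter()
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            pair_counts[(a, b)] += 1
    total = sum(pair_counts.values())

    vocab = sorted({lex for sent in sentences for lex in sent})
    index = {lex: i for i, lex in enumerate(vocab)}
    A = [[0.0] * len(vocab) for _ in vocab]

    for (a, b), count in pair_counts.items():
        weight = count / total  # assumed meaning of "relative frequency"
        if weight >= h:         # filtering step: keep only edges with weight >= h
            A[index[a]][index[b]] = weight
            A[index[b]][index[a]] = weight
    return vocab, A


# Hypothetical document, already split into sentences and lexemes.
sentences = [
    ["neural", "network", "cluster"],
    ["neural", "network"],
    ["however", "cluster"],
]
vocab, A = build_adjacency(sentences, h=0.3)
```

With this input only the edge between "network" and "neural" (weight 2/5) survives the filter, and the edge involving the connective "however" is removed. Each matrix would then be paired with its class label Yi to form a training tuple. To feed matrices from several documents into one network they would presumably need a shared vocabulary ordering, which the abstract does not discuss.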