December 22, 2024

Applications of Deep Learning in Machine Health Monitoring

Auto-encoders and Their Variants

As a feed-forward neural network, an auto-encoder consists of two phases, an encoder and a decoder, and is designed to learn a new representation of the data by trying to reconstruct its input. The encoder takes an input [katex]\textbf{x}[/katex] and transforms it to a hidden representation [katex]\textbf{h}[/katex] via a non-linear mapping as follows:

\textbf{h} = \varphi(\textbf{Wx}+\textbf{b})

where [katex]\varphi[/katex] is a non-linear activation function; commonly used choices include softmax, ReLU, tanh and sigmoid. The decoder then maps the hidden representation back to the original space in a similar way:

\textbf{z} = \varphi(\textbf{W}^\prime\textbf{h}+\textbf{b}^\prime)
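To make the two mappings concrete, here is a tiny NumPy sketch of the encoder and decoder transforms; the dimensions and the choice of sigmoid for [katex]\varphi[/katex] are arbitrary assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden = 8, 3
W, b = rng.standard_normal((n_hidden, n_input)), np.zeros(n_hidden)     # encoder parameters W, b
W_p, b_p = rng.standard_normal((n_input, n_hidden)), np.zeros(n_input)  # decoder parameters W', b'

def phi(x):
    """One common choice of non-linear activation (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-x))

x = rng.random(n_input)
h = phi(W @ x + b)        # encoder: hidden representation h
z = phi(W_p @ h + b_p)    # decoder: reconstruction z
```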

The model parameters [katex]\theta = [\textbf{W} , \textbf{b} , \textbf{W}^\prime , \textbf{b}^\prime][/katex] are optimized to minimize the reconstruction error between [katex]\textbf{z} = f_\theta(\textbf{x})[/katex] and [katex]\textbf{x}[/katex]. One commonly adopted measure of the average reconstruction error over a collection of [katex]N[/katex] data samples is the squared error, and the corresponding optimization problem can be written as follows:

\min_\theta \frac{1}{N} \sum^N_{i=1}\lVert\textbf{x}_i-f_\theta(\textbf{x}_i)\rVert^2_2

where [katex]\textbf{x}_i[/katex] is the [katex]i[/katex]-th sample. Clearly, the AE can be trained in an unsupervised way, and the hidden representation [katex]\textbf{h}[/katex] can be regarded as a more abstract and meaningful representation of the data sample [katex]\textbf{x}[/katex].
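As a concrete illustration of the unsupervised training just described, below is a minimal PyTorch sketch of a single-hidden-layer auto-encoder trained with the average squared reconstruction error; the layer sizes, optimizer settings and random batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Single-hidden-layer auto-encoder: h = phi(Wx + b), z = phi(W'h + b')."""
    def __init__(self, n_input=784, n_hidden=128):
        super().__init__()
        self.encoder = nn.Linear(n_input, n_hidden)   # W, b
        self.decoder = nn.Linear(n_hidden, n_input)   # W', b'

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))            # hidden representation h
        z = torch.sigmoid(self.decoder(h))            # reconstruction z = f_theta(x)
        return z, h

# Minimize the average squared reconstruction error over the samples (unsupervised).
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                               # a batch of unlabeled samples
for step in range(100):
    z, _ = model(x)
    loss = ((x - z) ** 2).sum(dim=1).mean()           # (1/N) * sum_i ||x_i - f_theta(x_i)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```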

Adding sparsity

To prevent the learned transformation from being the identity and to regularize the auto-encoder, a sparsity constraint is imposed on the hidden units. The corresponding optimization function is updated as:

\min_\theta \frac{1}{N}\sum^N_{i=1}\lVert\textbf{x}_i-f_\theta(\textbf{x}_i)\rVert^2_2+\beta\sum^m_{j=1}KL(p\Vert p_j)

where [katex]m[/katex] is the hidden layer size, the second term is the summation of the KL-divergence over the hidden units, and [katex]\beta[/katex] is a weight controlling the sparsity penalty term. The KL-divergence on the [katex]j[/katex]-th hidden neuron is given as:

KL(p\Vert p_j) = p\log\left(\frac{p}{p_j}\right)+(1-p)\log\left(\frac{1-p}{1-p_j}\right)

where [katex]p[/katex] is the predefined mean activation target and [katex]p_j[/katex] is the average activation of the [katex]j[/katex]-th hidden neuron over the entire dataset. Given a small [katex]p[/katex], adding the sparsity constraint leads the learned hidden representation to be a sparse representation. This variant of the AE is therefore named the sparse auto-encoder.
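A hedged sketch of how this sparsity penalty could be added to the reconstruction loss, reusing the AutoEncoder model and batch x from the sketch above; the target activation p and weight beta are illustrative values, and p_j is estimated on the batch as a stand-in for the whole dataset.

```python
import torch

def kl_sparsity_penalty(h, p=0.05, eps=1e-8):
    """KL(p || p_j) summed over the m hidden units.

    h: (batch, m) hidden activations in (0, 1); p_j is the mean activation of
    the j-th hidden neuron over the batch (a stand-in for the whole dataset).
    """
    p_j = h.mean(dim=0).clamp(eps, 1 - eps)
    kl = p * torch.log(p / p_j) + (1 - p) * torch.log((1 - p) / (1 - p_j))
    return kl.sum()

# Sparse auto-encoder objective: reconstruction error + beta * sparsity penalty.
beta = 3.0
z, h = model(x)
loss = ((x - z) ** 2).sum(dim=1).mean() + beta * kl_sparsity_penalty(h)
```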

Adding denoising

Different from the conventional auto-encoder (AE), the denoising auto-encoder (DA) takes a corrupted version of the data as input and is trained to reconstruct/denoise the clean signal from the corrupted sample. The most common corruptions are additive noise and binary masking noise, which randomly sets a fraction of the input features to zero. The DA is a variant of the AE that learns more robust representations and prevents the model from learning the identity transformation.
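A minimal sketch of binary masking noise and the denoising objective, again reusing the AutoEncoder model and batch x from above; the corruption level is an arbitrary assumption.

```python
import torch

def mask_corrupt(x, corruption=0.3):
    """Binary masking noise: randomly force a fraction of the input features to zero."""
    mask = (torch.rand_like(x) > corruption).float()
    return x * mask

# Denoising auto-encoder: reconstruct the clean x from the corrupted input.
x_tilde = mask_corrupt(x)
z, _ = model(x_tilde)
loss = ((x - z) ** 2).sum(dim=1).mean()   # error is measured against the clean signal
```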

Stacked structure

Several DAs can be stacked together to form a deep network that learns representations by feeding the outputs of the [katex]l[/katex]-th layer as inputs to the [katex](l+1)[/katex]-th layer, and the training is done greedily, one layer at a time.
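The greedy layer-wise stacking could look roughly like the following sketch, which reuses the AutoEncoder class and mask_corrupt helper from the earlier snippets; the layer sizes and the number of training steps per layer are assumptions.

```python
import torch

layer_sizes = [784, 256, 64]           # assumed architecture: input -> 256 -> 64
pretrained = []                        # trained auto-encoders, one per layer
inputs = x                             # unlabeled training batch from above

for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    ae = AutoEncoder(n_in, n_hid)
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for step in range(100):            # train this layer's DA in isolation
        z, _ = ae(mask_corrupt(inputs))
        loss = ((inputs - z) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    pretrained.append(ae)
    with torch.no_grad():              # outputs of layer l become inputs to layer l+1
        _, inputs = ae(inputs)
```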

Since auto-encoders can be trained in an unsupervised way, the auto-encoder, and especially the stacked denoising auto-encoder (SDA), can provide an effective pre-training solution by initializing the weights of a deep neural network (DNN) before the model is trained. After layer-wise pre-training of the SDA, its parameters are used to initialize all the hidden layers of the DNN. Supervised fine-tuning is then performed to minimize the prediction error on labeled training data; usually, a softmax or regression layer is added on top of the network to map the output of the last AE layer to the targets. The whole process is shown in Fig. 1. Compared to arbitrary random initialization, the SDA-based pre-training protocol gives DNN models better convergence. It should be noted that training deep neural networks often suffers from vanishing or exploding gradients with the commonly adopted tanh or sigmoid activation functions, which is why the unsupervised pre-training enabled by AEs is meaningful and powerful. The ReLU activation, popularized around 2012, relieved this problem to some extent and made direct supervised training of deep architectures such as deep convolutional neural networks and recurrent neural networks possible (see Fig. 2).
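One possible shape of this pre-training-then-fine-tuning protocol in code, using the pretrained list and layer_sizes from the stacking sketch above; the classification head, number of classes and stand-in labels are illustrative assumptions (CrossEntropyLoss plays the role of the softmax output layer).

```python
import torch
import torch.nn as nn

# Build a DNN whose hidden layers are initialized from the pre-trained encoders,
# with a classification layer added on top (softmax applied inside the loss).
hidden = [nn.Sequential(ae.encoder, nn.Sigmoid()) for ae in pretrained]
dnn = nn.Sequential(*hidden, nn.Linear(layer_sizes[-1], 10))   # 10 target classes assumed

labels = torch.randint(0, 10, (x.size(0),))                    # stand-in labels
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-4)

for step in range(100):                                        # supervised fine-tuning
    logits = dnn(x)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```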

Fig. 1. Diagram of the unsupervised pre-training and supervised fine-tuning of SAE-DNN (a) and DBN-DNN (b).
Fig. 2. Frameworks of RBM, DBN and DBM; shaded boxes denote hidden units.

RBMs and their variants

As a special type of Markov random field, the restricted Boltzmann machine (RBM) is a two-layer neural network forming a bipartite graph. It consists of two groups of units, visible units [katex]\textbf{v}[/katex] and hidden units [katex]\textbf{h}[/katex], under the constraint that there are symmetric connections between visible and hidden units and no connections between nodes within a group.

Given the model parameters [katex]\theta = [\textbf{W}, \textbf{b}, \textbf{a}][/katex], the energy function is given as:

E(\textbf{v}, \textbf{h},\theta) = -\sum^I_{i = 1}\sum^J_{j = 1}w_{ij}v_ih_j-\sum^I_{i = 1}b_iv_i-\sum^J_{j = 1}a_jh_j

where [katex]w_{ij}[/katex] is the connecting weight between visible unit [katex]v_i[/katex], whose total number is [katex]I[/katex], and hidden unit [katex]h_j[/katex], whose total number is [katex]J[/katex]; [katex]b_i[/katex] and [katex]a_j[/katex] denote the bias terms for the visible and hidden units, respectively. The joint distribution over all the units is calculated based on the energy function [katex]E(\textbf{v}, \textbf{h}, \theta)[/katex] as:

p(\textbf{v}, \textbf{h}, \theta) = \frac{\exp(-E(\textbf{v}, \textbf{h}, \theta))}{Z}

where [katex]Z=\sum_{\textbf{h},\textbf{v}}\exp(-E(\textbf{v}, \textbf{h}, \theta))[/katex] is the partition function or normalization factor. The conditional probabilities of the hidden and visible units [katex]\textbf{h}[/katex] and [katex]\textbf{v}[/katex] can then be calculated as:

\begin{aligned}
p(h_j = 1\vert \textbf{v};\theta) &= \delta\left(\sum^I_{i=1}w_{ij}v_i+a_j\right) \\
p(v_i = 1\vert \textbf{h};\theta) &= \delta\left(\sum^J_{j=1}w_{ij}h_j+b_i\right)
\end{aligned}

where [katex]\delta[/katex] is the logistic function, i.e., [katex]\delta(x) = \frac{1}{1+\exp(-x)}[/katex]. The RBM is trained to maximize the joint probability, and the learning of [katex]\textbf{W}[/katex] is done through a method called contrastive divergence (CD).
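Below is a rough NumPy sketch of a binary RBM trained with one step of contrastive divergence (CD-1), with the energy function included for reference; the sizes, learning rate, toy data and the single Gibbs step are assumptions rather than the exact recipe of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 6, 4                                    # numbers of visible and hidden units
W = 0.01 * rng.standard_normal((I, J))         # w_ij
b = np.zeros(I)                                # visible biases b_i
a = np.zeros(J)                                # hidden biases a_j

def sigmoid(x):                                # logistic function 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    """E(v, h) = -sum_ij w_ij v_i h_j - sum_i b_i v_i - sum_j a_j h_j."""
    return -v @ W @ h - b @ v - a @ h

def cd1_step(v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a batch of visible vectors v0."""
    global W, b, a
    ph0 = sigmoid(v0 @ W + a)                          # p(h_j = 1 | v)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                        # p(v_i = 1 | h), reconstruction
    ph1 = sigmoid(pv1 @ W + a)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n           # positive minus negative statistics
    b += lr * (v0 - pv1).mean(axis=0)
    a += lr * (ph0 - ph1).mean(axis=0)

v_batch = (rng.random((32, I)) < 0.5).astype(float)    # toy binary training batch
for _ in range(100):
    cd1_step(v_batch)
```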

Deep belief network (DBN)

A deep belief network (DBN) can be constructed by stacking multiple RBMs, where the output (hidden units) of the [katex]l[/katex]-th layer is used as the input (visible units) of the [katex](l+1)[/katex]-th layer. Similar to the SDA, a DBN can be trained in a greedy layer-wise unsupervised way. After pre-training, the parameters of this deep architecture can be further fine-tuned, either with respect to a proxy for the DBN log-likelihood, or with respect to the labels of the training data by adding a softmax layer on top, as shown in Fig. 1(b).
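A hedged sketch of the greedy layer-wise construction of a DBN: each RBM is trained with CD-1 on the hidden activation probabilities of the layer below. The small RBM class, layer sizes and toy data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)
        self.a = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return 1.0 / (1.0 + np.exp(-(v @ self.W + self.a)))    # p(h = 1 | v)

    def cd1_step(self, v0, lr=0.1):
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = 1.0 / (1.0 + np.exp(-(h0 @ self.W.T + self.b)))  # p(v = 1 | h)
        ph1 = self.hidden_probs(pv1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += lr * (v0 - pv1).mean(axis=0)
        self.a += lr * (ph0 - ph1).mean(axis=0)

# Greedy layer-wise training: the hidden layer of RBM l feeds RBM l+1 as visible data.
layer_sizes = [6, 8, 4]                                        # assumed DBN architecture
data = (rng.random((32, layer_sizes[0])) < 0.5).astype(float)  # toy binary data
dbn = []
for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(100):
        rbm.cd1_step(data)
    dbn.append(rbm)
    data = rbm.hidden_probs(data)                              # inputs for the next layer
```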

Deep Boltzmann machine (DBM)

A deep Boltzmann machine (DBM) can be regarded as a deep-structured RBM in which the hidden units are grouped into a hierarchy of layers instead of a single layer. Following the connectivity constraint of the RBM, full connections are only allowed between adjacent layers, and no connections are allowed within a layer or between non-adjacent layers. The main difference between a DBN and a DBM is that a DBM is a fully undirected graphical model, whereas a DBN is a mixed directed/undirected one. Unlike a DBN, a DBM is trained as a joint model, so its training is more computationally expensive than that of a DBN.
