Title: Prognostics and Health Management (PHM): Current status and future directions in theory and practice

Abstract: We are performing the digital transition of industry, living the 4th industrial revolution, building a new world in which the digital, physical and human dimensions are interrelated in complex socio-cyber-physical systems. For the sustainability of these transformations, knowledge, information and data must be integrated within model-based and data-driven approaches of Prognostics and Health Management (PHM) for the assessment and prediction of the evolution of structures, systems and components (SSCs) and of process behaviors, so as to allow anticipating failures and avoiding accidents, thus aiming at improved safe and reliable design, operation and maintenance. There is already a plethora of methods available for many potential applications and more are being developed: yet, a number of critical problems still impede the full deployment of PHM and its benefits in practice. In this respect, this paper aims neither at surveying existing works as an introduction to PHM nor at providing new tools or methods for its further development; rather, it points out the main challenges and directions of advancement for the full deployment of condition-based and predictive maintenance in practice.
Prognostics and Health Management (PHM) is a computation-based paradigm that elaborates on physical knowledge, information and data [1] of structures, systems and components (SSCs) operation and maintenance, to enable detecting equipment and process anomalies, diagnosing degradation states and faults, and predicting the evolution of degradation to failure so as to estimate the remaining useful life (Fig. 1). The outcomes of the PHM elaboration are used to support condition-based and predictive maintenance decisions for the efficient, reliable and safe operation of SSCs [2][3][4][5]. In fact, the capability of deploying these maintenance strategies provides the opportunity of setting efficient, just-in-time and just-right maintenance strategies: in other words, providing the right part to the right place at the right time. This opportunity is significant, as it would maximize production profits and minimize all costs and losses, including asset losses [6]. As a result, in the past decade PHM research and development has intensified, both in academia and industry, involving various disciplines of mathematics, computer science, operations research, physics, chemistry, materials science, engineering, etc. [7,8].
For making reliability and safety decisions using PHM outcomes in practice, it is necessary to identify, understand and quantify the impacts and benefits that the development of a PHM system can have on the health management of a SSC (e.g. avoiding unexpected catastrophic failures, reducing maintenance frequency, optimizing spare parts and storage, optimizing resources, etc.). Then, the practical implementation of PHM includes data acquisition to enable detection, diagnostics and prognostics tasks, and maintenance decision making [9] (Fig. 1). The supporting PHM development framework (Fig. 2) and its requirements must, then, be properly defined to perform well in real industrial scenarios [9][10][11]. Given the increasing complexity, integration and informatization of modern engineering SSCs, PHM can no longer be an isolated addition for supporting maintenance but must be closely linked to the structural, power, electromechanical, information and communication technology (ICT) and control parts of the systems. Then, PHM must be included at the beginning of the system conceptualization, and carried through its design and development in an integrated framework capable of satisfying the overall operation and performance requirements [12,13].
Finally, for the use of PHM in practice, the question of which methods to use is fundamental. For example, referring, in particular, to the prognostic task of PHM, the prediction capability of a prognostic method refers to its ability to provide trustable predictions of the Remaining Useful Life (RUL), with the quality characteristics and confidence level required for making decisions based on such predictions. Indeed, this heavily influences the decision makers’ attitude toward taking the risk of using the predicted RUL outcomes to inform their decisions [14]. The choice of which method to use is typically driven by the data available and/or the physics-based models available, and the cost-benefit considerations related to the implementation of the PHM system. A set of Prognostic Performance Indicators (PPIs) must be used to guide the choice of the approach to be implemented, within a structured framework of evaluation. These PPIs measure different characteristics of a prognostic approach and need to be aggregated to enable a final choice of prognostic method, based on its overall performance [15]. For this reason, various performance metrics have been defined to enable the evaluation of the performance of PHM methods [16]. These metrics are needed to guide the PHM system development (Fig. 2).
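The aggregation of PPIs into an overall performance score can be done, for instance, by a weighted sum; the sketch below is a minimal illustration with hypothetical indicator names, scores and weights (all values are assumptions for illustration, not from the paper or from [15,16]).

```python
import numpy as np

# Hypothetical scores of three candidate prognostic methods on four
# illustrative PPIs (accuracy, precision, timeliness, robustness),
# each already normalized to [0, 1]; higher is better.
ppi_scores = np.array([
    [0.90, 0.70, 0.60, 0.80],  # method A
    [0.75, 0.85, 0.80, 0.70],  # method B
    [0.60, 0.90, 0.90, 0.65],  # method C
])

# Weights encode the decision maker's priorities; they sum to 1.
weights = np.array([0.4, 0.3, 0.2, 0.1])

# Simple additive aggregation into one overall score per method.
overall = ppi_scores @ weights

best = int(np.argmax(overall))
print(overall, "-> best method index:", best)
```

More sophisticated aggregation schemes (e.g. multi-criteria decision methods) follow the same pattern: normalize each indicator, then combine according to stated preferences.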
Up to now, for the maturation of PHM, the main efforts have been devoted to the development of hardware (e.g., Internet of Things (IoT) devices, smart meters, etc. [17][18][19][20]) and software for tracking the health state of monitored equipment (e.g., data analytics, platforms for IoT interconnection and cloud computing, etc. [21][22][23]). On the other hand, the full deployment of PHM in practice involves other aspects, including design (e.g. the use of smart components may lead to different reliability allocation solutions), and impacts various work units involved in maintenance decisions and actuations (e.g., workers can use smart systems, maintenance engineers can analyze big data), including the supporting logistics (spare parts availability and warehouse management can be driven by the PHM results) [17].
In this paper, we present some main challenges for the development of PHM in practice, corroborated by practical examples, and associate with some of them the developments of Recurrent Neural Networks (RNNs), Reservoir Computing (RC), Generative Adversarial Networks (GANs), Deep Neural Networks (DNNs) and Optimal Transport Theory (OTT), as potential directions to successfully address them.
Main challenges to the deployment of PHM in practice still remain, coming from different sides:
• the physics of the problem
• the data available
• the requirements of the solutions.
The challenges related to the physics of the problem derive from the complexity of the SSCs degradation processes, which are not completely known, dynamic and highly non-linear, and hence their understanding, characterization and modeling are difficult.
The challenges related to the data relate to multiple aspects (Fig. 3):
• the many anomalies in the real data collected in the field (including missing data and erroneous data from malfunctioning sensors)
• the scarcity and incompleteness of data recognizably related to the state of degradation of the SSC of interest (labelled patterns)
• the difficulty of managing and treating big data, with a large variety of signals collected by sensors of different types
• the changing operational and environmental conditions which affect the data used to train the PHM models and calibrate their parameters, and on which the models are applied.
The challenges related to the requirements of the PHM solutions come from the multiple objectives that they must achieve, depending on the applications. The obvious ones are accuracy and precision, quantified with defined performance indicators and measured against the decisions that they support: in some cases, very high accuracy and precision is required to be able to take confident decisions (e.g. of stopping a system upon an alert of fault detection, of replacing a component upon a fault diagnosis, of anticipating or postponing a scheduled maintenance based on accurate remaining useful life predictions); in other cases, accuracy and precision need not be so high, and may be traded off against other objectives. For example, transparency, explainability and interpretability of PHM models are attributes of particular interest, if not demanded, for decision making in safety-critical applications, for which they may also be a regulatory prerequisite. Also, PHM as a data-dependent enabling technology for smart condition-based and predictive maintenance has issues regarding security. Indeed, the technological network supporting PHM is made of devices, communication technologies and various protocols, so that security issues regarding availability, data integrity, data confidentiality and authentication exist. As these issues hamper operational efficiency, robustness and throughput, they must be adequately addressed.
Finally, an enveloping challenge to the deployment of PHM in practice comes from the fact that the PHM tasks of fault detection, diagnostic and prognostic are inevitably affected by various sources of uncertainty, such as incomplete knowledge on the present state of the equipment, randomness in the future operational usage profile and future evolution of the degradation of the equipment, inaccuracy of the PHM model and uncertainty in the values of the signal measurements used by the PHM model to elaborate its outcomes, etc. Therefore, any outcome of a PHM model should be accompanied by an estimate of its uncertainty, in order to confidently take robust decisions based on such outcome, considering the degree of mismatch between the PHM model outcomes and the real values.
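As a minimal illustration of accompanying a prognostic outcome with an estimate of its uncertainty, the following sketch simulates an assumed linear-drift degradation model by Monte Carlo and reports a RUL point estimate together with an empirical confidence interval (the model and all parameter values are illustrative assumptions, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stochastic degradation model: linear drift plus Gaussian noise;
# failure is declared when degradation crosses a threshold.
n_paths, horizon = 2000, 200
drift, noise_std, threshold = 0.05, 0.02, 5.0
current_level = 2.0  # degradation level estimated at the present time

increments = drift + noise_std * rng.standard_normal((n_paths, horizon))
paths = current_level + np.cumsum(increments, axis=1)

# RUL of each simulated path = first time step at which the threshold is crossed.
crossed = paths >= threshold
rul = np.where(crossed.any(axis=1), crossed.argmax(axis=1) + 1, horizon)

# Report a point prediction together with an uncertainty estimate.
rul_mean = rul.mean()
rul_lo, rul_hi = np.percentile(rul, [5, 95])
print(f"RUL ~ {rul_mean:.0f} steps, 90% interval [{rul_lo:.0f}, {rul_hi:.0f}]")
```

The interval width directly reflects the process noise and the distance to the failure threshold, making the confidence in the prediction explicit for the decision maker.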
With specific reference to data-driven methods and models for the tasks of fault detection, fault diagnostics and failure prognostics in PHM, the next section addresses some of the above challenges, with a focus on advanced methods that are proving promising for their solution.
Methods of fault detection, fault diagnostics and failure prognostics within the PHM framework are continuously being developed and advanced, and applications to various SSCs are being deployed, supported by the technology of sensors and monitoring systems, the techniques of data analytics, image processing and text mining, mostly based on the Artificial Intelligence (AI) and Machine Learning (ML) paradigms, and the available computational power [24]. The objective of fault detection is to recognize abnormalities/anomalies in SSCs behavior. The objective of fault diagnostics is to identify the SSCs degradation states and the causes of degradation. Prognostics aims at predicting the SSCs Remaining Useful Life (RUL), i.e. the time left before it will no longer be able to perform its intended function. Fault detection and diagnostics, and failure prognostics are the enablers of condition-based and predictive maintenance, which offers major opportunities for Industry 4.0 and smart SSCs, as they can allow reducing failures, increasing SSCs usage, and reducing operation and maintenance costs, with tangible benefits of reduction of production downtime, risk and asset losses, and consequent increase of production profit [24].

Fig. 1. PHM tasks. The data collected from industrial component sensors feeds three major PHM tasks: fault detection (anomaly detection), fault diagnostics (degradation level assessment) and fault prognostics (remaining useful life prediction). The successful deployment of PHM provides solid foundations for optimal maintenance decisions, and thus improves the safety of industrial SSCs while reducing cost.
A number of challenges still remain, arising from the complexity of the physics that PHM addresses in practice, from the data available and from the requirements placed on PHM solutions for practical applications. In this Section, we go through some of these challenges, to see where we stand, where we are going and where we need to go.
As mentioned above, fault detection is the PHM task which aims at identifying the presence of abnormalities/anomalies during the operation of a SSC. While such abnormalities/anomalies are commonly referred to as faults in certain disciplines, such as energy and mechanical engineering, the term damage is commonly used in other disciplines, such as structural engineering. In practical applications, fault/damage detection is challenging because the presence of the fault/damage must be assessed based on signals of physical variables measured during the SSC operation, and this process is complicated by the various sources of uncertainty that can render the signal processing extremely difficult.
Fault detection methods are classified as model-based and data-driven [25]. Model-based methods use first principles and physical laws to describe the physical phenomena and processes of interest [26][27][28]. For example, [26] builds a model of the behavior of a rotor using the finite element method and successfully applies it to fault detection; [27] introduces a model-based fault detection and isolation technique for manufacturing machinery based on a defined relationship between a fault signal and observer theory; [28] presents a two-level Bayesian approach based on the use of Hidden Markov Models (HMMs) and Expectation Maximization (EM) to detect early faults in a milling machine. However, the practical application of model-based methods is limited by the difficulty of developing accurate mathematical models of the processes and behaviors of complex modern SSCs [29].
For this reason, data-driven fault detection methods are more popular than model-based ones, as they rely only on data for the recognition of anomalous patterns attributable to faults [30][31][32][33][34][35]. For example, [30] develops a fault detection method for power generation systems, by combining Principal Component Analysis (PCA) for feature extraction and Random Forest (RF) for fault behavior pattern learning. Support Vector Machine (SVM) techniques have been introduced to detect faults considering concept drift in nuclear power plants [31], and to detect faults in high-speed train brake systems in the case of highly imbalanced data [35]. Neural Network-based approaches have attracted attention in fault detection: e.g., [32] combines a set of Artificial Neural Networks (ANNs) through Bayesian statistics for heavy-water nuclear reactor fault detection and uncertainty quantification, [34] uses ANNs to detect false alarms in wind turbines for reliability-centered maintenance, and [33] introduces a Recurrent Neural Network (RNN) with optimized hyperparameters for the detection of software faults.
These methods can be divided into those which rely on one-class classification models and those which use residuals, i.e., the differences between the real measurements and the reconstructed values of the signals in normal conditions, to identify the normal/abnormal conditions [36].
The former require training of a one-class classification model on signal measurements collected from both normal (healthy) and abnormal/anomalous (faulty) conditions of SSCs. However, in practical applications, faults are rare and the data have manifold distributions embedded in high-dimensional spaces. Distributions with non-smooth densities and the curse of dimensionality of the data in the long-term multivariate time series collected from sensors on real industrial SSCs can cause model overfitting and render the empirical reconstruction of the data distribution difficult, which leads to unsuccessful detection of abnormal/anomalous conditions in SSCs behavior. These technical issues hamper the successful deployment of one-class classification methods for fault detection in practical applications. The need is, then, to develop methods able to detect anomalous (faulty) conditions given data in normal conditions, and to deal with the manifold distribution and large dimensionality of real data. In this direction, Generative Adversarial Networks (GANs) are an interesting perspective as they can be used to reproduce complex distributions, e.g. manifolds [37,38]. An example is given in the work by [39], which proposes an Auto-Encoder aided GAN (AE-GAN) model for the detection of abnormal/anomalous conditions in the behavior of a SSC, in which the generator of the GAN and an auxiliary encoder form an AE module, and the reconstruction error generated by the AE is used as a score to detect abnormalities/anomalies in the SSC behavior. Adaptive noise is added to the data and AdaBoost ensemble learning is adapted to integrate the AE-GANs applied to detect anomalies in each small time slice of the long-term multivariate time series collected by the sensors [40].
Furthermore, this work derives a lower bound of the Jensen-Shannon divergence between the generator distribution and the normal data distribution as an objective to optimize the AE-GAN hyperparameters; by probing, the optimization works without test data, as commonly needed by other methods. Extensive experiments are conducted on real industrial datasets to demonstrate the usefulness of the developed AdaBoost ensembled AE-GAN method for abnormality/anomaly detection.

Fig. 2. PHM development framework for informed decision-making.

Residual-based fault detection methods rely on the use of normal-conditions (healthy) data only [41]. These methods reconstruct the values of the signals expected in normal conditions and use the residuals, i.e., the differences between the real measurements and the reconstructed signals, to identify the normal/abnormal conditions. Examples of residual-based methods include Auto-Associative Kernel Regression (AAKR) [42][43][44], Principal Component Analysis (PCA) [45], One-Class Support Vector Machine (OC-SVM) [46], and Artificial Neural Networks (ANNs) [47]. The empirical model, fitted to the data so as to provide accurate signal reconstructions, plays an essential role in the above procedure. However, its training may require a large amount of healthy data collected under various operating conditions [48]. Besides, different choices of the reconstruction model may yield different detection results [49].
Eventually, the detection of an abnormal condition is confirmed by considering whether the obtained residuals exceed a threshold or by statistical tests. For example, [43] uses the Sequential Probability Ratio Test (SPRT) on the residuals obtained from an AAKR model; [50] applies T2- and Q-statistics of the PCA residuals to detect damages in structures; [51] establishes a statistical hypothesis model in the residual subspace of the PCA transform, to detect and isolate sensor faults based on a Bayesian formulation and the generalized likelihood ratio test (GLRT). Notice that, although these methods assume a certain distribution of the residuals, most distributions of real-world data may be a priori unknown or may not actually follow the assumed distributions [52].
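A residual-based detection scheme of the kind described above can be sketched as follows, with synthetic data, a PCA reconstruction model and an empirically calibrated Q-statistic threshold (all modeling choices are illustrative assumptions, not those of the cited works).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for healthy multivariate monitoring data:
# 3 correlated signals driven by one latent factor, plus sensor noise.
latent = rng.standard_normal((500, 1))
healthy = latent @ np.array([[1.0, 0.8, -0.5]]) + 0.05 * rng.standard_normal((500, 3))

# Fit PCA on healthy data (via SVD) and keep the first principal component.
mean = healthy.mean(axis=0)
_, _, vt = np.linalg.svd(healthy - mean, full_matrices=False)
pc = vt[:1]  # retained principal direction(s)

def q_statistic(x):
    """Squared reconstruction error (Q-statistic) of sample(s) x."""
    centered = np.atleast_2d(x) - mean
    reconstructed = centered @ pc.T @ pc
    return ((centered - reconstructed) ** 2).sum(axis=1)

# Detection threshold set empirically as the 99th percentile of healthy residuals.
threshold = np.percentile(q_statistic(healthy), 99)

normal_sample = latent[0] * np.array([1.0, 0.8, -0.5])
faulty_sample = normal_sample + np.array([0.0, 1.5, 0.0])  # injected sensor fault

print(q_statistic(normal_sample) <= threshold)  # normal: below threshold
print(q_statistic(faulty_sample) > threshold)   # faulty: above threshold
```

In place of the empirical percentile, a statistical test such as the SPRT or GLRT mentioned above can be applied to the same residuals.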
Another challenge of fault detection lies in the data pre-processing [53] needed to extract features providing the information useful for enabling the detection. Various pre-processing techniques, such as the Fast Fourier Transform (FFT) [54], the Continuous Wavelet Transform [55] and Mathematical Morphology [56], have been applied to raw signals, and the processed outcomes have been fed to fault detection models [57]. The quality of the features selected by pre-processing strongly impacts the detection results, but unfortunately there is no universal rule for choosing the optimal pre-processing method.
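As an example of such pre-processing, the following sketch extracts simple spectral features from a synthetic vibration-like signal via the FFT (the signal parameters and feature choices are illustrative assumptions).

```python
import numpy as np

# Illustrative FFT-based pre-processing of a raw vibration-like signal.
fs = 1000.0                       # sampling frequency [Hz]
t = np.arange(0, 1.0, 1.0 / fs)   # 1 s of data
rng = np.random.default_rng(2)
signal = (np.sin(2 * np.pi * 60 * t)            # fundamental component
          + 0.3 * np.sin(2 * np.pi * 180 * t)   # harmonic
          + 0.1 * rng.standard_normal(t.size))  # measurement noise

spectrum = np.abs(np.fft.rfft(signal)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)

# Candidate features for a downstream fault detector.
dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
spectral_energy = float((spectrum ** 2).sum())

print(f"dominant frequency: {dominant_freq:.0f} Hz")
```

A fault that shifts energy toward new frequency bands would change such features, which is what makes them informative inputs for detection.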
Recently, transport-related methods are being considered for applications in PHM. They have already been successfully employed in other domains [58], involving signal and image processing [59], computer vision [60], machine learning and statistics [61,62]. Commonly used optimal transport distances include Wasserstein distance (or Kantorovich distance) [63] and Earth Mover’s distance (EMD) [64]. Wasserstein distance has proved a promising statistic for the nonparametric two-sample test [65].
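For one-dimensional empirical samples of equal size, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted samples (the empirical quantile functions), which makes it cheap to use as a two-sample statistic; a minimal sketch with synthetic healthy and monitored data (the distributions are assumptions for illustration):

```python
import numpy as np

def wasserstein_1d(x, y):
    """Wasserstein-1 distance between equal-size 1-D empirical samples:
    mean absolute difference of the sorted samples."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(3)
healthy = rng.normal(0.0, 1.0, 5000)        # reference (healthy) data
monitored_ok = rng.normal(0.0, 1.0, 5000)   # same distribution
monitored_bad = rng.normal(0.8, 1.3, 5000)  # shifted and spread: degraded

score_ok = wasserstein_1d(healthy, monitored_ok)
score_bad = wasserstein_1d(healthy, monitored_bad)
print(score_ok, score_bad)  # the degraded sample scores much higher
```

The score grows with both location and shape differences between the distributions, without any parametric assumption on them, which is the distribution-free property exploited by the OT-based detection method discussed below.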
In the PHM area, [66] has studied the bearing diagnostics problem using EMD combined with dynamical system reconstruction. [67] has used a PCA scheme combined with the Kantorovich distance (KD) for fault detection in the process industry. [68] has developed an OT method in which the abnormality score is built using the Wasserstein distance, and has verified its performance considering the detection of abnormal conditions in bearings. The method differs from other state-of-the-art methods for fault detection, since it directly deals with raw signals and does not require the use of signal reconstruction methods or feature extraction; it is also distribution-free, i.e., it does not require formulating any a priori hypothesis on the distribution of the data. The basic idea behind the method is to generate an abnormality score, based on the Wasserstein distance, to quantify the dissimilarity between the probability distributions of the currently monitored and healthy data. The Cumulative Distribution Transform (CDT) [69] is used to find the univariate Optimal Transport (OT) solution. The method has been applied to a real bearing dataset and successfully compared with two other fault detection methods from the literature: a Z-test based method [70] and a PCA-based method for signal reconstruction, combined with the Q-statistic for residual analysis [71]. The AdaBoost ensembled AE-GAN method mentioned earlier [39] can also be adapted for application to normal-conditions data only. The generator of the GAN and the auxiliary encoder form the AE module, and the reconstruction error generated by the AE is used as the score to detect abnormalities/anomalies in SSC behavior.
For the abnormality/anomaly detection, it is assumed, as usual, that the probability distribution of the abnormal/anomalous-conditions data is significantly different from that of the normal-conditions data: as the generator can only reproduce the distribution of the normal-conditions data, the AE always successfully reconstructs the normal data but fails to reconstruct the abnormal ones. So, any test sample processed through the AE-GAN is declared anomalous if the AE reconstruction error is larger than a certain predefined threshold. For dealing with the high dimensionality of the data, again, an ensemble framework can be used. Non-overlapped sliding time windows are introduced to partition the multivariate time series and a separate data sample for each time window is analyzed by an AE-GAN for abnormality/anomaly detection. Finally, the AdaBoost algorithm is used to aggregate the abnormality/anomaly detection results over the time windows. The GAN-based method for addressing the challenge of missing fault data in fault detection is shown in Fig. 4. Table 1 summarizes the fault detection techniques, with specific regard to the challenge of missing fault data.
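The windowing-and-aggregation idea can be sketched independently of the GAN machinery; below, a hypothetical per-window anomaly score (a plain baseline-deviation stand-in, not the trained AE-GAN reconstruction error) is computed over non-overlapping windows and thresholded, with a simple calibration in place of AdaBoost (window length, injected anomaly and calibration rule are all assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic long multivariate time series: 4 signals x 1200 time steps.
series = rng.standard_normal((4, 1200))
series[:, 900:] += 2.0  # inject an anomalous regime in the last segment

window = 100  # non-overlapping time-window length (an assumed choice)
n_windows = series.shape[1] // window

# Hypothetical per-window score: deviation of the window mean from a
# healthy baseline (a stand-in for the AE-GAN reconstruction error).
baseline = series[:, :300].mean(axis=1)
scores = []
for w in range(n_windows):
    chunk = series[:, w * window:(w + 1) * window]
    scores.append(float(np.abs(chunk.mean(axis=1) - baseline).mean()))

# Per-window decisions, calibrated on the early (healthy) windows
# (simple thresholding in place of the trained AdaBoost aggregation).
threshold = np.mean(scores[:6]) + 5 * np.std(scores[:6])
flags = [s > threshold for s in scores]
print(flags)
```

The key design point carried over from the AE-GAN ensemble is that each window is scored independently, so long series never have to be modeled in one shot.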
Fault diagnostics requires data analytics capable of identifying the SSCs degradation states and the causes of degradation. Methods such as Bayesian Networks (BNs) [74], Decision Trees (DTs) [75], Linear Discriminant Analysis (LDA) [75], K-Nearest Neighbors (KNN) [75], Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) [75][76][77] have been successfully used in applications of different industrial and civil sectors [78]. These methods rely on supervised learning of labelled data, which, however, are rarely available in practice, so that their real application is limited [79]: real application calls for unsupervised learning of unlabeled data. Unsupervised learning is an important topic in machine learning for time series segmentation [80,81] and pattern recognition [21,82,83]. In fault diagnostic applications, it is used to provide abstract representations of the raw measurement data and obtain various clusters representing healthy and faulty conditions [22][84][85][86]. In the work of [22], a Categorical Adversarial Autoencoder (CatAAE) has been proposed for unsupervised learning aimed at fault diagnostics of rolling bearings. In the work of [84], a diagnostic methodology based on unsupervised Spectral Clustering (SC) combined with fuzzy C-means (FCM) has been developed for identifying groups of similar shutdown transients performed by a nuclear turbine. In [85], the Self-Organizing Map (SOM) has been used for clustering and identifying degradation states of a railway-signal system. In [86], a methodology combining k-means and Association Rule Mining (ARM) has been developed to mine failure data and diagnose interconnections between failure occurrences in wind turbines. Representation learning can disentangle the different explanatory factors of variation behind the data, making it easier to extract and organize the discriminative information when building fault diagnostic models [87][88][89][90][91][92].
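A toy version of such unsupervised diagnostics, clustering signal features into candidate health states with plain k-means, can be sketched as follows (the feature vectors and the two latent health states are synthetic assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic 2-D feature vectors (e.g. RMS-like and kurtosis-like features)
# for unlabeled condition-monitoring data from two latent health states.
healthy_feats = rng.normal([1.0, 3.0], 0.1, size=(50, 2))
degraded_feats = rng.normal([2.5, 6.0], 0.1, size=(50, 2))
features = np.vstack([healthy_feats, degraded_feats])

# Plain k-means (k = 2, Lloyd's algorithm) with deterministic initialization.
centers = features[[0, -1]].copy()
for _ in range(20):
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([features[labels == k].mean(axis=0) for k in range(2)])

# The recovered clusters are candidate degradation states; naming them still
# requires domain analysis or a few labelled examples.
print(np.round(centers, 1))
```

This last point is exactly the labelled-data challenge discussed above: clustering partitions the data, but interpreting the clusters as specific degradation states requires additional knowledge.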
In traditional unsupervised methods for fault diagnostics, the features are extracted by applying ad hoc signal processing techniques to the collected signals, e.g. Fourier spectral analysis and Wavelet transformations [93]. The processing is heavily dependent on a priori knowledge and diagnostic expertise [22,94], and can be quite time-consuming and labor-intensive [88]. Since representation learning is adaptively capable of learning features from raw data, it can constitute an excellent a priori choice for the development of diagnostic techniques. In the work of [95], an unsupervised sparse filtering method based on a two-layer neural network is used to directly learn features from mechanical vibration signals. In the work of [96], a Spatio-Temporal Pattern Network (STPN) based on Probabilistic Finite State Automata (PFSA) and Markov machines is proposed to represent temporal and spatial structures for fault diagnostics in complex systems.
However, these conventional representation learning methods cannot capture long-term temporal dependencies in the time series, and they typically entail high computational complexity.
From the above, it is seen that traditional fault diagnostic approaches typically require the acquisition of signal measurements from SSCs whose true degradation state is known. However, acquiring such labelled data is a difficult, expensive and labor-intensive task. Furthermore, streaming data collected from online-monitored SSCs have long-term temporal dependencies, with which unsupervised learning methods have a hard time dealing, because these dependencies are limited by the size of the sliding time window used for the analysis. Hence, there is a need for advancements in the methods to estimate the degradation level at a given time on the basis of a few run-to-failure trajectories with long-term temporal dependencies and for which the true degradation state is unknown.
Fig. 4. Illustration of the GAN-based method in fault detection w.r.t. the challenge of missing fault data [39]. The GAN-based method is a type of distribution reconstruction method, which reproduces the normal-conditions data distribution by the Generator and uses an extra Encoder to form an Auto-Encoder, from which anomaly scores (reconstruction errors) are obtained to distinguish whether samples are anomalous or not.
In the work of [97], for example, a two-stage method for unsupervised learning is proposed for fault diagnostic applications, inspired by the idea of representing temporal patterns by a mechanism of neurodynamical pattern learning, called Conceptor. Considering a reservoir, i.e. a randomly generated and sparsely connected RNN [98], Conceptors can be understood as filters characterizing the geometries of the temporal states of the reservoir neurons in the form of square matrices [99], achieving a direction-selective damping of high-dimensional reservoir states [100]. The proposed method develops in two stages.
In the first stage, the Conceptors extracted from the training run-to-failure degradation trajectories are clustered into several non-overlapping time series segments representing different degradation levels. In the second stage, the Conceptors and corresponding labels obtained in the first-stage clustering are used to train a Convolutional Neural Network (CNN) for real-time diagnosis of the SSC degradation level. The CNN receives as input the Conceptors extracted from the reservoir states at the current time, which contain information about the long-term evolution of the SSC degradation, and the difference between the Conceptors extracted at the present and previous time steps, which contains information about the short-term degradation variation. The proposed method has been applied to two literature case studies concerning bearing fault diagnostics. The results show satisfactory accuracy and efficiency of the method. The reservoir computing-based method for addressing the challenge of missing labels of the degradation state is shown in Fig. 5.
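The central object of this approach, the Conceptor, can be computed from the reservoir state correlation matrix R as C = R(R + α⁻²I)⁻¹, where α is the so-called aperture. A minimal sketch follows, with illustrative (hypothetical) reservoir sizes and scalings rather than those of [97]:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical echo-state reservoir: size and scalings chosen for illustration.
n_res = 50
W = rng.normal(size=(n_res, n_res))
W *= 0.8 / max(abs(np.linalg.eigvals(W)))   # set the spectral radius to 0.8
w_in = rng.normal(size=n_res)

def reservoir_states(u):
    """Drive the reservoir with a scalar input series u and collect states."""
    x = np.zeros(n_res)
    states = []
    for ut in u:
        x = np.tanh(W @ x + w_in * ut)
        states.append(x.copy())
    return np.array(states)

def conceptor(states, aperture=10.0):
    """C = R (R + aperture**-2 I)^-1, with R the state correlation matrix."""
    R = states.T @ states / len(states)
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(R.shape[1]))

t = np.arange(300)
C_slow = conceptor(reservoir_states(np.sin(2 * np.pi * t / 40)))
C_fast = conceptor(reservoir_states(np.sin(2 * np.pi * t / 8)))
# The singular values of a Conceptor lie in [0, 1), so C acts as a "soft
# projection" onto the state directions excited by the input; different
# input dynamics leave different Conceptor matrices, which can be clustered.
```

The square matrices C_slow and C_fast differ because the two input dynamics excite different state geometries; in the first stage of [97], it is precisely such Conceptor matrices that are clustered into degradation levels.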
Table 2 summarizes the fault diagnostics techniques, with specific regard to the challenge of missing labels for degradation states.
Prognostics is concerned with the prediction of the future evolution to failure of the state of a SSC. It involves the processing of data to predict the future degradation of the SSC structural and functional attributes, based on which the SSC failure probability and RUL are estimated. The prognostic outcomes are used for the health management of the SSC, which seeks to use the prognosis to decide on and actuate operational actions and maintenance interventions. To the uncertainties coming from the use of the data available from the sensors, common also to the detection and diagnosis tasks, prognostics adds further challenges related to the future evolution of the usage profile and operational environment, whose uncertainties affect the degradation state evolution. This makes it practically impossible to precisely predict the future evolution of the SSC state of health, and it is necessary to account for the different sources of uncertainty that affect prognostics within a systematic framework for uncertainty quantification and management [102].
Prognostics is dependent on the available knowledge, information and data on the process of degradation. There may be situations in which a sufficient quantity of run-to-failure data has been collected during the life of the SSCs, which can be used to develop empirical (data-driven) models. In other cases, the degradation mechanism is known and a physics-based model is available. On these bases, prognostics approaches can be grouped into three categories: (i) model-based, (ii) data-driven and (iii) hybrid:
• Model-based approaches use physics-based degradation models to predict the future evolution of the SSCs degradation state and infer the time at which the degradation will reach the failure threshold. These approaches have been applied with success in various practical cases, e.g., to pneumatic valves [103], Li-Ion batteries [104], the residual heat removal subsystem of a nuclear power plant [105], and structures subject to fatigue degradation [106]. In the case of complex SSCs, subject to multiple and competing degradation mechanisms, accurate physics-based models are, however, often not available.
• Data-driven approaches directly extract from the data the degradation law for SSCs RUL prediction [107]. Such approaches include conventional numerical time series techniques, as well as AI and data mining algorithms, such as similarity-based [108] and regression-based methods [107]. A variety of AI techniques, such as Convolutional Neural Networks (CNNs) [109,110], Denoising Auto-Encoders (DAEs) [111], Long Short-Term Memory (LSTM) networks [109,112,113,114], Gated Recurrent Units (GRUs) [115], SVMs [116] and Adjacency Difference Neural Networks (ADNNs) [117], are applied to RUL estimation of different industrial systems and components. The performances of data-driven approaches depend on the quantity and informative quality of the data available to develop the predictive models.
• Hybrid approaches combine all the available sources of knowledge, information and data. They bring the advantages of both model-based and data-driven methods. Specifically, they can integrate the robustness and interpretability of model-based methods with the specificity and accuracy of data-driven methods.
For instance, [118] combines Kalman Filtering (KF) with data-driven approaches, [119] integrates the Health Indicator (HI) and regression model, [120] combines Relevance Vector Machine (RVM) and Particle Filtering (PF), and [121] integrates a physical model and the Least Square (LS) method to estimate RUL of a variety of industrial equipment.
Fault detection techniques and applications (table entries): method of [26]: rotor crack diagnostics; RF: [30] power generator fault detection; AAKR: [42] non-linear multimode process fault detection, [43] reactor coolant pump fault detection, [44] power plant fault detection; EMD: [66] bearing fault detection; GAN-based: [39] high-speed train automatic door; observer theory: [27] rotor fault detection; [31] early fault detection of a numerical case; [35] high-speed train brake fault detection; PCA: [45] air handling unit fault detection; distance-based: [67] tank heater simulation case fault detection, [68] bearing fault detection; [28] mechanical equipment early fault detection; ANN: [32] heavy-water reactor early fault detection, [34] wind turbine false alarm detection, [47] wind turbine gearbox fault detection; [33] software fault detection; [46] building air conditioning system fault detection; CDT: [69] numerical case.
Traditional fault prognostic methods face the challenge of dealing with incomplete and noisy data collected at irregular time steps, e.g. in correspondence of the occurrence of triggering events in the system. For example, for monitoring the degradation and failure processes of bearings in large turbine units, signal measurements collection (e.g., vibration signals measured by eddy current displacement sensors measuring the radial vibration of the rotor at both ends, the axial vibration of the rotor, and sensors measuring the unit rotating speed) is only triggered by abnormal behaviors of the units, such as large environmental noise and anomalous vibration behavior. These “snapshot” datasets are often encountered in industrial applications, dominated by the necessity of cost saving in storing and managing the databases, and of reducing energy consumption and bandwidth resources. Since failure events are rare, event-based datasets are dominated by missing measurements, where the values of all signals are missing at the same time. With these characteristics, traditional methods for missing data management, e.g. case deletion, imputation [122][123][124][125] and maximum likelihood estimation [126], are difficult to apply. For instance, since case deletion methods discard patterns whose information is incomplete, they are not useful in case of event-based datasets where a pattern is either present or absent for all signals [126]. Imputation techniques, which are based on the idea that a missing value of a signal can be replaced by a statistical indicator of the probability distribution generating the data, such as the signal mean value [127] or a value predicted by a multivariable regression model, have been shown to be inaccurate in case of large fractions of missing values in the dataset [128]. Maximum Likelihood methods use the available data to identify the values of the probability distribution parameters with the largest probability of producing the sample data. They typically require the Missing At Random (MAR) assumption, i.e. that the probability of having a missing value does not depend on the missing values [127,129], which is not met in event-based datasets. Few research works have considered fault prognostics in presence of missing data. A model based on Auto-Regressive Moving Average (ARMA) and Auto-Associative Neural Networks (AANNs) has been developed for fault diagnostics and prognostics of water process systems with incomplete data [130].
An integrated Extreme Learning Machine (ELM)-based imputation-prediction scheme has been proposed for prognostics of battery data with missing values [125], and a hybrid architecture combining physics-based and data-driven approaches has been proposed to deal with missing data in a rotating machinery prognostic application [131].
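The failure of the MAR assumption in event-based datasets, and the resulting bias of simple imputation, can be shown in a few lines. The numbers below are hypothetical: a signal is logged only when its amplitude exceeds an "event" threshold, so the missingness depends on the unobserved values themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical amplitude series; a measurement is stored only when the
# event trigger fires (amplitude above a threshold), so low values are
# systematically missing: the data are NOT Missing At Random.
true_signal = rng.normal(loc=1.0, scale=0.3, size=10_000)
observed = np.where(true_signal > 1.2, true_signal, np.nan)  # event-based logging

frac_missing = np.isnan(observed).mean()
mean_imputed = np.nanmean(observed)       # what mean imputation would fill in
bias = mean_imputed - true_signal.mean()
# The imputed value is biased upward, because the missing (no-event) samples
# are exactly the low-amplitude ones; mean imputation cannot recover them.
```

With roughly three quarters of the samples missing, the imputed mean overshoots the true mean substantially, which is the mechanism behind the inaccuracy reported for large missing fractions [128] and the MAR violation noted above.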
In the medical field, a Bayesian simulator has been used to generate missing data for developing prognostic models [132] and a Multiple Imputation approach has been embedded within a prognostic model for assessing overall survival of ovarian cancer in presence of missing covariate data [133]. Notice that all these methods are based on the two successive steps of missing data reconstruction and prediction.
Then, advancements and new methods are still needed to enable predicting the RUL of a SSC on the basis of measurements collected only when triggering events occur, such as SSC faults or extreme operational conditions, and providing an estimate of the uncertainty affecting the RUL prediction. As an example, [134] has developed a method based on Echo-State Networks (ESNs) to directly predict the RUL of a SSC without requiring reconstruction of the missing data. ESNs are considered because of their ability to maintain information about the input history inside the reservoir states. The main difficulty is that, contrarily to the typical applications of ESNs, the time intervals at which the data become available are irregular. Two different strategies have been considered to cope with the event-based data collection. In one strategy, the ESN receives an input pattern only when an event occurs. The pattern is formed by the measured signals and the time at which the event has occurred. In a second strategy, the reservoir states are excited at each time step. If an event has occurred, the reservoir states are excited both by the previous reservoir states and the measured signals, whereas, if an event has not occurred, they are excited only by the previous reservoir states. By so doing, the connection loops in the reservoir allow reconstructing the SSC dynamic degradation behavior at those time steps in which events do not occur. A Multi-Objective Differential Evolution (MODE) algorithm based on Self-adaptive Differential Evolution with Neighborhood Search (SaNSDE) [135] is used to optimize the ESN hyper-parameters. The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) [136] is, then, used to select the optimal solution from the obtained Pareto solutions. Furthermore, a bootstrap aggregating (Bagging) ensemble method is applied to improve the RUL prediction accuracy and estimate the RUL prediction uncertainty.
Given that ESNs cannot be fed by random sequences of patterns, the traditional Bagging sampling mechanism used to create the bootstrap training sets has been modified. In the proposed solution, the bootstrap training sets are obtained by concatenating entire run-to-failure trajectories, randomly sampled with replacement. The benefits of the proposed methods are shown by application to the prediction of the RUL of a sliding bearing of a turbine unit. The ESN-based one-step RUL prediction method for the challenge of missing data, i.e., event-based measurements, is shown in Fig. 6.
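The second strategy described above (reservoir updated at every time step, input injected only when an event occurs) can be sketched as follows. The reservoir size, weight scalings and event series are illustrative assumptions, not the configuration of [134]:

```python
import numpy as np

rng = np.random.default_rng(1)

n_res = 30
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.linalg.norm(W, 2)   # largest singular value < 1: fading memory
w_in = rng.normal(size=n_res)

def run_reservoir(measurements):
    """Reservoir update at every step; the input is injected only at the
    time steps where an event produced a measurement (None = no event)."""
    x = np.zeros(n_res)
    states = []
    for m in measurements:
        drive = W @ x                  # excitation by the previous state
        if m is not None:
            drive = drive + w_in * m   # event: excitation by the measurement
        x = np.tanh(drive)
        states.append(x.copy())
    return np.array(states)

# Hypothetical event-based series: only three steps carry a measurement.
series = [None] * 100
for t_event in (5, 30, 70):
    series[t_event] = 0.8
states = run_reservoir(series)
# Between events the state keeps evolving and slowly fades, carrying the
# input history forward instead of requiring missing-data reconstruction.
```

The connection loops of the reservoir thus propagate the last event's information through the no-event time steps, which is what allows a readout to estimate the degradation state between measurements.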
Table 3 summarizes fault prognostics techniques, with specific regard to the challenge of missing data, i.e. event-based measurements.
The ability to correctly interpret a PHM model’s output, be it the detection of a fault, its diagnosis or its prognosis, is extremely important, and particularly so in safety-critical applications like those concerning the high-risk systems and processes of the chemical, nuclear and aerospace industries, to name a few. It allows understanding the state of the system or process being modeled, and supports analytic reasoning and prescriptive decision making on whether to intervene (or not) and how. It also engenders appropriate trust by the analyst, providing insights on how the model works. The importance of this is such that in some applications simple models (e.g., even linear models) are preferred for their ease of interpretation, even if they may be less accurate than complex ones. Yet, the growing availability of big data for PHM has increased the benefits of using complex models for achieving accuracy, at the expense of model intelligibility. This brings to the forefront the need for a trade-off between the accuracy of the model and the interpretability of its output. A wide variety of methods have recently been proposed to address this issue, but an understanding of how these methods relate and when one method is preferable to another is still lacking.
Most models and algorithms for PHM are developed and trained to maximize accuracy, neglecting interpretability and causality. Accounting for these aspects may, indeed, lead to a loss in performance, but would enhance their safe, reliable and robust use, in terms of both avoiding undesired biases and reducing uncertainty. Understanding why a PHM model makes a certain prediction can be as crucial as the prediction’s accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability [137]. Some general attributes sought in the interpretability of PHM models are:
• fairness: no discrimination in algorithm decisions, which could come from bias in the collected data
• robustness: small changes in input should not cause big changes in output
• causality: causal relations are picked up from the model and rendered explicit
• quantifiable reliability of outcomes and predictions.
The awareness of the relevance of transparency, explainability and interpretability of PHM models is growing as a need and a requirement, particularly for supporting decision making in safety-critical systems, for which it may also be a regulatory prerequisite. For example, in Nuclear Power Plants (NPPs), there is still resistance to the deep penetration of digital I&C systems and PHM, because of the difficulty of testing performance under all postulated conditions, on one side, and of guaranteeing reliability based on transparent understanding and interpretation, on the other side. The decision making related to tasks of control, operation, maintenance and safety of NPPs, which has traditionally relied on procedures and expert evaluation and judgment, is gradually being assisted by intelligent machines (i.e. software algorithms) for PHM, developed and trained on the basis of big and customized data: how far and in which ways this can be permitted in safety-critical systems that require licensing depends also on the possibility of interpreting the causality of their outputs.
For the modeling approaches to PHM based on learning from data, one issue lies in possible biases in the training set, which are then inherited by the trained model. In this sense, achieving robustness in PHM models is fundamental, and one way to proceed is to try to design inherently interpretable models, i.e. so as to exclude all undesired features that are not causally related to the outcome. By examining interpretable models:
• features or functions capturing quirks in the data can be noted and excluded, thereby avoiding related harm in the successive use of the model output, and improving the understanding of the phenomena analyzed
• knowledge can be extracted, in terms of the interactions among the inputs and how they determine the output
• an evaluation of the reliability of the PHM outcomes can be performed
• some limited extrapolation can be possible, with the aim of gaining knowledge on unexplored scenarios.
Methodologies are used to gain interpretability in a model by looking at the importance of the different input features in determining the model outputs. A distinction is made between model-specific and model-agnostic methodologies for evaluating feature importance. An interesting example of the former is the “attention mechanism” for Neural Networks applied in prognostics, where importance values are assigned to specific input subsets [138,139].
As the name implies, model-agnostic feature importance evaluation methodologies can in principle be used for any model. Local approaches are used for online applications and global approaches for offline applications. Local measures focus on the contribution of features to a specific outcome instance, whereas global measures take all outcomes into account.
The Local Interpretable Model Explanation (LIME) method aims at explaining individual outputs and can be applied to any learning model [140]. Instead of training a global surrogate model, LIME focuses on training local surrogate models to explain individual model outputs. The method works by building, for each output instance of interest, a local interpretable model that approximates the original, complex model. Each model output instance is, then, explained by an “explainer model” that highlights the symptoms that are most relevant to it. With this information about the rationale behind the model, the analyst is empowered to trust the model output, or not, for her/his decisions and consequent actions.
The idea behind LIME is quite intuitive and is based on the fact that one can probe the model as many times as desired, by feeding it input data points and retrieving the corresponding outputs, with the goal of understanding why the learning model gave a certain output. The LIME tests are local sensitivity tests performed so as to explore what happens to the output when the inputs are locally varied by small perturbations. By so doing, a new dataset is generated, consisting of perturbed input samples and the corresponding model outputs. For example, the new samples can be created by perturbing each feature individually, drawing from a normal distribution with mean and standard deviation taken from the feature values. On this new dataset, LIME builds and trains the interpretable explainer model, which is weighted by the proximity of the sampled instances to the instance of interest. The interpretable model should give a good approximation of the original model outputs locally, but it does not have to be a good global approximation of the original model itself. Mathematically, the interpretable explainer model for instance x is the (simple) model g (e.g. a linear regression model) that results as the solution of the optimization problem minimizing the loss function L (e.g. the mean squared error), which measures how close the explanation output of g is to the output of the original model f (e.g. a neural network), while the model complexity Ω(g) is kept low (e.g. as few features as possible):
explanation(x) = argmin_{g∈G} L(f, g, π_x) + Ω(g)    (1)
where G is the family of possible explainer models, for example all possible linear regression models, and the proximity measure π_x defines how large the neighborhood around instance x considered for the explanation is. In practice, LIME only optimizes the loss part, and the user controls the model complexity through Ω(g), e.g. by selecting, via forward and backward feature selection methods, the maximum number of features that the linear regression model may use. The procedure for locally interpreting the complex original model is, then:
(i) select the instance of interest x for which an explanation of the original complex model outcome f(x) is needed
(ii) perturb the input data and get the original model output values for these new data samples
(iii) weigh the new samples according to their proximity to the instance of interest
(iv) train a weighted, interpretable model on the new dataset generated in (ii)
(v) explain the local output of the interpretable model g.
Fig. 6. Illustration of the ESN-based method in fault prognostics w.r.t. the challenge of missing data, i.e. event-based measurements [134]. The input neurons of the ESN are excited to update the reservoir state when measurements are available (events are triggered), whereas the input neurons are canceled if data are missing (no events occur) and the reservoir is only updated by the reservoir state at the previous time step and the target signal, which forces the reservoir to learn from the historical degradation pattern and the target signal evolution pattern.
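The LIME procedure (i)-(v) can be sketched in a few lines. The function f below is a toy stand-in for the complex model, and the Gaussian perturbation scale and exponential proximity kernel are illustrative choices, not prescriptions of [140]:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    """Stand-in for the complex model: non-linear in x0, weakly linear in x1."""
    return np.sin(3 * X[:, 0]) + 0.1 * X[:, 1]

def lime_explain(f, x, n_samples=500, kernel_width=0.3):
    """Minimal LIME-style local surrogate, following steps (i)-(v)."""
    # (ii) perturb the input around x and query the original model
    X = x + rng.normal(scale=0.2, size=(n_samples, len(x)))
    y = f(X)
    # (iii) weigh the samples by an exponential proximity kernel
    d = np.linalg.norm(X - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # (iv) weighted least squares for the interpretable linear model g
    A = np.column_stack([np.ones(n_samples), X - x])
    Wsq = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * Wsq, y * Wsq.ravel(), rcond=None)
    return coef[1:]            # (v) local feature effects around x

x = np.array([0.0, 0.0])
effects = lime_explain(f, x)
# Near x = 0 the local slopes of f are about 3 for x0 and 0.1 for x1, so the
# surrogate attributes far more local importance to x0 than to x1.
```

The surrogate is only locally faithful: refitting at a different instance x would give different coefficients, which is exactly the intended behavior of a local explainer model.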
LIME has been applied for the interpretation of machine learning models in applications of medical diagnostics [141]. In a recent study about early Parkinson detection, LIME has been used to highlight the features determining the healthy/disease decision of a ML classifier of brain images: LIME allows highlighting the super-pixels that mostly determine the classification into healthy or disease states; experts can, then, focus on the super-pixels selected by LIME to interpret and explain the basis for the decision of the ML algorithm, and choose to accept or refuse it.
Shapley values can also be used to assess local feature importance [142]. Although they can be used to explain which feature(s) contribute most to a specific model output, Shapley values are not designed to answer the “what would happen if” questions that LIME’s local explainer models are designed for. They come from game theory and are designed to construct a fair payout scheme for the players in a game. Suppose one could look at all possible combinations of (a subset of) players in a team replaying a game and observe the resulting team score. One could, then, assign each player of the team a portion of the total payout based on its average added value across all possible subteams to which it is added when playing the game repeatedly. Such individual payout is the player’s Shapley value and gives the only payout scheme that is proven to be:
• efficient: the sum of the Shapley values of all players should sum up to the total payout
• symmetric: two players should get the same payout if they add the same value in all team combinations
• dummy-sensitive: a player should get a Shapley value of zero if it never improves a subteam’s performance when it is added
• additive: in case of a combined payout (say we add two game bonuses), the combined Shapley value of a player across the games is the sum of the individual games’ Shapley values; this criterion has no relevant analogy in the context of model interpretability.
In the “game” of our interest for PHM model interpretability, the players are models with different feature subsets and they get the same payout mechanism introduced above. The team score in this context is the performance measure of a (sub)model built on a given feature subset. The total payout is the difference between a base value, the output of the null model, and the actual output. This difference is, then, divided over all features according to their relative contributions.
Obviously, looking at all possible subsets of features is computationally prohibitive for most realistic models with many features. Instead, Shapley value approximations can be computed by sampling feature subsets.
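As an illustration, such a sampling-based approximation can be sketched as follows; it samples random feature orderings and averages each feature’s marginal contribution. The feature names and the additive “team score” function below are purely hypothetical:

```python
import random

def shapley_sampling(features, value_fn, n_permutations=200, seed=0):
    """Approximate Shapley values by sampling random feature orderings.

    value_fn(subset) returns the "team score" of a (sub)model built on
    the given set of features; value_fn(frozenset()) is the null model.
    """
    rng = random.Random(seed)
    phi = {f: 0.0 for f in features}
    for _ in range(n_permutations):
        order = list(features)
        rng.shuffle(order)
        coalition = frozenset()
        prev = value_fn(coalition)
        for f in order:
            coalition = coalition | {f}
            cur = value_fn(coalition)
            phi[f] += cur - prev  # marginal contribution of f in this ordering
            prev = cur
    return {f: total / n_permutations for f, total in phi.items()}

# Hypothetical additive "team score": each feature contributes its weight
# regardless of the subteam it joins, so the Shapley values equal the weights.
weights = {"temp": 2.0, "vib": 1.0, "press": 0.0}

def team_score(subset):
    return sum(weights[f] for f in subset)

phi = shapley_sampling(list(weights), team_score, n_permutations=50)
# Efficiency holds: the Shapley values sum to the total payout.
```

For a real PHM model, `value_fn` would retrain or re-evaluate the model on the given feature subset; the toy score here only serves to show the mechanics of the sampling.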
Other model-agnostic methodologies are based on Sensitivity Analysis (SA), which has been widely applied to models used in various areas, such as nuclear risk assessment [143], industrial bioprocessing [144] and climate change [145]. Indeed, a main application of SA is the identification of the input quantities most responsible for a given output variation [146]. Both local and global approaches to SA have been developed. Local approaches identify the critical input features as those whose variation leads to the largest variation in the output. One practical approach for such identification consists of perturbing a single input at a time with small variations around its nominal value, while keeping the others at their respective nominal values. The analysis is intrinsically local, and the resulting indication can be considered valid for characterizing the model response around the nominal values. Whether the results of the analysis can be extended to draw global conclusions on the model response over the whole input variability space depends on the model itself: if the model is linear or mildly non-linear, the extension may be possible; if the model is strongly non-linear and characterized by sharp variations, the analysis is valid only locally. Typical local techniques are those based on Taylor’s differential analysis and on one-at-a-time simulation, in which the input features are varied one at a time while the others remain set at their nominal values [146].
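A minimal sketch of the one-at-a-time local approach, using a finite difference around the nominal point (the model and input names are hypothetical):

```python
def one_at_a_time(model, nominal, delta=1e-4):
    """Local SA: perturb one input at a time around its nominal value,
    keeping the others fixed, and estimate local sensitivities by a
    finite difference."""
    base = model(nominal)
    sensitivities = {}
    for name, x0 in nominal.items():
        perturbed = dict(nominal)
        perturbed[name] = x0 + delta
        sensitivities[name] = (model(perturbed) - base) / delta
    return sensitivities

# Hypothetical degradation model: quadratic in temperature, linear in
# load, so the local indication for "temp" depends on the nominal point.
def wear_rate(x):
    return 0.5 * x["temp"] ** 2 + 0.1 * x["load"]

s = one_at_a_time(wear_rate, {"temp": 2.0, "load": 10.0})
```

As discussed above, these values characterize the model only around the chosen nominal point: re-running the analysis at a different nominal `temp` would change the indication for the quadratic term.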
In those situations (often encountered in practice) in which models are non-linear and non-monotone, the results provided by a local analysis may have limited significance. For this reason, global approaches to SA have been developed. In these approaches, the focus is directly on the uncertainty distribution of the output, which contains all the information about the variability of the model response, with no reference to any particular value of the input (unlike the local approaches, where reference is made to the nominal values).
Table 3: Fault prognostics techniques with regard to the challenge of missing data, i.e., event-based measurements.
The two principal characteristics of the global approaches are somewhat opposite to those of the local ones: 1) they account for the whole variability range of the input features (and not only for small perturbations around the nominal values); 2) they focus on the effects obtained when the other uncertain features are also varied (instead of being kept fixed at their nominal values). Many global analysis methods have been developed [146]. The high capabilities of these methods come at a very high computational cost.
Another direction for building interpretability into PHM models and algorithms is to integrate prior physical knowledge into the learning models, both to improve performance and to achieve interpretability. This is a promising approach, and different methods have been proposed in which the physical knowledge is introduced at different levels of the learning process, including in the training data and in the training algorithm [147][148][149][150].
To aid the interpretation of the model, a suite of methods also exists for visualizing the relations between input and output. The Partial Dependence Plot (PDP) shows the marginal effect that features have on the output provided by the model [151]. Intuitively, we can interpret the partial dependence as the expected target response as a function of the input features of interest. A partial dependence plot can show whether the relationship between the output and a feature is linear, monotonic or more complex. For example, when applied to a linear regression model, partial dependence plots always show a linear relationship. The computation of partial dependence plots is intuitive: the partial dependence function at a particular feature value represents the average output if we force all data to assume that value for the feature. If the feature for which the PDP is computed is not correlated with the other features, then the PDP perfectly represents how the feature influences the output on average. In this uncorrelated case, the interpretation is clear: the PDP shows how the average output changes when a given feature is changed. The interpretation is more complicated when features are correlated. PDPs are also easy to implement, and the calculations to obtain them have a causal interpretation which aids model understanding: one intervenes on a feature and measures the corresponding change in the output. By doing so, one analyzes the causal relationship between the feature and the output in the model, and the relationship is causal for the model, whose outcome is explicitly expressed as a function of the features. However, PDPs have several disadvantages. Due to the limits of human perception, the number of features in a partial dependence function must be small (usually one or two) and, thus, the features considered must be chosen among the most important ones. Some PDPs do not show the feature distribution. Omitting the distribution can be misleading, because one might overinterpret regions with almost no data. This problem is easily solved by showing a rug (indicators for data points on the x-axis) or a histogram. Also, heterogeneous effects might be hidden, because PDPs only show the average marginal effects. Suppose that, for a feature, half of the input data has a positive correlation with the output (the larger the feature value, the larger the output value) and the other half has a negative correlation (the smaller the feature value, the larger the output value): the PDP could then be a horizontal line, since the effects of the two halves of the dataset cancel each other out, and one would wrongly conclude that the feature has no effect on the output. In other words, whereas PDPs are good at showing the average effect of the target features, they can obscure a heterogeneous relationship created by interactions.
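The intervention-and-average computation described above can be sketched as follows (the model and feature names are hypothetical; a linear model is used because its PDP must come out as a straight line):

```python
def partial_dependence(model, data, feature, grid):
    """PDP: for each grid value, force every sample's `feature` to that
    value, evaluate the model, and average the outputs."""
    averages = []
    for value in grid:
        outputs = []
        for sample in data:
            forced = dict(sample)
            forced[feature] = value  # the intervention on the feature
            outputs.append(model(forced))
        averages.append(sum(outputs) / len(outputs))
    return averages

# Hypothetical linear health index: 3*vib + temp.
def health_index(x):
    return 3.0 * x["vib"] + x["temp"]

data = [{"vib": 0.1, "temp": 20.0}, {"vib": 0.4, "temp": 25.0}]
pdp = partial_dependence(health_index, data, "vib", [0.0, 0.5, 1.0])
# Evenly spaced grid values yield evenly spaced PDP values for a
# linear model, i.e. the PDP is itself linear.
```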
When interactions are present, the Individual Conditional Expectation (ICE) plot can be used to extract more insights [152]. An ICE plot shows the dependence between the output and an input feature of interest. However, unlike a PDP, which shows the average effect of the input feature, an ICE plot visualizes the dependence of the output on a feature for each sample separately, with one line per sample. Again, due to the limits of human perception, only one input feature of interest is supported by ICE plots. On the other hand, in ICE plots it might not be easy to see the average effect of the input feature of interest. Hence, it is recommended to use ICE plots alongside PDPs: they can be plotted together.
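An ICE computation differs from the PDP sketch only in that it keeps one curve per sample instead of averaging; averaging the ICE curves recovers the PDP. The toy model below (hypothetical names) is built so that the effect of the feature flips sign between samples, reproducing the cancellation problem discussed above:

```python
def ice_curves(model, data, feature, grid):
    """ICE: one output curve per sample, varying only `feature`."""
    curves = []
    for sample in data:
        row = []
        for value in grid:
            forced = dict(sample)
            forced[feature] = value
            row.append(model(forced))
        curves.append(row)
    return curves

# Heterogeneous toy model: the effect of "vib" flips sign with "mode",
# so the averaged (PDP-like) curve is flat while the ICE curves are not.
def score(x):
    return x["vib"] if x["mode"] > 0 else -x["vib"]

data = [{"vib": 0.2, "mode": 1}, {"vib": 0.2, "mode": -1}]
curves = ice_curves(score, data, "vib", [0.0, 1.0])
average = [sum(c) / len(c) for c in zip(*curves)]  # the PDP: flat at zero
```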
Finally, the assumption of independence is the biggest issue with PDPs: it is assumed that the features for which the partial dependence is computed are not correlated with the other features. One solution to this problem is Accumulated Local Effect (ALE) plots, which work with the conditional instead of the marginal distribution (Apley et al., 2020). ALE plots are a faster and unbiased alternative to PDPs: based on the conditional distribution of the features, they calculate differences in outputs instead of averages, so they still work when features are correlated and are faster to compute than PDPs. The interpretation of ALE plots is also clear: conditional on a given value, the relative effect on the output of changing the feature value can be read from the ALE plot. Even though ALE plots are not biased in the case of correlated features, interpretation remains difficult when features are strongly correlated: if they have a very strong correlation, it only makes sense to analyze the effect of changing both features together, not in isolation. This disadvantage is not specific to ALE plots, but a general problem of strongly correlated features. Table 4 summarizes the investigated approaches for interpreting PHM models.
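A simplified first-order ALE sketch (hypothetical model; the usual mean-centering of the final curve is omitted for brevity): the feature range is split into bins, and only the samples whose feature value falls in a bin contribute to that bin, which is how ALE conditions on the feature distribution instead of averaging over all data:

```python
def ale_1d(model, data, feature, edges):
    """First-order ALE (sketch): within each bin, average the output
    change obtained by moving each local sample from the lower to the
    upper bin edge, then accumulate the bin averages along the grid."""
    bin_effects = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        diffs = []
        for sample in data:
            if lo <= sample[feature] <= hi:  # sample is local to this bin
                low, high = dict(sample), dict(sample)
                low[feature], high[feature] = lo, hi
                diffs.append(model(high) - model(low))
        bin_effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
    ale, acc = [0.0], 0.0
    for effect in bin_effects:
        acc += effect
        ale.append(acc)  # accumulated effect at each grid edge
    return ale

# Toy linear model: the ALE curve of a linear model is itself linear.
def linear(x):
    return 2.0 * x["f"]

ale = ale_1d(linear, [{"f": 0.25}, {"f": 0.75}], "f", [0.0, 0.5, 1.0])
```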
Applications of PHM methods for condition-based and predictive maintenance rely on the exchange and elaboration of data. The models and algorithms used are technological elements of larger socio-human-technical systems that must be engineered with safety and security in mind. They are increasingly used in support of high-value decision-making processes in various industries, where the wrong decision may result in serious consequences. The underlying models and algorithms are largely unable to discern between malicious input and benign anomalous data. On the contrary, they should be capable of discerning maliciously-introduced data from benign “Black Swan” events. In particular, the learning models and algorithms should reject training data with negative impacts on results. Otherwise, learning models will always be susceptible to gaming by attackers. The specific danger is that an attacker will attempt to exploit the adaptive aspect of a learning model to cause it to fail and produce errors: if the model misidentifies a hostile input as benign, the hostile input is permitted through the security barrier; if it misidentifies a benign input as hostile, the good input is rejected [153]. The adversarial opponent has a powerful weapon: the ability to design training data that cause the learning model to produce rules that misidentify inputs. To avoid this, the models and algorithms used for PHM must have built-in forensic capabilities [154]. These should enable a form of intrusion detection, allowing engineers to determine the exact point in time at which an output was given by the model, what input data influenced it and whether or not that data was tampered with.
Uncertainty is intrinsically present in the PHM tasks of detection, diagnostics and prognostics, and may adversely affect their outcomes, leading to an imprecise assessment of the state and prediction of the behavior of such systems, which could in turn lead to wrongly informed system health management decisions with possibly costly, if not catastrophic, consequences. For practical deployment, it is necessary to be able to estimate the uncertainty and confidence in the outcomes of detection, diagnostics and prognostics activities, in order to quantify the risk associated with PHM decision-making on the operation of engineering systems. Yet, in spite of the recognition of the importance of uncertainty in PHM [155], work is still needed to concretely address the impact of uncertainty on the different PHM tasks and to effectively manage it.
The challenge comes from the fact that there are different sources of uncertainty affecting PHM, whose interactions are not fully understood; it is, thus, difficult to systematically account for them in the PHM tasks. While some sources are internal to the SSC, others are external, and all must be accounted for in the different activities of PHM. There is aleatory uncertainty in the physical behavior of the SSC, and epistemic uncertainty in its model (whether developed from sensor data, from physics, or from a hybrid combination of both) and in the associated parameters. As mentioned earlier, there is uncertainty in the sensor measurements and in their processing tools. For the prognostic task of PHM, there is also uncertainty on the future SSC operation profile and state evolution.
Given the relevance of uncertainty in the PHM tasks, it becomes necessary to develop systematic frameworks for accounting for such uncertainty in practical applications, in order to enable the robust verification and validation of the solutions developed, with respect to the requirements for their use in decision-making and their contribution to the risk involved in such decisions. Such frameworks must enable the systematic identification, representation, quantification and propagation of the different sources of uncertainty, so that any PHM outcome is provided together with its uncertainty, which needs to be considered for robust decision-making [156].
Focusing specifically on data-driven methods for PHM, the challenge of quantifying the uncertainty in PHM outcomes has rarely been addressed, and mostly with ensemble approaches, which can become computationally burdensome and are highly dependent on how the individual models are developed and how their outcomes are aggregated [134,[157][158][159][160][161]. Recently, Bayesian neural networks and variational inference have been used in PHM to account for uncertainty [162,163]. Also, the combination of neural networks and Gaussian processes is being considered as a promising direction for providing PHM outcomes equipped with the needed estimates of the associated uncertainty [164].
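The ensemble idea mentioned above can be sketched minimally: several models (e.g., trained on different bootstrap resamples) each predict the remaining useful life (RUL), the mean serves as the point estimate, and the spread of the predictions as a simple uncertainty proxy. The member models below are hypothetical linear predictors:

```python
import statistics

def ensemble_rul(models, features):
    """Aggregate an ensemble's RUL predictions: the mean is the point
    estimate, the standard deviation a simple uncertainty proxy."""
    predictions = [m(features) for m in models]
    return statistics.mean(predictions), statistics.stdev(predictions)

# Hypothetical ensemble members, e.g. trained on different bootstrap
# resamples of run-to-failure data; they disagree on the intercept.
members = [
    lambda x: 100.0 - 2.0 * x["hours"],
    lambda x: 97.0 - 2.0 * x["hours"],
    lambda x: 103.0 - 2.0 * x["hours"],
]
mean_rul, std_rul = ensemble_rul(members, {"hours": 10.0})
# Wider disagreement between members -> larger std -> lower confidence.
```

This also illustrates the dependence noted in the text: the resulting uncertainty estimate is entirely determined by how the members were built and how their outputs are aggregated.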
PHM has become a fashionable area of research and development, due to its promises of enabling condition-based and predictive maintenance, which can be game-changers for the production performance, reliability and safety of industrial businesses. Consequently, many academic works have been and are being developed, and several applications have been attempted, with more or less significant degrees of success. These have been facilitated by the availability of numerous and large data sets, of affordable computational hardware to train the models, and of freely available software to implement the models in a reliable and relatively straightforward manner. Yet, considerable work still needs to be done to increase the significance of PHM impacts on industry, due to a number of theoretical and practical issues that still require an effective solution. These issues come from different perspectives, related to the physics of the problem itself, the nature and type of the data, and the requirements on the solutions. As for the physics of the problem, it is undeniable that the SSC degradation processes encountered in practice are most of the time quite complex and dependent on a large number of parameters and mechanisms, which are dynamic, highly non-linear and not completely known.
But much of the problem comes from the data and the extraction of informative content for the fault detection, diagnostics and prognostics tasks of PHM. Managing and treating the big condition-monitoring data collected by the sensors, composed of a large variety of heterogeneous signals, is not an easy task, and the data are often anomalous, scarce, incomplete and unlabeled. Furthermore, they are collected under changing operational and environmental conditions during the life of the SSC.
For the effective extraction of informative content from data, Deep Learning (DL) has undoubtedly contributed a great leap by incorporating feature engineering into the learning of the models, enabling the automatic processing of big and heterogeneous condition-monitoring data and the extraction of features relevant for the application. Encouraging results have already been obtained in fault detection and diagnostics, whereas prognostics still remains a challenge for DL.
Others of the above challenges are being addressed by sophisticated advancements, which then need to be effectively deployed in practice. These include: Recurrent Neural Networks for PHM applications; the transformation of signals into images so as to exploit the powerful methods of image processing (including the novel Convolutional Neural Networks (CNNs)), particularly for fault detection and diagnostics; signal reconstruction methods (including Auto-Encoders) of unsupervised and semi-supervised learning for fault detection and diagnostics, and for degradation state prediction, to cope with the frequent practical cases of unlabeled data; and Optimal Transport (OT) methods and unsupervised adaptation techniques to cope with the problem that the test data distribution may differ from (or evolve into a distribution different from) that of the training data, with the consequence that the trained data-learned model may perform poorly on the test data.
An issue of particular relevance for the prognostic task of PHM is the proper treatment of the uncertainty in the data and, then, in the models. Several sources of uncertainty exist in practice, as the models are inevitably only representations of the real relationships between input and output, the measured data are inevitably noisy due to measurement errors, and the future operational and environmental profiles of the SSCs are not known. All these uncertainties affect the predictions of the future degradation and failure of the SSCs. With respect to the uncertainty issue in PHM, frameworks are being developed for a probabilistic treatment of the RUL of SSCs: given the potentially costly and catastrophic consequences associated with the decisions that are made based on the PHM outcomes, it is absolutely necessary to provide estimates of the uncertainty alongside the predictions. For example, frameworks are being developed based on Bayesian neural networks and deep Gaussian processes.
An issue arising with the data-driven models and algorithms used for PHM is that they lack interpretability, which reduces trust in their use, particularly for safety-critical applications. This leads to the need to find ways of improving transparency and interpretability, for a clearer understanding of what the model predicts and how, and finally for building trust in its use. Methods for injecting physical information into learning models, post-hoc sensitivity approaches and visualization techniques are being studied to provide interpretability from different perspectives, including explaining the learned input-output relation representations, explaining the individual model outputs, and explaining the way the output is produced by the model.
Strong concerns are also arising with respect to the security of PHM models in real applications, in particular safety-critical ones. PHM is increasingly used to support maintenance decision-making processes in various high-value/high-risk industries, where the wrong decision may result in serious consequences. The methods and models used perform exchange and elaboration of data, and must therefore be secured so as to reject training data with negative impacts on the results of decision-making.
PHM已成为一个流行的研究和开发领域,因为它承诺能够实现基于状态的和预测性维护,这对于工业企业的生产性能、可靠性和安全性来说可能具有决定性的影响。因此,很多学术词汇已经被开发和使用,并尝试了一些应用,其成功程度或多或少地存在差异。这是由于具备大量和广泛数据集的可用性,使用价格合理的计算硬件来训练模型,以及使用易于可靠和相对简单的软件来实现模型。然而,尽管已经有大量的方法可供选择,并且正在不断开发中,但仍然需要进行大量的工作,以增加PHM对工业的影响力,因为仍然存在一些理论和实践问题需要得到有效解决。这些问题来自不同的角度,涉及问题本身的物理性质、数据的性质和类型以及解决方案的要求。关于问题本身的物理性质,毫无疑问,实际上SSCs的退化过程大多数情况下相当复杂,取决于大量的参数和机制,这些参数和机制是动态和高度非线性的,并且并不完全已知。但是,问题的很大一部分源于数据及其对于故障检测、诊断和预测任务的信息提取。管理和处理传感器收集到的大量异质信号组成的大型条件监测数据并不容易,这些数据经常是异常的、稀缺的、不完整的和无标签的。此外,在SSC的使用寿命期间,这些数据是在不断变化的操作和环境条件下收集的。当然,为了从数据中提取有信息量的内容的有效性,深度学习(DL)无疑为自动处理大型和异质条件监测数据以及提取与应用相关的特征的模型学习过程中的特征工程做出了巨大贡献。在故障检测和诊断方面已经取得了令人鼓舞的结果,而对于预测性维护,DL仍然是一个挑战。其他上述挑战正在通过先进的方法得到解决,需要对其进行有效部署。这些方法包括用于PHM应用的循环神经网络,以及将其转换为图像,以便利用图像处理的强大方法(包括新颖的卷积神经网络(CNN)),尤其是用于故障检测和诊断;以及用于故障检测和诊断的无监督和半监督学习的信号重构方法(包括自动编码器),以及降级状态预测,以处理数据无标签的频繁实际情况;使用最优输运(OT)方法和无监督适应技术来应对测试数据分布可能不同于训练数据分布(或随时间演变为不同分布)的问题,这导致训练得到的数据模型在测试数据上的性能可能较差。对于PHM的预测任务来说,一个特别重要的问题是对数据和模型中的不确定性进行正确处理。实际上存在多种不确定性,因为模型不可避免地只是输入和输出之间真实关系的表示,由于测量误差,测得的数据不可避免地是有噪声的,并且SSCs的未来运行和环境特性是未知的。所有这些不确定性都会影响对SSCs未来退化和失效的预测。关于PHM中不确定性问题,正在开发框架来对SSCs的剩余寿命进行概率处理:鉴于基于PHM结果做出的决策可能带来昂贵和灾难性的后果,提供预测的同时,也有必要提供不确定性的估计。例如,正在通过贝叶斯神经网络和深层高斯过程来开发框架。在使用于PHM的数据驱动模型和算法方面,一个令人担忧的问题是它们缺乏可解释性,这降低了对它们在特别是安全关键应用中使用的信任。这导致需要找到方法来提高透明度和可解释性,以更清楚地理解模型的预测方式,最终建立对其使用的信任。正在研究将物理信息注入学习模型、事后敏感性方法和可视化技术等方法,以提供不同角度的可解释性,包括解释学习到的输入-输出关系表示、解释个别模型输出以及解释模型产生输出的方式。同时,对于实际应用中的PHM模型的安全性,特别是对于安全关键型应用而言,人们也越来越关注。PHM越来越多地用于支持各种高价值/高风险行业的维护决策过程,在这些行业中,错误的决策可能导致严重后果。所使用的方法和模型执行数据的交换和处理,并且必须是安全的,以拒绝对决策结果产生负面影响的训练数据。
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.