当前位置:   article > 正文

MK趋势检验+Kendalls taub等级相关+稳健回归(Sens slope estimator等)_mk检验

mk检验

python中的Mann-Kendall单调趋势检验--及原理说明_liucheng_zimozigreat的博客-CSDN博客_mann-kendall python

前提假设:

  • 当没有趋势时,随时间获得的数据是独立同分布的。独立的假设是说数据随着时间不是连续相关的。
  • 所获得的时间序列上的数据代表了采样时的真实条件。(样本具有代表性)
  • 样本的采集、处理和测量方法提供了总体样本中的无偏且具有代表性的观测值。

pymannkendall的Python项目

什么是mann-kendall检验?

mann-kendall趋势检验(有时称为mk检验)用于分析时间序列数据的一致性增加或减少趋势(单调趋势)。这是一个非参数检验,这意味着它适用于所有分布(即数据不必满足正态性假设),但数据应该没有序列相关性。如果数据具有序列相关性,则可能在显著水平上影响(p值)。这可能会导致误解。为了克服这一问题,研究者提出了几种修正的mann-kendall检验(hamed和rao修正的mk检验、yue和wang修正的mk检验、预白化法修正的mk检验等)。季节性mann-kendall检验也被用来消除季节性的影响。

mann-kendall检验是一种强大的趋势检验,因此针对空间条件,发展了多元mk检验、区域mk检验、相关mk检验、部分mk检验等修正的mann-kendall检验。pymannkendal是非参数mann-kendall趋势分析的纯python实现,它集合了几乎所有类型的mann-kendall测试。目前,该软件包有11个mann-kendall检验和2个sen斜率估计函数。功能简介如下:

  1. 原始mann-kendall检验(原始_检验):原始mann-kendall检验是非参数检验,不考虑序列相关性或季节性影响。

  2. hamed和rao修正的mk检验(hamed和rao修正的mk检验):这个修正的mk检验由hamed和rao(1998)提出的解决序列自相关问题的方法。他们建议采用方差校正方法来改进趋势分析。用户可以通过在该函数中插入滞后数来考虑前n个显著滞后。默认情况下,它会考虑所有重要的延迟。

  3. Yue和Wang修正的MK检验(Yue-Wang_修正的检验):这也是Yue,S.,&Wang,C.Y.(2004)提出的考虑序列自相关的方差修正方法。用户还可以为计算设置所需的有效n滞后。

  4. 使用预白化方法的修正mk检验(预白化方法的修正):Yue和Wang(2002)建议在应用趋势检验之前使用预白化时间序列的检验。

  5. 使用无趋势预白化方法的修正mk试验(无趋势预白化试验):Yue和Wang(2002)也提出了在应用趋势试验之前去除趋势成分,然后对时间序列进行预白化的试验。

  6. 多变量mk检验(多变量检验):这是hirsch(1982)提出的多参数mk检验。他用这种方法进行季节性mk检验,把每个月作为一个参数。

  7. 季节性MK检验(季节性检验):对于季节性时间序列数据,Hirsch,R.M.,Slack,J.R.和Smith,R.A.(1982)提出了这个检验来计算季节性趋势。

  8. 区域mk检验(regional mk test):基于Hirsch(1982)提出的季节性mk检验,Helsel,D.R.和Frans,L.M.,(2006)建议采用区域mk检验来计算区域尺度的总体趋势。

  9. 相关多变量mk检验(相关多变量检验):hipel(1994)提出的参数相关的多变量mk检验。

  10. 相关季节性MK检验(相关季节性检验):当时间序列与前一个或多个月/季节显著相关时,使用Hipel(1994)提出的方法。

  11. 部分mk检验(部分_检验):在实际事件中,许多因素都会影响研究的主要响应参数,从而使趋势结果产生偏差。为了克服这个问题,libiseller(2002)提出了部分mk检验。它需要两个参数作为输入,一个是响应参数,另一个是独立参数。

  12. 泰尔-森斜率估计器(sen s-slope):泰尔(1950)和森(1968)提出的估计单调趋势幅度的方法。

  13. 季节sen斜率估计量(季节sen斜率):hipel(1994)提出的当数据具有季节性影响时估计单调趋势大小的方法。

功能详细信息:

所有mann-kendall检验函数的输入参数几乎相同。这些是:

  • x:向量(列表、numpy数组或pandas系列)数据
  • α:显著性水平(默认为0.05)
  • 滞后:第一个有效滞后数(仅在hamed_rao_modification_test和yue_wang_modification_test中可用)
  • 周期:季节性周期。月数据为12,周数据为52(仅在季节性测试中可用)

所有mann-kendall测试都返回一个命名元组,其中包含:

  • 趋势:显示趋势(增加、减少或无趋势)
  • h:真(如果趋势存在)或假(如果趋势不存在)
  • p:显著性检验的p值
  • z:标准化测试统计
  • :肯德尔陶
  • s:Mann Kendal的分数
  • 方差s:方差s
  • 斜率:sen的斜率

sen的斜率函数需要数据向量。季节性sen的斜率也有可选的输入周期,默认值为12。两个sen的slope函数都只返回slope值。

 Python pymannkendall包_程序模块 - PyPI - Python中文网

  1. """
  2. Created on 05 March 2018
  3. Update on 26 July 2019
  4. @author: Md. Manjurul Hussain Shourov
  5. version: 1.1
  6. Approach: Vectorisation
  7. Citation: Hussain et al., (2019). pyMannKendall: a python package for non parametric Mann Kendall family of trend tests.. Journal of Open Source Software, 4(39), 1556, https://doi.org/10.21105/joss.01556
  8. """
  9. from __future__ import division
  10. import numpy as np
  11. from scipy.stats import norm, rankdata
  12. from collections import namedtuple
  13. # Supporting Functions
  14. # Data Preprocessing
  15. def __preprocessing(x):
  16. x = np.asarray(x)
  17. dim = x.ndim
  18. if dim == 1:
  19. c = 1
  20. elif dim == 2:
  21. (n, c) = x.shape
  22. if c == 1:
  23. dim = 1
  24. x = x.flatten()
  25. else:
  26. print('Please check your dataset.')
  27. return x, c
  28. # Missing Values Analysis
  29. def __missing_values_analysis(x, method = 'skip'):
  30. if method.lower() == 'skip':
  31. if x.ndim == 1:
  32. x = x[~np.isnan(x)]
  33. else:
  34. x = x[~np.isnan(x).any(axis=1)]
  35. n = len(x)
  36. return x, n
  37. # ACF Calculation
  38. def __acf(x, nlags):
  39. y = x - x.mean()
  40. n = len(x)
  41. d = n * np.ones(2 * n - 1)
  42. acov = (np.correlate(y, y, 'full') / d)[n - 1:]
  43. return acov[:nlags+1]/acov[0]
  44. # vectorization approach to calculate mk score, S
  45. def __mk_score(x, n):
  46. s = 0
  47. demo = np.ones(n)
  48. for k in range(n-1):
  49. s = s + np.sum(demo[k+1:n][x[k+1:n] > x[k]]) - np.sum(demo[k+1:n][x[k+1:n] < x[k]])
  50. return s
  51. # original Mann-Kendal's variance S calculation
  52. def __variance_s(x, n):
  53. # calculate the unique data
  54. unique_x = np.unique(x)
  55. g = len(unique_x)
  56. # calculate the var(s)
  57. if n == g: # there is no tie
  58. var_s = (n*(n-1)*(2*n+5))/18
  59. else: # there are some ties in data
  60. tp = np.zeros(unique_x.shape)
  61. demo = np.ones(n)
  62. for i in range(g):
  63. tp[i] = np.sum(demo[x == unique_x[i]])
  64. var_s = (n*(n-1)*(2*n+5) - np.sum(tp*(tp-1)*(2*tp+5)))/18
  65. return var_s
  66. # standardized test statistic Z
  67. def __z_score(s, var_s):
  68. if s > 0:
  69. z = (s - 1)/np.sqrt(var_s)
  70. elif s == 0:
  71. z = 0
  72. elif s < 0:
  73. z = (s + 1)/np.sqrt(var_s)
  74. return z
  75. # calculate the p_value
  76. def __p_value(z, alpha):
  77. # two tail test
  78. p = 2*(1-norm.cdf(abs(z)))
  79. h = abs(z) > norm.ppf(1-alpha/2)
  80. if (z < 0) and h:
  81. trend = 'decreasing'
  82. elif (z > 0) and h:
  83. trend = 'increasing'
  84. else:
  85. trend = 'no trend'
  86. return p, h, trend
  87. def __R(x):
  88. n = len(x)
  89. R = []
  90. for j in range(n):
  91. i = np.arange(n)
  92. s = np.sum(np.sign(x[j] - x[i]))
  93. R.extend([(n + 1 + s)/2])
  94. return np.asarray(R)
  95. def __K(x,z):
  96. n = len(x)
  97. K = 0
  98. for i in range(n-1):
  99. j = np.arange(i,n)
  100. K = K + np.sum(np.sign((x[j] - x[i]) * (z[j] - z[i])))
  101. return K
  102. # Original Sens Estimator
  103. def __sens_estimator(x):
  104. idx = 0
  105. n = len(x)
  106. d = np.ones(int(n*(n-1)/2))
  107. for i in range(n-1):
  108. j = np.arange(i+1,n)
  109. d[idx : idx + len(j)] = (x[j] - x[i]) / (j - i)
  110. idx = idx + len(j)
  111. return d
  112. def sens_slope(x):
  113. """
  114. This method proposed by Theil (1950) and Sen (1968) to estimate the magnitude of the monotonic trend.
  115. Input:
  116. x: a one dimensional vector (list, numpy array or pandas series) data
  117. Output:
  118. slope: sen's slope
  119. Examples
  120. --------
  121. >>> x = np.random.rand(120)
  122. >>> slope = sens_slope(x)
  123. """
  124. x, c = __preprocessing(x)
  125. x, n = __missing_values_analysis(x, method = 'skip')
  126. return np.median(__sens_estimator(x))
  127. def seasonal_sens_slope(x, period=12):
  128. """
  129. This method proposed by Hipel (1994) to estimate the magnitude of the monotonic trend, when data has seasonal effects.
  130. Input:
  131. x: a vector (list, numpy array or pandas series) data
  132. period: seasonal cycle. For monthly data it is 12, weekly data it is 52 (12 is the default)
  133. Output:
  134. slope: sen's slope
  135. Examples
  136. --------
  137. >>> x = np.random.rand(120)
  138. >>> slope = seasonal_sens_slope(x, 12)
  139. """
  140. x, c = __preprocessing(x)
  141. n = len(x)
  142. if x.ndim == 1:
  143. if np.mod(n,period) != 0:
  144. x = np.pad(x,(0,period - np.mod(n,period)), 'constant', constant_values=(np.nan,))
  145. x = x.reshape(int(len(x)/period),period)
  146. x, n = __missing_values_analysis(x, method = 'skip')
  147. d = []
  148. for i in range(period):
  149. d.extend(__sens_estimator(x[:,i]))
  150. return np.median(np.asarray(d))
  151. def original_test(x, alpha = 0.05):
  152. """
  153. This function checks the Mann-Kendall (MK) test (Mann 1945, Kendall 1975, Gilbert 1987).
  154. Input:
  155. x: a vector (list, numpy array or pandas series) data
  156. alpha: significance level (0.05 default)
  157. Output:
  158. trend: tells the trend (increasing, decreasing or no trend)
  159. h: True (if trend is present) or False (if trend is absence)
  160. p: p-value of the significance test
  161. z: normalized test statistics
  162. Tau: Kendall Tau
  163. s: Mann-Kendal's score
  164. var_s: Variance S
  165. slope: sen's slope
  166. Examples
  167. --------
  168. >>> import pymannkendall as mk
  169. >>> x = np.random.rand(1000)
  170. >>> trend,h,p,z,tau,s,var_s,slope = mk.original_test(x,0.05)
  171. """
  172. res = namedtuple('Mann_Kendall_Test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  173. x, c = __preprocessing(x)
  174. x, n = __missing_values_analysis(x, method = 'skip')
  175. s = __mk_score(x, n)
  176. var_s = __variance_s(x, n)
  177. Tau = s/(.5*n*(n-1))
  178. z = __z_score(s, var_s)
  179. p, h, trend = __p_value(z, alpha)
  180. slope = sens_slope(x)
  181. return res(trend, h, p, z, Tau, s, var_s, slope)
  182. def hamed_rao_modification_test(x, alpha = 0.05, lag=None):
  183. """
  184. This function checks the Modified Mann-Kendall (MK) test using Hamed and Rao (1998) method.
  185. Input:
  186. x: a vector (list, numpy array or pandas series) data
  187. alpha: significance level (0.05 default)
  188. lag: No. of First Significant Lags (default None, You can use 3 for considering first 3 lags, which also proposed by Hamed and Rao(1998))
  189. Output:
  190. trend: tells the trend (increasing, decreasing or no trend)
  191. h: True (if trend is present) or False (if trend is absence)
  192. p: p-value of the significance test
  193. z: normalized test statistics
  194. Tau: Kendall Tau
  195. s: Mann-Kendal's score
  196. var_s: Variance S
  197. slope: sen's slope
  198. Examples
  199. --------
  200. >>> import pymannkendall as mk
  201. >>> x = np.random.rand(1000)
  202. >>> trend,h,p,z,tau,s,var_s,slope = mk.hamed_rao_modification_test(x,0.05)
  203. """
  204. res = namedtuple('Modified_Mann_Kendall_Test_Hamed_Rao_Approach', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  205. x, c = __preprocessing(x)
  206. x, n = __missing_values_analysis(x, method = 'skip')
  207. s = __mk_score(x, n)
  208. var_s = __variance_s(x, n)
  209. Tau = s/(.5*n*(n-1))
  210. # Hamed and Rao (1998) variance correction
  211. if lag is None:
  212. lag = n
  213. else:
  214. lag = lag + 1
  215. # detrending
  216. # x_detrend = x - np.multiply(range(1,n+1), np.median(x))
  217. slope = sens_slope(x)
  218. x_detrend = x - np.arange(1,n+1) * slope
  219. I = rankdata(x_detrend)
  220. # account for autocorrelation
  221. acf_1 = __acf(I, nlags=lag-1)
  222. interval = norm.ppf(1 - alpha / 2) / np.sqrt(n)
  223. upper_bound = 0 + interval
  224. lower_bound = 0 - interval
  225. sni = 0
  226. for i in range(1,lag):
  227. if (acf_1[i] <= upper_bound and acf_1[i] >= lower_bound):
  228. sni = sni
  229. else:
  230. sni += (n-i) * (n-i-1) * (n-i-2) * acf_1[i]
  231. n_ns = 1 + (2 / (n * (n-1) * (n-2))) * sni
  232. var_s = var_s * n_ns
  233. z = __z_score(s, var_s)
  234. p, h, trend = __p_value(z, alpha)
  235. return res(trend, h, p, z, Tau, s, var_s, slope)
  236. def yue_wang_modification_test(x, alpha = 0.05, lag=None):
  237. """
  238. Input: This function checks the Modified Mann-Kendall (MK) test using Yue and Wang (2004) method.
  239. x: a vector (list, numpy array or pandas series) data
  240. alpha: significance level (0.05 default)
  241. lag: No. of First Significant Lags (default None, You can use 1 for considering first 1 lags, which also proposed by Yue and Wang (2004))
  242. Output:
  243. trend: tells the trend (increasing, decreasing or no trend)
  244. h: True (if trend is present) or False (if trend is absence)
  245. p: p-value of the significance test
  246. z: normalized test statistics
  247. Tau: Kendall Tau
  248. s: Mann-Kendal's score
  249. var_s: Variance S
  250. slope: sen's slope
  251. Examples
  252. --------
  253. >>> import pymannkendall as mk
  254. >>> x = np.random.rand(1000)
  255. >>> trend,h,p,z,tau,s,var_s,slope = mk.yue_wang_modification_test(x,0.05)
  256. """
  257. res = namedtuple('Modified_Mann_Kendall_Test_Yue_Wang_Approach', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  258. x, c = __preprocessing(x)
  259. x, n = __missing_values_analysis(x, method = 'skip')
  260. s = __mk_score(x, n)
  261. var_s = __variance_s(x, n)
  262. Tau = s/(.5*n*(n-1))
  263. # Yue and Wang (2004) variance correction
  264. if lag is None:
  265. lag = n
  266. else:
  267. lag = lag + 1
  268. # detrending
  269. slope = sens_slope(x)
  270. x_detrend = x - np.arange(1,n+1) * slope
  271. # account for autocorrelation
  272. acf_1 = __acf(x_detrend, nlags=lag-1)
  273. idx = np.arange(1,lag)
  274. sni = np.sum((1 - idx/n) * acf_1[idx])
  275. n_ns = 1 + 2 * sni
  276. var_s = var_s * n_ns
  277. z = __z_score(s, var_s)
  278. p, h, trend = __p_value(z, alpha)
  279. return res(trend, h, p, z, Tau, s, var_s, slope)
  280. def pre_whitening_modification_test(x, alpha = 0.05):
  281. """
  282. This function checks the Modified Mann-Kendall (MK) test using Pre-Whitening method proposed by Yue and Wang (2002).
  283. Input:
  284. x: a vector (list, numpy array or pandas series) data
  285. alpha: significance level (0.05 default)
  286. Output:
  287. trend: tells the trend (increasing, decreasing or no trend)
  288. h: True (if trend is present) or False (if trend is absence)
  289. p: p-value of the significance test
  290. z: normalized test statistics
  291. s: Mann-Kendal's score
  292. var_s: Variance S
  293. slope: sen's slope
  294. Examples
  295. --------
  296. >>> import pymannkendall as mk
  297. >>> x = np.random.rand(1000)
  298. >>> trend,h,p,z,tau,s,var_s,slope = mk.pre_whitening_modification_test(x,0.05)
  299. """
  300. res = namedtuple('Modified_Mann_Kendall_Test_PreWhitening_Approach', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  301. x, c = __preprocessing(x)
  302. x, n = __missing_values_analysis(x, method = 'skip')
  303. # PreWhitening
  304. acf_1 = __acf(x, nlags=1)[1]
  305. a = range(0, n-1)
  306. b = range(1, n)
  307. x = x[b] - x[a]*acf_1
  308. n = len(x)
  309. s = __mk_score(x, n)
  310. var_s = __variance_s(x, n)
  311. Tau = s/(.5*n*(n-1))
  312. z = __z_score(s, var_s)
  313. p, h, trend = __p_value(z, alpha)
  314. slope = sens_slope(x)
  315. return res(trend, h, p, z, Tau, s, var_s, slope)
  316. def trend_free_pre_whitening_modification_test(x, alpha = 0.05):
  317. """
  318. This function checks the Modified Mann-Kendall (MK) test using the trend-free Pre-Whitening method proposed by Yue and Wang (2002).
  319. Input:
  320. x: a vector (list, numpy array or pandas series) data
  321. alpha: significance level (0.05 default)
  322. Output:
  323. trend: tells the trend (increasing, decreasing or no trend)
  324. h: True (if trend is present) or False (if trend is absence)
  325. p: p-value of the significance test
  326. z: normalized test statistics
  327. s: Mann-Kendal's score
  328. var_s: Variance S
  329. slope: sen's slope
  330. Examples
  331. --------
  332. >>> import pymannkendall as mk
  333. >>> x = np.random.rand(1000)
  334. >>> trend,h,p,z,tau,s,var_s,slope = mk.trend_free_pre_whitening_modification_test(x,0.05)
  335. """
  336. res = namedtuple('Modified_Mann_Kendall_Test_Trend_Free_PreWhitening_Approach', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  337. x, c = __preprocessing(x)
  338. x, n = __missing_values_analysis(x, method = 'skip')
  339. # detrending
  340. slope = sens_slope(x)
  341. x_detrend = x - np.arange(1,n+1) * slope
  342. # PreWhitening
  343. acf_1 = __acf(x_detrend, nlags=1)[1]
  344. a = range(0, n-1)
  345. b = range(1, n)
  346. x = x_detrend[b] - x_detrend[a]*acf_1
  347. n = len(x)
  348. x = x + np.arange(1,n+1) * slope
  349. s = __mk_score(x, n)
  350. var_s = __variance_s(x, n)
  351. Tau = s/(.5*n*(n-1))
  352. z = __z_score(s, var_s)
  353. p, h, trend = __p_value(z, alpha)
  354. slope = sens_slope(x)
  355. return res(trend, h, p, z, Tau, s, var_s, slope)
  356. def multivariate_test(x, alpha = 0.05):
  357. """
  358. This function checks the Multivariate Mann-Kendall (MK) test, which is originally proposed by R. M. Hirsch and J. R. Slack (1984) for the seasonal Mann-Kendall test. Later this method also used Helsel (2006) for Regional Mann-Kendall test.
  359. Input:
  360. x: a matrix of data
  361. alpha: significance level (0.05 default)
  362. Output:
  363. trend: tells the trend (increasing, decreasing or no trend)
  364. h: True (if trend is present) or False (if trend is absence)
  365. p: p-value of the significance test
  366. z: normalized test statistics
  367. Tau: Kendall Tau
  368. s: Mann-Kendal's score
  369. var_s: Variance S
  370. slope: sen's slope
  371. Examples
  372. --------
  373. >>> import pymannkendall as mk
  374. >>> x = np.random.rand(1000)
  375. >>> trend,h,p,z,tau,s,var_s,slope = mk.multivariate_test(x,0.05)
  376. """
  377. res = namedtuple('Multivariate_Mann_Kendall_Test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  378. s = 0
  379. var_s = 0
  380. denom = 0
  381. x, c = __preprocessing(x)
  382. # x, n = __missing_values_analysis(x, method = 'skip') # It makes all column at the same size
  383. for i in range(c):
  384. if c == 1:
  385. x_new, n = __missing_values_analysis(x, method = 'skip') # It makes all column at deferent size
  386. else:
  387. x_new, n = __missing_values_analysis(x[:,i], method = 'skip') # It makes all column at deferent size
  388. s = s + __mk_score(x_new, n)
  389. var_s = var_s + __variance_s(x_new, n)
  390. denom = denom + (.5*n*(n-1))
  391. Tau = s/denom
  392. z = __z_score(s, var_s)
  393. p, h, trend = __p_value(z, alpha)
  394. slope = seasonal_sens_slope(x, period = c)
  395. return res(trend, h, p, z, Tau, s, var_s, slope)
  396. def seasonal_test(x, period = 12, alpha = 0.05):
  397. """
  398. This function checks the Seasonal Mann-Kendall (MK) test (Hirsch, R. M., Slack, J. R. 1984).
  399. Input:
  400. x: a vector of data
  401. period: seasonal cycle. For monthly data it is 12, weekly data it is 52 (12 is the default)
  402. alpha: significance level (0.05 is the default)
  403. Output:
  404. trend: tells the trend (increasing, decreasing or no trend)
  405. h: True (if trend is present) or False (if trend is absence)
  406. p: p-value of the significance test
  407. z: normalized test statistics
  408. Tau: Kendall Tau
  409. s: Mann-Kendal's score
  410. var_s: Variance S
  411. slope: sen's slope
  412. Examples
  413. --------
  414. >>> import pymannkendall as mk
  415. >>> x = np.random.rand(1000)
  416. >>> trend,h,p,z,tau,s,var_s,slope = mk.seasonal_test(x,0.05)
  417. """
  418. res = namedtuple('Seasonal_Mann_Kendall_Test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  419. x, c = __preprocessing(x)
  420. n = len(x)
  421. if x.ndim == 1:
  422. if np.mod(n,period) != 0:
  423. x = np.pad(x,(0,period - np.mod(n,period)), 'constant', constant_values=(np.nan,))
  424. x = x.reshape(int(len(x)/period),period)
  425. trend, h, p, z, Tau, s, var_s, slope = multivariate_test(x, alpha = alpha)
  426. return res(trend, h, p, z, Tau, s, var_s, slope)
  427. def regional_test(x, alpha = 0.05):
  428. """
  429. This function checks the Regional Mann-Kendall (MK) test (Helsel 2006).
  430. Input:
  431. x: a matrix of data
  432. alpha: significance level (0.05 default)
  433. Output:
  434. trend: tells the trend (increasing, decreasing or no trend)
  435. h: True (if trend is present) or False (if trend is absence)
  436. p: p-value of the significance test
  437. z: normalized test statistics
  438. Tau: Kendall Tau
  439. s: Mann-Kendal's score
  440. var_s: Variance S
  441. slope: sen's slope
  442. Examples
  443. --------
  444. >>> import pymannkendall as mk
  445. >>> x = np.random.rand(1000)
  446. >>> trend,h,p,z,tau,s,var_s,slope = mk.regional_test(x,0.05)
  447. """
  448. res = namedtuple('Regional_Mann_Kendall_Test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  449. trend, h, p, z, Tau, s, var_s, slope = multivariate_test(x)
  450. return res(trend, h, p, z, Tau, s, var_s, slope)
  451. def correlated_multivariate_test(x, alpha = 0.05):
  452. """
  453. This function checks the Correlated Multivariate Mann-Kendall (MK) test (Libiseller and Grimvall (2002)).
  454. Input:
  455. x: a matrix of data
  456. alpha: significance level (0.05 default)
  457. Output:
  458. trend: tells the trend (increasing, decreasing or no trend)
  459. h: True (if trend is present) or False (if trend is absence)
  460. p: p-value of the significance test
  461. z: normalized test statistics
  462. Tau: Kendall Tau
  463. s: Mann-Kendal's score
  464. var_s: Variance S
  465. slope: sen's slope
  466. Examples
  467. --------
  468. >>> import pymannkendall as mk
  469. >>> x = np.random.rand(1000)
  470. >>> trend,h,p,z,tau,s,var_s,slope = mk.correlated_multivariate_test(x,0.05)
  471. """
  472. res = namedtuple('Correlated_Multivariate_Mann_Kendall_Test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  473. x, c = __preprocessing(x)
  474. x, n = __missing_values_analysis(x, method = 'skip')
  475. s = 0
  476. denom = 0
  477. for i in range(c):
  478. s = s + __mk_score(x[:,i], n)
  479. denom = denom + (.5*n*(n-1))
  480. Tau = s/denom
  481. Gamma = np.ones([c,c])
  482. for i in range(1,c):
  483. for j in range(i):
  484. k = __K(x[:,i], x[:,j])
  485. ri = __R(x[:,i])
  486. rj = __R(x[:,j])
  487. Gamma[i,j] = (k + 4 * np.sum(ri * rj) - n*(n+1)**2)/3
  488. Gamma[j,i] = Gamma[i,j]
  489. for i in range(c):
  490. k = __K(x[:,i], x[:,i])
  491. ri = __R(x[:,i])
  492. rj = __R(x[:,i])
  493. Gamma[i,i] = (k + 4 * np.sum(ri * rj) - n*(n+1)**2)/3
  494. var_s = np.sum(Gamma)
  495. z = s / np.sqrt(var_s)
  496. p, h, trend = __p_value(z, alpha)
  497. slope = seasonal_sens_slope(x, period=c)
  498. return res(trend, h, p, z, Tau, s, var_s, slope)
  499. def correlated_seasonal_test(x, period = 12 ,alpha = 0.05):
  500. """
  501. This function checks the Correlated Seasonal Mann-Kendall (MK) test (Hipel [1994] ).
  502. Input:
  503. x: a matrix of data
  504. period: seasonal cycle. For monthly data it is 12, weekly data it is 52 (12 is default)
  505. alpha: significance level (0.05 default)
  506. Output:
  507. trend: tells the trend (increasing, decreasing or no trend)
  508. h: True (if trend is present) or False (if trend is absence)
  509. p: p-value of the significance test
  510. z: normalized test statistics
  511. Tau: Kendall Tau
  512. s: Mann-Kendal's score
  513. var_s: Variance S
  514. slope: sen's slope
  515. Examples
  516. --------
  517. >>> import pymannkendall as mk
  518. >>> x = np.random.rand(1000)
  519. >>> trend,h,p,z,tau,s,var_s,slope = mk.correlated_seasonal_test(x,0.05)
  520. """
  521. res = namedtuple('Correlated_Seasonal_Mann_Kendall_test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  522. x, c = __preprocessing(x)
  523. n = len(x)
  524. if x.ndim == 1:
  525. if np.mod(n,period) != 0:
  526. x = np.pad(x,(0,period - np.mod(n,period)), 'constant', constant_values=(np.nan,))
  527. x = x.reshape(int(len(x)/period),period)
  528. trend, h, p, z, Tau, s, var_s, slope = correlated_multivariate_test(x)
  529. return res(trend, h, p, z, Tau, s, var_s, slope)
  530. def partial_test(x, alpha = 0.05):
  531. """
  532. This function checks the Partial Mann-Kendall (MK) test (Libiseller and Grimvall (2002)).
  533. Input:
  534. x: a matrix with 2 columns
  535. alpha: significance level (0.05 default)
  536. Output:
  537. trend: tells the trend (increasing, decreasing or no trend)
  538. h: True (if trend is present) or False (if trend is absence)
  539. p: p-value of the significance test
  540. z: normalized test statistics
  541. Tau: Kendall Tau
  542. s: Mann-Kendal's score
  543. var_s: Variance S
  544. slope: sen's slope
  545. Examples
  546. --------
  547. >>> import pymannkendall as mk
  548. >>> x = np.random.rand(1000)
  549. >>> trend,h,p,z,tau,s,var_s,slope = mk.partial_test(x,0.05)
  550. """
  551. res = namedtuple('Partial_Mann_Kendall_Test', ['trend', 'h', 'p', 'z', 'Tau', 's', 'var_s', 'slope'])
  552. x_old, c = __preprocessing(x)
  553. x_old, n = __missing_values_analysis(x_old, method = 'skip')
  554. if c != 2:
  555. raise ValueError('Partial Mann Kendall test required two parameters/columns. Here column no ' + str(c) + ' is not equal to 2.')
  556. x = x_old[:,0]
  557. y = x_old[:,1]
  558. x_score = __mk_score(x, n)
  559. y_score = __mk_score(y, n)
  560. k = __K(x, y)
  561. rx = __R(x)
  562. ry = __R(y)
  563. sigma = (k + 4 * np.sum(rx * ry) - n*(n+1)**2)/3
  564. rho = sigma / (n*(n-1)*(2*n+5)/18)
  565. s = x_score - rho * y_score
  566. var_s = (1 - rho**2) * (n*(n-1)*(2*n+5))/18
  567. Tau = x_score/(.5*n*(n-1))
  568. z = s / np.sqrt(var_s)
  569. p, h, trend = __p_value(z, alpha)
  570. slope = sens_slope(x)
  571. return res(trend, h, p, z, Tau, s, var_s, slope)

批量逐点检验(未处理自相关)

scipy.stats.kendalltau() 函数

Kendall's tau-b(肯德尔)等级相关系数:用于反映分类变量相关性的指标,适用于两个分类变量(时间—水文要素)均为有序分类的情况。对相关的有序变量进行非参数相关检验;取值范围在-1-1之间,此检验适合于正方形表格;

scipy.stats.kendalltau — SciPy v0.19.1 Reference Guide

  1. from scipy import stats
  2. import pandas as pd
  3. import numpy as np
  4. data = pd.read_csv(r"C:\Users\Leon\Desktop\Pre.csv")
  5. #print (data)
  6. ###38行*994列(38年994个cell)
  7. x = range(38)
  8. print (x)
  9. y = np.zeros((0))
  10. for j in range(994):
  11. b = stats.kendalltau(x,data.values[:,j]) ##MK检验,结果包含两个参数:tau, p_value
  12. y = np.append(y, b, axis=0)
  13. print(b)
  14. print(type(y))
  15. #np.savetxt("C:/Users/Leon/Desktop/P.txt",y) ##保存ndarray类型数据

稳健回归(Robustness regression)


最小二乘法的弊端


之前文章里的关于线性回归的模型,都是基于最小二乘法来实现的。但是,当数据样本点出现很多的异常点(outliers),这些异常点对回归模型的影响会非常的大,传统的基于最小二乘的回归方法将不适用。

比如下图中所示,数据中存在一个异常点,如果不剔除改点,适用OLS方法来做回归的话,那么就会得到途中红色的那条线;如果将这个异常点剔除掉的话,那么就可以得到图中蓝色的那条线。显然,蓝色的线比红色的线对数据有更强的解释性,这就是OLS在做回归分析时候的弊端。

当然,可以考虑在做回归分析之前,对数据做预处理,剔除掉那些异常点。但是,在实际的数据中,存在两个问题:

异常点并不能很好的确定,并没有一个很好的标准用于确定哪些点是异常点
即便确定了异常点,但这些被确定为异常的点,真的是错误的数据吗?很有可能这看似异常的点,就是原始模型的数据,如果是这样的话,那么这些异常的点就会带有大量的原始模型的信息,剔除之后就会丢失大量的信息。
再比如下面这幅图,其中红色的都是异常点,但是很难从数据中剔除出去。 

稳健回归


稳健回归(Robust regression),就是当最小二乘法遇到上述的,数据样本点存在异常点的时候,用于代替最小二乘法的一个算法。当然,稳健回归还可以用于异常点检测,或者是找出那些对模型影响最大的样本点。

Breakdown point


关于稳健回归,有一个名词需要做解释:Breakdown point,这个名词我并不想翻译,我也没找到一个很好的中文翻译。对于一个估计器而言,原始数据中混入了脏数据,那么,Breakdown point 指的就是在这个估计器给出错误的模型估计之前,脏数据最大的比例 αα,Breakdown point 代表的是一个估计器对脏数据的最大容忍度。


这个均值估计器的Breakdown point 为0,因为使任意一个xixi变成足够大的脏数据之后,上面估计出来的均值,就不再正确了。

毫无疑问,Breakdown point越大,估计器就越稳健。

Breakdown point 是不可能达到 50% 的,因为如果总体样本中超过一半的数据是脏数据了,那么从统计上来说,就无法将样本中的隐藏分布和脏数据的分布给区分开来。

本文主要介绍两种稳健回归模型:RANSAC(RANdom SAmple Consensus 随机采样一致性)和Theil-Sen estimator。

RANSAC随机采样一致性算法


RANSAC算法的输入是一组观测数据(往往含有较大的噪声或无效点),它是一种重采样技术(resampling technique),通过估计模型参数所需的最小的样本点数,来得到备选模型集合,然后在不断的对集合进行扩充,其算法步骤为:


RANSAC算法是从输入样本集合的内点的随机子集中学习模型。

RANSAC算法是一个非确定性算法(non-deterministic algorithm),这个算法只能得以一定的概率得到一个还不错的结果,在基本模型已定的情况下,结果的好坏程度主要取决于算法最大的迭代次数。

RANSAC算法在线性和非线性回归中都得到了广泛的应用,而其最典型也是最成功的应用,莫过于在图像处理中处理图像拼接问题,这部分在Opencv中有相关的实现。

从总体上来讲,RANSAC算法将输入样本分成了两个大的子集:内点(inliers)和外点(outliers)。其中内点的数据分布会受到噪声的影响;而外点主要来自于错误的测量手段或者是对数据错误的假设。而RANSAC算法最终的结果是基于算法所确定的内点集合得到的。

下面这份代码是RANSAC的适用实例:
 

  1. # -*- coding: utf-8 -*-
  2. """
  3. author : duanxxnj@163.com
  4. time : 2016-07-07-15-36
  5. """
  6. import numpy as np
  7. import time
  8. from sklearn import linear_model,datasets
  9. import matplotlib.pyplot as plt
  10. # 产生数据样本点集合
  11. # 样本点的特征X维度为1维,输出y的维度也为1维
  12. # 输出是在输入的基础上加入了高斯噪声N(0,10)
  13. # 产生的样本点数目为1000个
  14. n_samples = 1000
  15. X, y, coef = datasets.make_regression(n_samples=n_samples,
  16. n_features=1,
  17. n_informative=1,
  18. noise=10,
  19. coef=True,
  20. random_state=0)
  21. # 将上面产生的样本点中的前50个设为异常点(外点)
  22. # 即:让前50个点偏离原来的位置,模拟错误的测量带来的误差
  23. n_outliers = 50
  24. np.random.seed(int(time.time()) % 100)
  25. X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
  26. y[:n_outliers] = -3 + 0.5 * np.random.normal(size=n_outliers)
  27. # 用普通线性模型拟合X,y
  28. model = linear_model.LinearRegression()
  29. model.fit(X, y)
  30. # 使用RANSAC算法拟合X,y
  31. model_ransac = linear_model.RANSACRegressor(linear_model.LinearRegression())
  32. model_ransac.fit(X, y)
  33. inlier_mask = model_ransac.inlier_mask_
  34. outlier_mask = np.logical_not(inlier_mask)
  35. # 使用一般回归模型和RANSAC算法分别对测试数据做预测
  36. line_X = np.arange(-5, 5)
  37. line_y = model.predict(line_X[:, np.newaxis])
  38. line_y_ransac = model_ransac.predict(line_X[:, np.newaxis])
  39. print "真实数据参数:", coef
  40. print "线性回归模型参数:", model.coef_
  41. print "RANSAC算法参数: ", model_ransac.estimator_.coef_
  42. plt.plot(X[inlier_mask], y[inlier_mask], '.g', label='Inliers')
  43. plt.plot(X[outlier_mask], y[outlier_mask], '.r', label='Outliers')
  44. plt.plot(line_X, line_y, '-k', label='Linear Regression')
  45. plt.plot(line_X, line_y_ransac, '-b', label="RANSAC Regression")
  46. plt.legend(loc='upper left')
  47. plt.show()

figure_1-1.png-46.8kB

运行结果为:

  1. 真实数据参数: 82.1903908408
  2. 线性回归模型参数: [ 55.19291974]
  3. RANSAC算法参数: [ 82.08533159]

Theil-Sen Regression 泰尔森回归

Theil-Sen回归是一个参数中值估计器,它适用泛化中值,对多维数据进行估计,因此其对多维的异常点(outliers 外点)有很强的稳健性。


在实践中发现,随着数据特征维度的提升,Theil-Sen回归的效果不断的下降,在高维数据中,Theil-Sen回归的效果有时甚至还不如OLS(最小二乘)。

在之间的文章《线性回归》中讨论过,OLS方法是渐进无偏的,Theil-Sen方法在渐进无偏方面和OLS性能相似。和OLS方法不同的是,Theil-Sen方法是一种非参数方法,其对数据的潜在分布不做任何的假设。Theil-Sen方法是一种基于中值的估计其,所以其对异常点有更强的稳健性。

在单变量回归问题中,Theil-Sen方法的Breakdown point为29.3%,也就是说,Theil-Sen方法可以容忍29.3%的数据是outliers。

  1. # -*- coding: utf-8 -*-
  2. """
  3. @author : duanxxnj@163.com
  4. @time ;2016-07-08_08-50
  5. Theil-Sen 回归
  6. 本例生成一个数据集,然后在该数据集上测试Theil-Sen回归
  7. """
  8. print __doc__
  9. import time
  10. import numpy as np
  11. import matplotlib.pyplot as plt
  12. from sklearn.linear_model import LinearRegression, TheilSenRegressor,\
  13. RANSACRegressor
  14. estimators = [('OLS', LinearRegression()),
  15. ('Theil-Sen', TheilSenRegressor())]
  16. # 异常值仅仅出现在y轴
  17. np.random.seed((int)(time.time() % 100))
  18. n_samples = 200
  19. # 线性模型的函数形式为: y = 3 * x + N(2, .1 ** 2)
  20. x = np.random.randn(n_samples)
  21. w = 3.
  22. c = 2.
  23. noise = c + 0.1 * np.random.randn(n_samples)
  24. y = w * x + noise
  25. # 加入10%的异常值,最后20个值称为异常值
  26. y[-20:] += -20 * x[-20:]
  27. X = x[:, np.newaxis]
  28. plt.plot(X, y, 'k+', mew=2, ms=8)
  29. line_x = np.array([-3, 3])
  30. for name, estimator in estimators:
  31. t0 = time.time()
  32. estimator.fit(X, y)
  33. elapsed_time = time.time() - t0
  34. y_pred = estimator.predict(line_x.reshape(2, 1))
  35. plt.plot(line_x, y_pred, label='%s (fit time: %.2fs)'
  36. %(name, elapsed_time))
  37. plt.axis('tight')
  38. plt.legend(loc='upper left')
  39. plt.show()

figure_1-1.png-32.2kB

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/花生_TL007/article/detail/592327
推荐阅读
相关标签
  

闽ICP备14008679号