当前位置:   article > 正文

Timsort 介绍(listsort.txt 翻译)_临时插槽hd无字

临时插槽hd无字

Cpython 3.9 listsort.txt 原文
Cpython 3.9 listobject.c 源码

译文

简介(Intro)

This describes an adaptive, stable, natural mergesort, modestly called timsort (hey, I earned it ). It has supernatural performance on many kinds of partially ordered arrays (less than lg(N!) comparisons needed, and as few as N-1), yet as fast as Python’s previous highly tuned samplesort hybrid on random arrays.

这里描述了一种自适应的、稳定的、自然的 mergesort,谦虚地称之为 timsort (嘿,这是我应得的 <眨眼>)。它对部分有序的数组具有超高的性能(所需比较次数小于 lg(N!),并且最低能到 N-1),对于随机数组,它的速度与 Python 之前精心调优过的 samplesort 混合排序 一样快。(Python 版 SampleSort 算法

In a nutshell, the main routine marches over the array once, left to right, alternately identifying the next run, then merging it into the previous runs “intelligently”. Everything else is complication for speed, and some hard-won measure of memory efficiency.

简而言之,主流程是遍历数组一次,从左到右,依次识别出下一个 run,然后“智能地”将其合并到之前的那些 run 上。除此之外的一切都是为了提升速度的处理,以及对内存效率的一些来之不易的度量。

与 Python 样本混合排序的比较(Comparison with Python’s Samplesort Hybrid)

  • timsort can require a temp array containing as many as N//2 pointers, which means as many as 2*N extra bytes on 32-bit boxes. It can be expected to require a temp array this large when sorting random data; on data with significant structure, it may get away without using any extra heap memory. This appears to be the strongest argument against it, but compared to the size of an object, 2 temp bytes worst-case (also expected-case for random data) doesn’t scare me much.
  • timsort 可能需要一个包含多达 N//2 个指针的临时数组,这意味着在 32 位的机器中多达 2*N 个额外字节。在对随机数据进行排序时,可以预期需要这么大的临时数组;对于具有明显结构的数据,它可能不需要使用任何额外的堆内存。这似乎是反对它的最强有力的论据,但与对象的大小相比,2 个临时字节的最坏情况(也是随机数据的预期情况)并不会吓到我。

It turns out that Perl is moving to a stable mergesort, and the code for that appears always to require a temp array with room for at least N pointers. (Note that I wouldn’t want to do that even if space weren’t an issue; I believe its efforts at memory frugality also save timsort significant pointer-copying costs, and allow it to have a smaller working set.)

结果证明,Perl 正在转向一个稳定的 mergesort,其代码似乎总是需要一个临时数组,其中至少有 N 个指针的空间。(请注意,即使空间不是问题,我也不想这样做;我相信在内存节约方面的努力也为 timsort 节省了大量的指针复制成本,并允许它拥有更小的工作集。)

  • Across about four hours of generating random arrays, and sorting them under both methods, samplesort required about 1.5% more comparisons (the program is at the end of this file).
  • 生成随机数组并在两种方法下对它们进行排序大约花费了我 4个小时的时间,samplesort 所需要的比较次数大约多出了1.5% (程序在这个文件的末尾)。
  • In real life, this may be faster or slower on random arrays than samplesort was, depending on platform quirks. Since it does fewer comparisons on average, it can be expected to do better the more expensive a comparison function is. OTOH, it does more data movement (pointer copying) than samplesort, and that may negate its small comparison advantage (depending on platform quirks) unless comparison is very expensive.
  • 在现实生活中,对于随机数组,它可能比 samplesort 的速度更快或更慢,取决于平台的特殊性。由于平均而言它的比较次数更少,因此可以预期,比较函数越昂贵,它的表现就会越好。另一方面,它比 samplesort 的数据移动(指针复制)次数更多 ,这可能会抵消其在比较次数上的小幅优势(取决于平台的特殊性),除非比较非常昂贵。
  • On arrays with many kinds of pre-existing order, this blows samplesort out of the water. It’s significantly faster than samplesort even on some cases samplesort was special-casing the snot out of. I believe that lists very often do have exploitable partial order in real life, and this is the strongest argument in favor of timsort (indeed, samplesort’s special cases for extreme partial order are appreciated by real users, and timsort goes much deeper than those, in particular naturally covering every case where someone has suggested “and it would be cool if list.sort() had a special case for this too … and for that …”).
  • 对于那些本身已存在部分有序性的数组,它能彻底击败 samplesort 。即使在那些被 samplesort 划为特殊案例的情况下,它也明显快于 samplesort。我相信现实生活的数组中经常有可利用的部分顺序,这是支持 timsort 的最强有力的论据(实际上,现实用户喜欢 samplesort 把极端部分有序数据作为特殊案例来处理,而 timsort 要比那些情况深入得多,特别是自然地涵盖了每一个有人建议的情况,“如果 list.sort () 也有这样的特殊案例… … 那就太酷了… …”)。
  • Here are exact comparison counts across all the tests in sortperf.py, when run with arguments “15 20 1”.

下面是 sortperf.py 中所有测试的精确比较次数,使用参数"15 20 1"运行。

  Column Key:
      *sort: random data 随机数据
      \sort: descending data 降序数据
      /sort: ascending data 升序数据
      3sort: ascending, then 3 random exchanges 升序且有3次随机交换的数据
      +sort: ascending, then 10 random at the end 升序且末尾10个是随机数据
      %sort: ascending, then randomly replace 1% of elements w/ random values 升序且随机把 1% 的元素替换成随机值
      ~sort: many duplicates 有很多重复值的数据
      =sort: all equal 全部等值的数据
      !sort: worst case scenario 最差情形
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10

First the trivial cases, trivial for samplesort because it special-cased them, and trivial for timsort because it naturally works on runs. Within an “n” block, the first line gives the # of compares done by samplesort, the second line by timsort, and the third line is the percentage by which the samplesort count exceeds the timsort count:

首先是不重要的案例,对于 samplesort 来说不重要,是因为 samplesort 将它们划分为特殊案例,对于 timsort 来说不重要,是因为它能构成自然 run 。在下表的一个 “n” 块中,第一行给出了 samplesort 的比较次数,第二行给出了 timsort 的比较次数,第三行给出了 samplesort 的比较次数多于 timsort 的百分比:

      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25

The algorithms are effectively identical in these cases, except that timsort does one less compare in \sort.

在这些案例中,两个算法的表现相同,除了 timsort 在降序数据案例中(\sort)少了一次比较。

Now for the more interesting cases. Where lg(x) is the logarithm of x to the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for the best any comparison-based sorting algorithm can do on average (across all permutations). When a method gets significantly below that, it’s either astronomically lucky, or is finding exploitable structure in the data.

现在来看看更有趣的案例。 lg(n!) 是任何基于比较的排序算法平均能做到的最好的信息理论极限(所有排列),其中 lg(x) 是以 2 为底的 x 的对数(例如,lg(8) = 3)。如果一种方法远远低于这个值,那么它要么是天文数字般的幸运,要么是在数据中发现了可利用的结构。

      n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
-------  -------   ------   -------  -------  ------  -------  --------
  32768   444255   453096   453614    32908   452871   130491    469141 old
                   448885    33016    33007    50426   182083     65534 new
                    0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new

  65536   954037   972699   981940    65686   973104   260029   1004607
                   962991    65821    65808   101667   364341    131070
                    1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%

 131072  2039137  2101881  2091491   131232  2092894   554790   2161379
                  2057533   131410   131361   206193   728871    262142
                    2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%

 262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
                  4377402   262437   262459   416347  1457945    524286
                    1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%

 524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
                  9278734   524580   524633   837947  2916107   1048574
                   1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%

1048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
                 19606028  1048958  1048941  1694896  5832445   2097150
                    1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25

Discussion of cases:

*sort: There’s no structure in random data to exploit, so the theoretical limit is lg(n!). Both methods get close to that, and timsort is hugging it (indeed, in a marginal sense, it’s a spectacular improvement – there’s only about 1% left before hitting the wall, and timsort knows darned well it’s doing compares that won’t pay on random data – but so does the samplesort hybrid). For contrast, Hoare’s original random-pivot quicksort does about 39% more compares than the limit, and the median-of-3 variant about 19% more.

案例讨论:

*sort:在随机数据中没有结构可以利用,因此理论上的极限是 lg(n!)。两种算法的表现都接近这个理论极限,并且 timsort 更接近一点(的确,在边际意义上,这是一个惊人的进步 —— 与理论极限值只差大约 1%,timsort 非常清楚它正在对不会有任何回报的随机数据做比较 —— 但是 samplesort 混合排序也清楚)。相比之下,Hoare 最初提出的随机选取基准值版本的 quicksort 比理论极限值多出了39% ,而三数取中版本的 quicksort 多出了 19% 。

3sort, %sort, and !sort: No contest; there’s structure in this data, but not of the specific kinds samplesort special-cases. Note that structure in !sort wasn’t put there on purpose – it was crafted as a worst case for a previous quicksort implementation. That timsort nails it came as a surprise to me (although it’s obvious in retrospect).

3sort%sort!sort:毫无疑问,这些数据中有结构,但没有那些 samplesort 的特殊案例。注意 !sort中的结构不是故意放在那里的 —— 它是作为之前的 quicksort 实现的最坏情况而设计的。timsort 对此情况的优异表现让我感到惊讶(尽管回想起来很明显)。

+sort: samplesort special-cases this data, and does a few less compares than timsort. However, timsort runs this case significantly faster on all boxes we have timings for, because timsort is in the business of merging runs efficiently, while samplesort does much more data movement in this (for it) special case.

+sort: samplesort 把这类数据划分为特殊案例,并且比较次数比 timsort 还少一些。然而,在所有我们计时的机器上,timsort 对此类数据的排序都明显更快,因为 timsort 能高效合并 run,而 samplesort 对这类数据需要进行更多次的数据移动。

~sort: samplesort’s special cases for large masses of equal elements are extremely effective on ~sort’s specific data pattern, and timsort just isn’t going to get close to that, despite that it’s clearly getting a great deal of benefit out of the duplicates (the # of compares is much less than lg(n!)). ~sort has a perfectly uniform distribution of just 4 distinct values, and as the distribution gets more skewed, samplesort’s equal-element gimmicks become less effective, while timsort’s adaptive strategies find more to exploit; in a database supplied by Kevin Altis, a sort on its highly skewed “on which stock exchange does this company’s stock trade?” field ran over twice as fast under timsort.

~sort: samplesort 为大量相等元素设计的特殊案例,对 ~sort 模式的数据来说是非常高效的,尽管 timssort 显然从重复数据中获益颇多(比较次数比 lg(n!) 少很多) ,但它的表现甚至都不接近 samplesort 。~sort 模式的数据是由4个不同的值完全均匀分布构成的,并且随着分布变得更加倾斜,samplesort 对等值元素的处理技巧就会变得不那么有效了,而 timsort 的适应策略能发现更多可利用的东西;在 Kevin Altis 提供的数据库中,对一个高度倾斜的有大量重复值的数据进行排序时(“这家公司的股票交易是在哪家证券交易所进行的?”),timsort 快了两倍。

However, despite that timsort does many more comparisons on ~sort, and that on several platforms ~sort runs highly significantly slower under timsort, on other platforms ~sort runs highly significantly faster under timsort. No other kind of data has shown this wild x-platform behavior, and we don’t have an explanation for it. The only thing I can think of that could transform what “should be” highly significant slowdowns into highly significant speedups on some boxes are catastrophic cache effects in samplesort.

然而,尽管 timsort 在 ~sort 上进行的比较次数更多,而且在一些平台上 ~sort 在 timsort 下运行更慢,但在其他平台上 ~sort 在 timsort 下运行更快。其他类型的数据都没有表现出这种疯狂的平台差异,对此我们也没法解释。我唯一能想到的可以将“本应”高度显著减速的在某些机器上转化为高度显著加速的,是 samplesort 中的灾难性缓存效应。

But timsort “should be” slower than samplesort on ~sort, so it’s hard to count that it isn’t on some boxes as a strike against it .

但是对于~sort类型的数据,timsort “本应”比 samplesort 更慢,所以很难把不在某些机器上运行就表现不佳算作对它的打击 <眨眼> 。

  • Here’s the highwater mark for the number of heap-based temp slots (4 bytes each on this box) needed by each test, again with arguments “15 20 1”:
  • 下面是每个测试所需的基于堆的临时插槽数(每个插槽4字节在此机器上)的最高值,同样使用参数 “15 20 1”:
   2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
  32768  16384     0     0   6256      0  10821  12288      0  16383
  65536  32766     0     0  21652      0  31276  24576      0  32767
 131072  65534     0     0  17258      0  58112  49152      0  65535
 262144 131072     0     0  35660      0 123561  98304      0 131071
 524288 262142     0     0  31302      0 212057 196608      0 262143
1048576 524286     0     0 312438      0 484942 393216      0 524287
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

Discussion: The tests that end up doing (close to) perfectly balanced merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort also ends up doing balanced merges, but systematically benefits a lot from the preliminary pre-merge searches described under “Merge Memory” later. %sort approaches having a balanced merge at the end because the random selection of elements to replace is expected to produce an out-of-order element near the midpoint. \sort, /sort, =sort are the trivial one-run cases, needing no merging at all. +sort ends up having one very long run and one very short, and so gets all the temp space it needs from the small temparray member of the MergeState struct (note that the same would be true if the new random elements were prefixed to the sorted list instead, but not if they appeared “in the middle”). 3sort approaches N//3 temp slots twice, but the run lengths that remain after 3 random exchanges clearly has very high variance.

讨论:那些最后以(近似)完美平衡的归并收尾的测试(*sort!sort)需要所有 N//2 个临时插槽(或几乎所有)。~sort 也是最后以平衡归并收尾,但系统地受益于合并前的初步查找(详见后续章节"Merge Memory"的描述)。%sort 方法在结尾处有一个平衡的合并,因为随机选择要替换的元素会在中点附近产生一个无序的元素。 \sort, /sort, =sort 是只有一个 run 的不重要案例,根本不需要合并。+sort 最终有一个非常长的 run 和一个非常短的 run,因此从 MergeState 结构的小 temparray 成员获得所需的所有临时空间(请注意,如果新的随机元素位于排序数组的首部位置,结果也是相同的,但如果它们出现在“中间”位置,则结果不同)。3sort方法两次逼近 N//3 个临时插槽,但 3 次随机交换后剩余的 run 长度方差明显很大。

A detailed description of timsort follows.

以下是对 timsort 流程的详细描述。

什么是 run(Runs)

count_run() returns the # of elements in the next run. A run is either “ascending”, which means non-decreasing:

count_run() 返回下一个 run 中元素的个数。一个 run 要么是"升序",意思是非递减:

a0 <= a1 <= a2 <= ...
  • 1

or “descending”, which means strictly decreasing:

要么是"降序",意思是严格递减:

a0 > a1 > a2 > ...
  • 1

Note that a run is always at least 2 long, unless we start at the array’s last element.

请注意,一个 run 的长度总是至少为 2,除非我们从数组的最后一个元素开始。

The definition of descending is strict, because the main routine reverses a descending run in-place, transforming a descending run into an ascending run. Reversal is done via the obvious fast “swap elements starting at each end, and converge at the middle” method, and that can violate stability if the slice contains any equal elements. Using a strict definition of descending ensures that a descending run contains distinct elements.

降序的定义是严格的,因为主程序会就地反转降序的 run,将降序的 run 转换成升序。反转是通过显而易见的快速方法来完成的(“从两端开始交换元素,并收敛到中间”),如果降序 run 中包含任何等值元素,可能会破坏稳定性。使用严格定义的降序可以确保降序 run 只包含不等值的元素。

If an array is random, it’s very unlikely we’ll see long runs. If a natural run contains less than minrun elements (see next section), the main loop artificially boosts it to minrun elements, via a stable binary insertion sort applied to the right number of array elements following the short natural run. In a random array, all runs are likely to be minrun long as a result. This has two primary good effects:

如果一个数组是随机的,我们不太可能看到很长的 run。如果一个自然的 run 包含的元素个数少于 minrun(详见下一节),那么主循环将通过对这个自然 run 右侧的元素调用稳定的二分插入排序算法,人为地将它扩充到 minrun 个元素。在一个随机数组中,很可能所有 run 的长度都等于 minrun。这有两个主要的好处:

  1. Random data strongly tends then toward perfectly balanced (both runs have the same length) merges, which is the most efficient way to proceed when data is random.
  1. 随机数据强烈地趋向于完全平衡的归并(两个 run 具有相同的长度),这是处理随机数据的最有效的方法。
  1. Because runs are never very short, the rest of the code doesn’t make heroic efforts to shave a few cycles off per-merge overheads. For example, reasonable use of function calls is made, rather than trying to inline everything. Since there are no more than N/minrun runs to begin with, a few “extra” function calls per merge is barely measurable.
  1. 因为 run 从来都不会很短,所以代码的其余部分不会做出艰苦卓绝的努力去削减一些“每次合并都有的开销”(per-merge overheads)。例如,合理地使用函数调用,而不是尝试对一切都使用 inline 关键字。由于一开始就没有超过 N/minrun 个的 run,因此每次合并时使用一些“额外的”函数调用也可以忽略不计。

计算 run 的最小长度(Computing minrun)

If N < 64, minrun is N. IOW, binary insertion sort is used for the whole array then; it’s hard to beat that given the overheads of trying something fancier (see note BINSORT).

如果 N < 64,minrun 就等于 N。 换句话说,使用二分插入排序对整个数组排序;考虑到其他花哨方法的额外开销,很难有其他算法能超过二分插入排序(参见 BINSORT 注释)。

When N is a power of 2, testing on random data showed that minrun values of 16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost in binary insertion sort clearly hurt, and at 8 the increase in the number of function calls clearly hurt. Picking some power of 2 is important here, so that the merges end up perfectly balanced (see next section). We pick 32 as a good value in the sweet range; picking a value at the low end allows the adaptive gimmicks more opportunity to exploit shorter natural runs.

当 N 是 2 的幂次方时,对随机数据的测试表明,以 16、32、64、128 作为 minrun 的效果都同样好。当选择 256 时,二分插入排序的数据移动代价会明显损害性能,当选择 8 时,函数调用次数的增加会明显损害性能。只能在 2 的幂次方中选择 minrun 是很重要的,这样归并才能达到完美的平衡(详见下一节)。在上述范围中我们选择了 32 ;选择一个较小的值将使得自适应策略能有更多机会利用更短的自然 run。

Because sortperf.py only tries powers of 2, it took a long time to notice that 32 isn’t a good choice for the general case! Consider N=2112:

由于 sortperf.py 只尝试了 2 的幂次方,所以花了很长时间才注意到 32 对于一般情况不是一个好的选择!考虑 N=2112 的情况:

>>> divmod(2112, 32)
(66, 0)
>>>
  • 1
  • 2
  • 3

If the data is randomly ordered, we’re very likely to end up with 66 runs each of length 32. The first 64 of these trigger a sequence of perfectly balanced merges (see next section), leaving runs of lengths 2048 and 64 to merge at the end. The adaptive gimmicks can do that with fewer than 2048+64 compares, but it’s still more compares than necessary, and-- mergesort’s bugaboo relative to samplesort --a lot more data movement (O(N) copies just to get 64 elements into place).

如果数据是随机排列的,我们很可能最终得到 66 个长度都为 32 的 run。其中的前 64 个触发了一系列完全平衡的合并(详见下一节) ,最后留下长度为 2048 和 64 的两个 run 进行合并。自适应策略可以在少于 2048 + 64 的比较次数下完成合并,但仍然多于必要的次数,而且—— mergesort 相对于 samplesort 的错误之处 —— 过多次的数据移动(O(N) 次复制只是为了把 64 个元素移动到正确位置)。

If we take minrun=33 in this case, then we’re very likely to end up with 64 runs each of length 33, and then all merges are perfectly balanced. Better!

在这种情况下,如果我们取 minrun=33,那么我们很有可能得到 64 个长度为 33 的 run,然后所有的合并都是完全平衡的。好多了!

What we want to avoid is picking minrun such that in

我们想要避免的是 minrun 导致

q, r = divmod(N, minrun)
  • 1

q is a power of 2 and r>0 (then the last merge only gets r elements into place, and r < minrun is small compared to N), or q a little larger than a power of 2 regardless of r (then we’ve got a case similar to “2112”, again leaving too little work for the last merge to do).

q 是 2 的幂次方,r > 0(最后一次合并只让 r 个元素就位,而且 r < minrun 比 N 小) ,或者 q 比 2 的幂次方大一点,而不管 r 是多少(然后我们得到了一个类似于"2112"的情况,为最后一次合并留下了太少的工作)。

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a power of 2, or if that isn’t possible, is close to, but strictly less than, a power of 2. This is easier to do than it may sound: take the first 6 bits of N, and add 1 if any of the remaining bits are set. In fact, that rule covers every case in this section, including small N and exact powers of 2; merge_compute_minrun() is a deceptively simple function.

取而代之的是,我们在 range(32,65) 中选择一个值作为 minrun,使得 N/minrun 正好是 2 的幂次方,或者如果不可能的话,选择一个 minrun 使得 N/minrun 接近但严格小于 2 的幂次方。这比听起来容易做到:取 N 的前 6 位,如果剩下的任何一位已经设置好(除去前 6 位的剩余位中有任何一位等于 1 ),则加1。实际上,这个规则涵盖了本节中的所有情况,包括小 N 和精确等于 2 的幂次方; merge_compute_minrun() 是一个看似简单的函数。

合并模式(The Merge Pattern)

In order to exploit regularities in the data, we’re merging on natural run lengths, and they can become wildly unbalanced. That’s a Good Thing for this sort! It means we have to find a way to manage an assortment of potentially very different run lengths, though.

为了利用数据中的规律性,我们按自然 run 的长度进行合并,它们可能变得极不平衡。对 timsort 来说,这是件好事!这意味着我们必须找到一种方法来管理各种可能非常不同的长度。

Stability constrains permissible merging patterns. For example, if we have 3 consecutive runs of lengths

稳定性限制了允许的合并模式。例如,如果我们有 3 个连续的 run,其长度如下:

A:10000  B:20000  C:10000
  • 1

we dare not merge A with C first, because if A, B and C happen to contain a common element, it would get out of order wrt its occurrence(s) in B. The merging must be done as (A+B)+C or A+(B+C) instead.

我们不敢首先将 A 和 C 合并,因为如果 A、 B 和 C 恰好包含一个公共元素,那么它在 B 中的出现顺序就乱了。合并必须以 (A+B)+CA+(B+C) 的形式进行。

So merging is always done on two consecutive runs at a time, and in-place, although this may require some temp memory (more on that later).

因此,一次合并始终是对两个相邻的 run 进行的,而且是就地进行的,尽管这可能需要一些临时内存(稍后将详细介绍)。

When a run is identified, its base address and length are pushed on a stack in the MergeState struct. merge_collapse() is then called to see whether it should merge it with preceding run(s). We would like to delay merging as long as possible in order to exploit patterns that may come up later, but we like even more to do merging as soon as possible to exploit that the run just found is still high in the memory hierarchy. We also can’t delay merging “too long” because it consumes memory to remember the runs that are still unmerged, and the stack has a fixed size.

当识别出一个 run 时,它的开始指针和长度将被推送到 MergeState 结构中的堆栈中。然后调用 merge_collapse() ,以便看看它是否应该与已有的 run 合并。我们希望尽可能地延迟合并,以便利用以后可能出现的模式,但我们更希望尽快地进行合并,以利用刚刚发现的 run 所处内存等级仍然很高这一情况。我们也不能延迟合并“太久”,因为它需要消耗内存来记住仍未合并的 run,而且堆栈有固定的大小。

What turned out to be a good compromise maintains two invariants on the stack entries, where A, B and C are the lengths of the three rightmost not-yet merged slices:

一个很好的折衷方案是,在堆栈上保持两个不变量,其中 A、B 和 C 是最右边三个尚未合并的 run 的长度:

1.  A > B+C
2.  B > C
  • 1
  • 2

Note that, by induction, #2 implies the lengths of pending runs form a decreasing sequence. #1 implies that, reading the lengths right to left, the pending-run lengths grow at least as fast as the Fibonacci numbers. Therefore the stack can never grow larger than about log_base_phi(N) entries, where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice for very large arrays.

注意,通过归纳,不变量 2 意味着挂起的 run 的长度形成一个递减序列。不变量 1 意味着,从右向左读取长度,挂起 run 的长度的增长速度至少与斐波那契数字一样快。因此,堆栈的大小不可能增长到超过 log_base_phi(N) ,其中 phi = (1+sqrt(5))/2 ~= 1.618。因此,一个小的堆栈槽就能满足非常大的数组。

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the freshness-in-cache reason), and the new run replaces the A,B or B,C entries; e.g., if the last 3 entries are

如果 A <= B+C,A 和 C 中较小的那个与 B 合并(出于高速缓存中的新鲜度原因,相等时选择 C),新的 run 将替换掉 A、B 或 B、C;例如,如果最后 3 个 run 的长度是:

A:30  B:20  C:10
  • 1

then B is merged with C, leaving

把 B、C 合并,则堆栈中剩下:

A:30  BC:30
  • 1

on the stack. Or if they were

如果最后 3 个 run 的长度是:

A:500  B:400:  C:1000
  • 1

then A is merged with B, leaving

把 A、B 合并,则堆栈中剩下:

AB:900  C:1000
  • 1

on the stack.

In both examples, the stack configuration after the merge still violates invariant #2, and merge_collapse() goes on to continue merging runs until both invariants are satisfied. As an extreme case, suppose we didn’t do the minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2, and 2. Nothing would get merged until the final 2 was seen, and that would trigger 7 perfectly balanced merges.

在这两个示例中,merge 后的堆栈配置仍然违反了不变量 2,merge_collapse() 会继续合并 run,直到两个不变量都得到满足。作为一个极端的例子,假设我们没有使用 minrun 的花招,自然 run 的长度是 128、64、32、16、8、4、2 和 2。在看到最后两个 run 之前,任何 run 都不会被合并,这将触发 7 次完美平衡的合并。

The thrust of these rules when they trigger merging is to balance the run lengths as closely as possible, while keeping a low bound on the number of runs we have to remember. This is maximally effective for random data, where all runs are likely to be of (artificially forced) length minrun, and then we get a sequence of perfectly balanced merges (with, perhaps, some oddballs at the end).

当它们触发合并时,这些规则的要旨是尽可能平衡 run 长度,同时保持我们必须记住的 run 数量尽量低。这对于随机数据是最有效的,在这些数据中,所有的 run 的长度可能都是(人为强制的)minrun,然后我们得到一个完全平衡的合并序列(最后可能还有一些奇怪的长度)。

OTOH, one reason this sort is so good for partly ordered data has to do with wildly unbalanced run lengths.

另一方面,这种排序方式对于部分有序的数据如此有效的原因之一,是与其严重不平衡的 run 长度有关。

合并的临时内存(Merge Memory)

Merging adjacent runs of lengths A and B in-place, and in linear time, is difficult. Theoretical constructions are known that can do it, but they’re too difficult and slow for practical use. But if we have temp memory equal to min(A, B), it’s easy.

在线性时间内就地合并长度为 A 和 B 的相邻 run 是困难的。众所周知,理论上的结构可以做到这一点,但是对于实际应用来说,这些结构太难太慢。但是如果我们使用大小为 min(A, B) 临时内存,那就简单了。

If A is smaller (function merge_lo), copy A to a temp array, leave B alone, and then we can do the obvious merge algorithm left to right, from the temp area and B, starting the stores into where A used to live. There’s always a free area in the original area comprising a number of elements equal to the number not yet merged from the temp array (trivially true at the start; proceed by induction). The only tricky bit is that if a comparison raises an exception, we have to remember to copy the remaining elements back in from the temp area, lest the array end up with duplicate entries from B. But that’s exactly the same thing we need to do if we reach the end of B first, so the exit code is pleasantly common to both the normal and error cases.

如果 A 比较小(调用 merge_lo 函数),将 A 复制到一个 temp 数组中,不要管 B,然后我们可以从左到右执行显而易见的合并算法,把 temp 数组和 B 存储到曾经存放 A 的位置。在原始数组中总是有一个空闲区域,该空闲区域由若干个元素组成,其数量等于尚未从 temp 数组合并的元素数量(在开始时为真,通过归纳继续)。唯一棘手的地方是,如果比较发生了异常,我们必须记住从 temp 数组复制剩余的元素,以免数组最终得到来自 B 的重复元素。但是,如果我们先到达 B 的末尾,我们需要做的事情是完全一样的,所以退出码在正常情况和错误情况下都是很普通的。

If B is smaller (function merge_hi, which is merge_lo’s “mirror image”), much the same, except that we need to merge right to left, copying B into a temp array and starting the stores at the right end of where B used to live.

如果 B 比较小(调用 merge_hi 函数,即 merge_lo 函数 的“镜像”) ,也差不多,只是我们需要从右向左合并,将 B 复制到一个 temp 数组中,并在曾经存放 B 的地方的右端开始存储。

A refinement: When we’re about to merge adjacent runs A and B, we first do a form of binary search (more on that later) to see where B[0] should end up in A. Elements in A preceding that point are already in their final positions, effectively shrinking the size of A. Likewise we also search to see where A[-1] should end up in B, and elements of B after that point can also be ignored. This cuts the amount of temp memory needed by the same amount.

一个改进:当我们要合并相邻的两个 run A 和 B 时,我们首先进行某种形式的二分查找(稍后再详细讨论) ,看看 B[0] 应该插入到 A 中的哪个位置,在此位置之前的 A 元素已经处于它们的最终位置,有效地缩短了 A 的大小。同样地,我们也会查找 A[-1] 应该插入到 B 中的哪个位置,在此位置之后的 B 元素也可以被忽略。这样可以减少合并所需的临时内存。

These preliminary searches may not pay off, and can be expected not to repay their cost if the data is random. But they can win huge in all of time, copying, and memory savings when they do pay, so this is one of the “per-merge overheads” mentioned above that we’re happy to endure because there is at most one very short run. It’s generally true in this algorithm that we’re willing to gamble a little to win a lot, even though the net expectation is negative for random data.

这些初步查找可能不会得到回报,如果数据是随机的,预计也不会得到回报。但当它们有所回报的时候,可以在总时间、复制和内存节省方面获益巨大,所以这就是前面提到过的“每次合并都有的开销”(per-merge overheads)中的一种,我们很高兴承受它,因为其成本最多也就是一个非常短的 run。在这个算法中,我们通常愿意赌一点点来赢得更多,即使净期望值对随机数据是负值。

合并的算法(Merge Algorithms)

merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo deals with runs where A <= B, and merge_hi where A > B. They don’t know whether the data is clustered or uniform, but a lovely thing about merging is that many kinds of clustering “reveal themselves” by how many times in a row the winning merge element comes from the same run. We’ll only discuss merge_lo here; merge_hi is exactly analogous.

merge_lo()merge_hi() 是花费大量时间的地方。merge_lo 处理 A <= B 的 run,merge_hi 处理 A > B 的 run。它们不知道数据是集中的还是均匀的,但是合并的一个好处在于,集中型数据能通过 ----- 获胜的 merge 元素连续来自同一 run 的次数 ----- “暴露自己” 。这里我们只讨论 merge_lo,而 merge_hi 完全类似。

Merging begins in the usual, obvious way, comparing the first element of A to the first of B, and moving B[0] to the merge area if it’s less than A[0], else moving A[0] to the merge area. Call that the “one pair at a time” mode. The only twist here is keeping track of how many times in a row “the winner” comes from the same run.

合并以通常的、明显的方式开始,比较 A 的第一个元素和 B 的第一个元素,如果 B[0] 小于 A[0] ,则将 B[0] 移动到合并区,否则将 A[0] 移动到合并区。我们称之为“一次一对”模式。唯一的特殊之处就是记录一下有多少次“赢家”连续来自同一个 run。

If that count reaches MIN_GALLOP, we switch to “galloping mode”. Here we search B for where A[0] belongs, and move over all the B’s before that point in one chunk to the merge area, then move A[0] to the merge area. Then we search A for where B[0] belongs, and similarly move a slice of A in one chunk. Then back to searching B for where A[0] belongs, etc. We stay in galloping mode until both searches find slices to copy less than MIN_GALLOP elements long, at which point we go back to one-pair-at-a-time mode.

如果该计数达到 MIN_GALLOP,我们将切换到“飞奔模式”(galloping mode)。在这里,我们在 B 中 查找 A[0] 所属的位置,然后把此位置之前的所有 B 元素一次性移动到合并区,再把 A[0] 也移动到合并区。然后我们在 A 中查找 B[0] 所属的位置,同样地把此位置之前的所有 A 元素一次性移动到合并区,再把 B[0] 也移动到合并区。然后回到 B 中查找 A[0] 的所属位置,等等。我们继续停留在飞奔模式(galloping mode),直到两项查找发现的能一次性复制的元素数量都小于 MIN_GALLOP,此时我们又回到一次一对(one-pair-at-a-time)的模式。

A refinement: The MergeState struct contains the value of min_gallop that controls when we enter galloping mode, initialized to MIN_GALLOP. merge_lo() and merge_hi() adjust this higher when galloping isn’t paying off, and lower when it is.

一个细微的改进:MergeState 结构包含一个名为 min_gallop 的值,该值控制我们何时进入“飞奔模式”(galloping mode),它被初始化为 MIN_GALLOPmerge_lo()merge_hi() 会在没有回报时调高这个值,在有回报时调低这个值。

飞奔模式(Galloping)

Still without loss of generality, assume A is the shorter run. In galloping mode, we first look for A[0] in B. We do this via “galloping”, comparing A[0] in turn to B[0], B[1], B[3], B[7], …, B[2 ** j - 1], …, until finding the k such that B[2 ** (k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most roughly lg(B) comparisons, and, unlike a straight binary search, favors finding the right spot early in B (more on that later).

仍然不失一般性,假设 A 是较短的 run。在“飞奔模式”(galloping mode)下,我们首先在 B 中寻找 A[0]。我们通过 galloping 来实现,依次将 A[0]B[0]B[1]B[3]B[7]、…、B[2**j - 1]、…,直到找到 k 使得 B[2**(k-1) - 1] < A[0] <= B[2**k - 1]。这最多需要大约 lg(B) 次比较,并且,不同于直接的二分查找,galloping 能够更早地在 B 中找到正确位置(稍后再详细介绍)。

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 consecutive elements, and a straight binary search requires exactly k-1 additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we copy all the B’s up to that point in one chunk, and then copy A[0]. Note that no matter where A[0] belongs in B, the combination of galloping + binary search finds it in no more than about 2*lg(B) comparisons.

在找到这样一个 k 之后,不确定区域被缩小到 2**(k-1) - 1 个连续元素,而直接的二分查找需要恰好 k-1 次额外的比较才能找到这个 k(详见 REGION OF UNCERTAINTY 注释)。然后我们把此位置之前的所有 B 元素一次性移动到合并区,再把 A[0] 也移动到合并区。注意,不管 A[0] 在 B 中属于哪个位置,galloping + 二分查找的组合都能在不超过 2*lg(B) 次比较中找到它。

If we did a straight binary search, we could find it in no more than ceiling(lg(B+1)) comparisons – but straight binary search takes that many comparisons no matter where A[0] belongs. Straight binary search thus loses to galloping unless the run is quite long, and we simply can’t guess whether it is in advance.

如果我们做一个直接的二分查找,我们会发现它的比较次数不超过 ceiling(lg(B+1)) —— 但无论 A[0] 所属的位置在哪里,直接的二分查找都需要这么多次的比较。因此直接二分查找输给了 galloping,除非 run 很长,并且我们无法简单地预先猜测。

If data is random and runs have the same length, A[0] belongs at B[0] half the time, at B[1] a quarter of the time, and so on: a consecutive winning sub-run in B of length k occurs with probability 1/2**(k+1). So long winning sub-runs are extremely unlikely in random data, and guessing that a winning sub-run is going to be long is a dangerous game.

如果数据是随机的,并且 run 的长度相同,有一半的几率 A[0] 的正确位置是 B[0] ,有四分之一的几率正确位置是 B[1] ,以此类推:在长度为 k 的 B 中,连续获胜的 sub-run 出现的概率是 1/2**(k+1)。所以在随机数据中,长期获胜的 sub-run 是极不可能的,猜测一个获胜的 sub-run 的长度很长是一个危险的游戏。

OTOH, if data is lopsided or lumpy or contains many duplicates, long stretches of winning sub-runs are very likely, and cutting the number of comparisons needed to find one from O(B) to O(log B) is a huge win.

另一方面,如果数据是不平衡的或者块状的或者包含许多重复值,那么获胜 sub-run 很可能会很长,并且把比较次数从 O(B) 减少到 O(log B) 是一个巨大的胜利。

Galloping compromises by getting out fast if there isn’t a long winning sub-run, yet finding such very efficiently when they exist.

如果没有较长的获胜 sub-run ,就会迅速退出 galloping 以妥协,但是如果存在这样的 sub-run,galloping 是找到它们的高效方法。

I first learned about the galloping strategy in a related context; see:

我第一次了解到 galloping 策略是在一个相关文献中;参见:

“Adaptive Set Intersections, Unions, and Differences” (2000) Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

“自适应集合交集、并集和差集”(2000)作者 Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s). An earlier paper called the same strategy “exponential search”:

及其跟进工作。早些时候的一篇论文称同样的策略为“指数搜索”(exponential search):

“Optimistic Sorting and Information Theoretic Complexity” Peter McIlroy SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp 467-474, Austin, Texas, 25-27 January 1993.

“乐观排序与信息论复杂性”作者 Peter McIlroy

and it probably dates back to an earlier paper by Bentley and Yao. The McIlroy paper in particular has good analysis of a mergesort that’s probably strongly related to this one in its galloping strategy.

这可能要追溯到 Bentley 和 Yao 早期的一篇论文。McIlroy 的论文特别对 mergesort 做了很好的分析,这很可能与本文的 galloping 策略密切相关。

Galloping 的局限性(Galloping with a Broken Leg)

So why don’t we always gallop? Because it can lose, on two counts:

那么,为什么我们不总是 gallop 呢? 因为它可能会失败,原因有两个:

  1. While we’re willing to endure small per-merge overheads, per-comparison overheads are a different story. Calling Yet Another Function per comparison is expensive, and gallop_left() and gallop_right() are too long-winded for sane inlining.

1.虽然我们愿意忍受每次合并的小开销,但是每次比较的开销就不一样了。为每次比较都调用另一个函数(Yet Another Function)开销很大,而且 gallop_left()gallop_right()对于正常的内联(inline)来说太冗长了。

  1. Galloping can-- alas --require more comparisons than linear one-at-time search, depending on the data.

2.根据数据的不同,galloping 可能需要比线性一次搜索更多的比较。

#2 requires details. If A[0] belongs before B[0], galloping requires 1 compare to determine that, same as linear search, except it costs more to call the gallop function. If A[0] belongs right before B[1], galloping requires 2 compares, again same as linear search. On the third compare, galloping checks A[0] against B[3], and if it’s <=, requires one more compare to determine whether A[0] belongs at B[2] or B[3]. That’s a total of 4 compares, but if A[0] does belong at B[2], linear search would have discovered that in only 3 compares, and that’s a huge loss! Really. It’s an increase of 33% in the number of compares needed, and comparisons are expensive in Python.

第 2 点需要细节。如果 A[0] 所属位置在 B[0] 之前,则 galloping 需要 1 次比较来确定,与线性搜索相同,只是调用 gallop 函数的成本更高。如果 A[0] 所属位置恰好位于 B[1] 之前,则 galloping 需要 2 次比较,同样与线性搜索相同。在第 3 次比较中,galloping 检查 A[0]B[3] ,如果结果是 <=,则需要再进行一次比较,以确定 A[0] 所属位置是位于 B[2] 还是 B[3]。这相当于总共有 4 次比较,但是如果 A[0] 确实应位于 B[2] ,线性搜索只有 3 次比较就能发现,这是一个巨大的损失!真的。需要进行的比较次数增加了 33% ,而且在 Python 中进行比较是很昂贵的。

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
                                        ...
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons. The advantage of galloping is unbounded as i grows, but it doesn’t win at all until i=6. Before then, it loses twice (at i=2 and i=4), and ties at the other values. At and after i=6, galloping always wins.

一般来说,如果 A[0] 所属位置是位于 B[i] ,线性查找需要 i+1 次比较才能确定,而 galloping 总共需要 2*floor(lg(i))+2 次比较。galloping 的好处是随着 i 的增大而不受限制,但是直到 i = 6 它才能胜过线性查找。在此之前,它有两次不如线性查找(在 i = 2i = 4 处),在 i = 其他值时与线性查找打成平手。在 i = 6 时及以后,galloping 总是获胜。

We can’t guess in advance when it’s going to win, though, so we do one pair at a time until the evidence seems strong that galloping may pay. MIN_GALLOP is 7, and that’s pretty strong evidence. However, if the data is random, it simply will trigger galloping mode purely by luck every now and again, and it’s quite likely to hit one of the losing cases next. On the other hand, in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it “should be” then. So the MergeState struct keeps a min_gallop variable that merge_lo and merge_hi adjust: the longer we stay in galloping mode, the smaller min_gallop gets, making it easier to transition back to galloping mode (if we ever leave it in the current merge, and at the start of the next merge). But whenever the gallop loop doesn’t pay, min_gallop is increased by one, making it harder to transition back to galloping mode (and again both within a merge and across merges). For random data, this all but eliminates the gallop penalty: min_gallop grows large enough that we almost never get into galloping mode. And for cases like ~sort, min_gallop can fall to as low as 1. This seems to work well, but in all it’s a minor improvement over using a fixed MIN_GALLOP value.

我们不能预先猜测它什么时候会赢,所以我们先采用一次一对的模式(one pair at a time),直到有足够的证据证明 galloping 可能会赢。MIN_GALLOP 等于 7,这是相当有力的证据。然而,如果数据是随机的,那么它就会时不时地纯粹靠运气触发 galloping 模式,而且很有可能接下来会碰上一个失败的案例。另一方面,在像 ~sort 这样的情况下,galloping 总是有回报的,而且 MIN_GALLOP 比它“应该”的要大。因此 MergeState 结构保留了一个在 merge_lomerge_hi 中可以调整的 min_gallop 变量:我们在 galloping 模式下停留的时间越长, min_gallop 就越小,这使得我们更容易转换回 galloping 模式(如果我们在当前的合并中离开了此模式,在下一次合并时开始)。但是每当 galloping 循环不起作用时,min_gallop 就会加 1,这使得转换回 galloping 模式变得更加困难(在一次合并和跨合并中也是如此)。对于随机数据,这几乎消除了 galloping 的代价:min_gallop 增长到足够大,以至于我们几乎永远不会进入 galloping 模式。对于像 ~sort 这样的情况, min_gallop 可以降到 1。这看起来工作得很好,但是总的来说,这是一个比使用固定的 MIN_GALLOP 值稍微有所改进的地方。

Galloping 面临的难题(Galloping Complication)

The description above was for merge_lo. merge_hi has to merge “from the other end”, and really needs to gallop starting at the last element in a run instead of the first. Galloping from the first still works, but does more comparisons than it should (this is significant – I timed it both ways). For this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT) functions have a “hint” argument, which is the index at which galloping should begin. So galloping can actually start at any index, and proceed at offsets of 1, 3, 7, 15, … or -1, -3, -7, -15, … from the starting index.

上面的描述是关于 merge_lo 的。merge_hi 必须从另一端(尾端/右端)开始合并,并且必须从 run 的最后一个元素开始合并而不是第一个元素。从第一个元素开始 galloping 仍然是有用的,但是其比较次数会比本应的次数更多(这很重要 —— 我以两种方式计算了时间)。由于这个原因,gallop_left()gallop_right() 函数(参见 LEFT OR RIGHT 注释)有一个 hint 参数,这是 galloping 应该开始的位置。所以实际上,galloping 可以从任何位置开始,并从这个开始位置以偏移量 1、3、7、15、… 或者-1、-3、-7、-15、… 进行。

In the code as I type it’s always called with either 0 or n-1 (where n is the # of elements in a run). It’s tempting to try to do something fancier, melding galloping with some form of interpolation search; for example, if we’re merging a run of length 1 with a run of length 10000, index 5000 is probably a better guess at the final result than either 0 or 9999. But it’s unclear how to generalize that intuition usefully, and merging of wildly unbalanced runs already enjoys excellent performance.

在我写的代码中,hint 总是使用 0n-1 来调用(其中 n 是 run 的元素数量)。尝试做一些更新奇的事情是很有诱惑力的,融合 galloping 和某种形式的内插查找(interpolation search);例如,如果我们合并长度为 1 和长度为 10000 的 run,从 5000 开始猜测最终结果可能是比 0 或 9999 更好。但是如何有效地归纳这种直觉尚不清楚,而且合并那些极不平衡的 run 已经获得了卓越的性能。

~sort is a good example of when balanced runs could benefit from a better hint value: to the extent possible, this would like to use a starting offset equal to the previous value of acount/bcount. Doing so saves about 10% of the compares in ~sort. However, doing so is also a mixed bag, hurting other cases.

~sort 是一个很好的例子,说明了平衡的 run 可以从更好的 hint 值中获益:在可能的范围内,这将使用与前面的值 acount/bcount 相等的起始偏移量。这样做可以在 ~sort 中节省大约 10% 的比较。然而,这样做也是好坏参半,损伤了其他情况下的性能。

比较两种算法在随机数组中的平均比较次数(Comparing Average # of Compares on Random Arrays)

[NOTE: This was done when the new algorithm used about 0.1% more compares on random data than does its current incarnation.]

[注意: 这是在新算法(timsort)对于随机数据使用的比较次数比其当前版本多 0.1% 时完成的。]

Here list.sort() is samplesort, and list.msort() this sort:

这里 list.sort() 是指 samplesort,而 list.msort() 是指新算法:

import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in range(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s  %9.2f  %11d"
f2     = "%5s  %9.2f  %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort",  sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50

I ran this on Windows and kept using the computer lightly while it was running. time.clock() is wall-clock time on Windows, with better than microsecond resolution. samplesort started with a 1.52% #-of-comparisons disadvantage, fell quickly to 1.48%, and then fluctuated within that small range. Here’s the last chunk of output before I killed the job:

我在 Windows 上运行这段代码,并在运行它时只轻微使用了这台电脑。 time.clock() 是 Windows 的墙上时钟时间,分辨率高于微秒。samplesort 开始时的比较次数劣势为 1.52% ,很快下降到 1.48% ,然后在这个小范围内波动。下面是我终止这段代码前的最后一块输出:

count 2630 nelts 130906543
 sort    6110.80   1937887573
msort    6002.78   1909389381
            1.80         1.49
  • 1
  • 2
  • 3
  • 4

We’ve done nearly 2 billion comparisons apiece at Python speed there, and that’s enough .

我们以 Python 的速度分别做了将近20亿次的比较,这已经足够了<眨眼>。

For random arrays of size 2 (yes, there are only 2 interesting ones), samplesort has a 50%(!) comparison disadvantage. This is a consequence of samplesort special-casing at most one ascending run at the start, then falling back to the general case if it doesn’t find an ascending run immediately. The consequence is that it ends up using two compares to sort [2, 1]. Gratifyingly, timsort doesn’t do any special-casing, so had to be taught how to deal with mixtures of ascending and descending runs efficiently in all cases.

对于大小为 2 的随机数组(是的,只有2个有趣的元素),samplesort 有 50%(!) 的比较劣势。这是因为 samplesort 在开始时至多把一个升序 run 划为特殊案例,如果它没有发现一个升序 run 就会立即返回到普通案例(general case)。结果是它最终使用了两次比较来排序[2,1]。令人欣慰的是,timsort 没有任何的特殊案例,所以必须教会它如何在所有情况下有效地处理混合了升序和降序的 run。

注释(NOTES)

二分插入排序(BINSORT)

A “binary insertion sort” is just like a textbook insertion sort, but instead of locating the correct position of the next item via linear (one at a time) search, an equivalent to Python’s bisect.bisect_right is used to find the correct position in logarithmic time. Most texts don’t mention this variation, and those that do usually say it’s not worth the bother: insertion sort remains quadratic (expected and worst cases) either way. Speeding the search doesn’t reduce the quadratic data movement costs.

“二分插入排序”就像教科书中的插入排序,但不是通过线性(一次一个)查找来定位下一个元素的正确位置,而是使用了相当于 Python’s bisect.bisect_right 的查找方法在对数时间内找到正确位置。大多数文献都没有提到这种变化,而那些提到这种变化的文献通常认为不值得这么麻烦:插入排序无论如何都是二次方的(平均情况和最坏情况)。加速查找过程并不能降低二次方的数据移动成本。

But in CPython’s case, comparisons are extraordinarily expensive compared to moving data, and the details matter. Moving objects is just copying pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary user-supplied Python code), but even in simple cases (like 3 < 4) all decisions are made at runtime: what’s the type of the left comparand? the type of the right? do they need to be coerced to a common type? where’s the code to compare these types? And so on. Even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls.

但是在 CPython 中,与移动数据相比,比较是非常昂贵的,而且细节很重要。移动对象只是复制指针。比较可能是任意昂贵的(可以调用任意用户提供的 Python 代码),但即使在简单的情况下(例如3 < 4所有决策都是在运行时做出的:左比较类型是什么?右比较的类型?他们需要被强迫成为一种普通类型吗?比较这些类型的代码在哪里?诸如此类。即使是最简单的 Python 比较也会触发大量的 C-level 指针解引用、条件和函数调用。

So cutting the number of compares is almost always measurably helpful in CPython, and the savings swamp the quadratic-time data movement costs for reasonable minrun values.

因此,在 CPython 中,减少比较次数几乎总是有明显的帮助,节省下来的成本可以抵消二次方时间的数据移动成本,从而获得合理的 minrun 值。

左或右(LEFT OR RIGHT)

gallop_left() and gallop_right() are akin to the Python bisect module’s bisect_left() and bisect_right(): they’re the same unless the slice they’re searching contains a (at least one) value equal to the value being searched for. In that case, gallop_left() returns the position immediately before the leftmost equal value, and gallop_right() the position immediately after the rightmost equal value. The distinction is needed to preserve stability. In general, when merging adjacent runs A and B, gallop_left is used to search thru B for where an element from A belongs, and gallop_right to search thru A for where an element from B belongs.

gallop_left()gallop_right() 类似于 Python bisect 模块的 bisect_left()bisect_right():它们是相同的,除非它们正在搜索的区域包含一个(至少一个)与正在搜索的值相等的值。在这种情况下,gallop_left()返回紧靠最左等值之前的位置,而 gallop_right() 返回紧靠最右等值之后的位置。为了保持排序的稳定,这种区别是必要的。一般来说,当合并相邻的run A 和 B 时,gallop_left 用于在 B 中搜索来自 A 的元素所属的位置,gallop_right 用于在 A 中搜索来自 B 的元素所属的位置。

不确定区域(REGION OF UNCERTAINTY)

Two kinds of confusion seem to be common about the claim that after finding a k such that

有两种混淆似乎很常见,关于找到满足以下条件的 k 的说明

B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
  • 1

then a binary search requires exactly k-1 tries to find A[0]'s proper location. For concreteness, say k=3, so B[3] < A[0] <= B[7].

二分查找需要 k-1 次尝试才能找到 A[0] 的正确位置。具体一点,比如说 k=3,因此 B[3] < A[0] <= B[7]

The first confusion takes the form “OK, then the region of uncertainty is at indices 3, 4, 5, 6 and 7: that’s 5 elements, not the claimed 2**(k-1) - 1 = 3”; or the region is viewed as a Python slice and the objection is “but that’s the slice B[3:7], so has 7-3 = 4 elements”. Resolution: we’ve already compared A[0] against B[3] and against B[7], so A[0]'s correct location is already known wrt both endpoints. What remains is to find A[0]'s correct location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2k-1)-1 inclusive = 2(k-1) through (2**k-1)-1 inclusive, which has

第一种混淆的形式是“ OK,那么不确定区域在下标 3、4、5、6 和 7处:这是5个元素,不是声称的2**(k-1) - 1 = 3”; 或者这个区域被视为一个 Python 切片,反对意见是“但这是切片 B[3:7] ,所以有 7-3 = 4 个元素”。解释:我们已经将 A[0]B[3] 以及 B[7] 进行了比较,所以 A[0] 的正确位置已经知道了两个端点。剩下的就是在 B[4]B[5]B[6] 中找到 A[0] 的正确位置关于 ,它们跨越了 3 个元素。或者一般来说,切片(去掉两个端点)(2**(k-1)-1)+1(2**k-1)-1 包含 = 2**(k-1)(2**k-1)-1,即以下数量的元素

(2**k-1)-1 - 2**(k-1) + 1 =
2**k-1 - 2**(k-1) =
2*2**(k-1)-1 - 2**(k-1) =
(2-1)*2**(k-1) - 1 =
2**(k-1) - 1
  • 1
  • 2
  • 3
  • 4
  • 5

elements.

The second confusion: “k-1 = 2 binary searches can find the correct location among 2**(k-1) = 4 elements, but you’re only applying it to 3 elements: we could make this more efficient by arranging for the region of uncertainty to span 2**(k-1) elements.” Resolution: that confuses “elements” with “locations”. In a slice with N elements, there are N+1 locations. In the example, with the region of uncertainty B[4], B[5], B[6], there are 4 locations: before B[4], between B[4] and B[5], between B[5] and B[6], and after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1) locations. That’s why k-1 binary searches are necessary and sufficient.

第二个混淆:“ k-1 = 2 次二分查找可以在 2**(k-1) = 4 个元素中找到正确的位置,但是你只将它应用于 3 个元素:我们可以通过将不确定区域设置为跨越 2**(k-1) 个元素来提高效率。” 解释:混淆“元素”和“位置”。在一个有 N 个元素的切片中,有 N+1位置。在不确定区域 B[4]B[5]B[6]中,有 4 个位置:B[4] 之前,B[4]B[5]之间,B[5]B[6]之间,B[6]之后。通常,跨越 2**(k-1)-1 个元素,有 2**(k-1) 个位置。这就是为什么 k-1 次二分查找是必要的和充分的。

对单次比较的优化(OPTIMIZATION OF INDIVIDUAL COMPARISONS)

As noted above, even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls. This can be partially mitigated by pre-scanning the data to determine whether the data is homogenous with respect to type. If so, it is sometimes possible to substitute faster type-specific comparisons for the slower, generic PyObject_RichCompareBool.

如上所述,即使是最简单的 Python 比较也会触发大量的 C-level 指针解引用、条件和函数调用。通过预先扫描数据以确定数据在类型方面是否同质,可以部分缓解这一问题。如果是这样的话,有时可以用更快的类型特定的比较来替代较慢的、普通的 PyObject_RichCompareBool.

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/Monodyee/article/detail/85164
推荐阅读
相关标签
  

闽ICP备14008679号