Question

Using doubles in a numeric benchmark mitigates an odd performance penalty...

asked Jan 10, 2023 in Java Interop by Tom

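I stumbled upon a performance benchmark here, and I was curious why Clojure's performance is worse than Java's.

So I put it through a profiler (after modifying their version to use unchecked math, which didn't help) and nothing showed up. Hmm. I decompiled it and found: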

// Decompiling class: leibniz$calc_pi_leibniz
import clojure.lang.*;

public final class leibniz$calc_pi_leibniz extends AFunction implements LD
{
    public static double invokeStatic(final long rounds) {
        final long end = 2L + rounds;
        long i = 2L;
        double x = 1.0;
        double pi = 1.0;
        while (i != end) {
            final double x2 = -x;
            final long n = i + 1L;
            final double n2 = x2;
            pi += Numbers.divide(x2, 2L * i - 1L);
            x = n2;
            i = n;
        }
        return Numbers.unchecked_multiply(4L, pi);
    }

    @Override
    public Object invoke(final Object o) {
        return invokeStatic(RT.uncheckedLongCast(o));
    }

    @Override
    public final double invokePrim(final long rounds) {
        return invokeStatic(rounds);
    }
}

So it looks like the double/long boundary incurs at least the cost of a method lookup, maybe in Numbers.divide?
So I just coerced everything to double (even our index variable):

(def rounds 100000000)

(defn calc-pi-leibniz2
  "Eliminate mixing of long/double to avoid clojure.numbers invocations."
  ^double
  [^long rounds]
  (let [end (+ 2.0 rounds)]
    (loop [i 2.0 x 1.0 pi 1.0]
      (if (= i end)
        (* 4.0 pi)
        (let [x (- x)]
          (recur (inc i) x (+ pi (/ x (dec (* 2 i))))))))))

leibniz=> (c/quick-bench (calc-pi-leibniz rounds))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 575.352216 ms
    Execution time std-deviation : 10.070268 ms
   Execution time lower quantile : 566.210399 ms ( 2.5%)
   Execution time upper quantile : 588.772187 ms (97.5%)
                   Overhead used : 1.884700 ns
nil
leibniz=> (c/quick-bench (calc-pi-leibniz2 rounds))
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 158.509049 ms
    Execution time std-deviation : 759.113165 µs
   Execution time lower quantile : 157.234899 ms ( 2.5%)
   Execution time upper quantile : 159.205374 ms (97.5%)
                   Overhead used : 1.884700 ns
nil

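Does anyone know why the Java implementation doesn't incur the same penalty when doing the division? [Both versions use unchecked-math :warn-on-boxed.]

I also tried a variant using fastmath's primitive math operators, and it was actually slower. So far nothing has beaten coercing the loop index i to a double (which I normally wouldn't do).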

commented Jan 10, 2023 by Ben Sless

In my benchmarks, this approach gives the same performance as your double solution, without needing to coerce the index to double.

(defn calc-pi-leibniz3
  "Eliminate mixing of long/double to avoid clojure.numbers invocations."
  ^double
  [^long rounds]
  (let [end (+ 2 rounds)]
    (loop [i 2 x 1.0 pi 1.0]
      (if (= i end)
        (* 4.0 pi)
        (let [x (- x)]
          (recur (inc i) x (+ pi (/ x (double (unchecked-dec-int (unchecked-multiply-int (unchecked-int 2) (unchecked-int i)))))))))))))

It comes down to being aware of where each compiler inserts type-coercion instructions.

Here's the bytecode for the Java solution:

  public static double go(int);
    descriptor: (I)D
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=6, locals=6, args_size=1
         0: dconst_1
         1: dstore_1
         2: dconst_1
         3: dstore_3
         4: iconst_2
         5: istore        5
         7: iload         5
         9: iload_0
        10: iconst_2
        11: iadd
        12: if_icmpge     39
        15: dload_3
        16: ldc2_w        #24                 // double -1.0d
        19: dmul
        20: dstore_3
        21: dload_1
        22: dload_3
        23: iconst_2
        24: iload         5
        26: imul
        27: iconst_1
        28: isub
        29: i2d
        30: ddiv
        31: dadd
        32: dstore_1
        33: iinc          5, 1
        36: goto          7
        39: dload_1
        40: ldc2_w        #26                 // double 4.0d
        43: dmul
        44: dup2
        45: dstore_1
        46: dreturn
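
For anyone who finds source easier to read than bytecode, the listing above reads back into Java roughly as follows. This is only a reconstruction from the instructions, not the benchmark's actual file: the class name LeibnizInt and the variable names pi, x and i are assumptions, since the listing only shows local slots 1, 3 and 5.

```java
// Hypothetical reconstruction of go(int) from the bytecode listing.
// Class and variable names are assumptions, not taken from the benchmark.
final class LeibnizInt {
    static double go(int rounds) {
        double pi = 1.0;                 // dconst_1, dstore_1
        double x = 1.0;                  // dconst_1, dstore_3
        for (int i = 2; i < rounds + 2; i++) {
            x *= -1.0;                   // ldc2_w -1.0d, dmul
            // 2 * i - 1 stays in int arithmetic (imul, isub); a single i2d
            // widens the result just before the ddiv.
            pi += x / (2 * i - 1);
        }
        return pi *= 4.0;                // dmul, dup2, dstore_1, dreturn
    }

    public static void main(String[] args) {
        System.out.println(go(1_000_000));
    }
}
```

The point to notice is that the only conversion in the loop body is one i2d, and there is no method call anywhere in the loop.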

Here it is for your solution:

    public static double invokeStatic(long rounds);
        Flags: PUBLIC, STATIC
        Code:
               0: ldc2_w          2.0
               3: lload_0         /* rounds */
               4: invokestatic    clojure/lang/Numbers.add:(DJ)D
               7: dstore_2        /* end */
               8: ldc2_w          2.0
              11: dstore          i
              13: dconst_1
              14: dstore          x
              16: dconst_1
              17: dstore          pi
              19: dload           i
              21: dload_2         /* end */
              22: dcmpl
              23: ifne            36
              26: ldc2_w          4.0
              29: dload           pi
              31: dmul
              32: goto            72
              35: athrow
              36: dload           x
              38: dneg
              39: dstore          x
              41: dload           i
              43: dconst_1
              44: dadd
              45: dload           x
              47: dload           pi
              49: dload           x
              51: ldc2_w          2
              54: dload           i
              56: invokestatic    clojure/lang/Numbers.multiply:(JD)D
              59: dconst_1
              60: dsub
              61: ddiv
              62: dadd
              63: dstore          pi
              65: dstore          x
              67: dstore          i
              69: goto            19
              72: dreturn

And for my solution:

    public static double invokeStatic(long rounds);
        Flags: PUBLIC, STATIC
        Code:
               0: ldc2_w          2
               3: lload_0         /* rounds */
               4: ladd
               5: lstore_2        /* end */
               6: ldc2_w          2
               9: lstore          i
              11: dconst_1
              12: dstore          x
              14: dconst_1
              15: dstore          pi
              17: lload           i
              19: lload_2         /* end */
              20: lcmp
              21: ifne            34
              24: ldc2_w          4.0
              27: dload           pi
              29: dmul
              30: goto            71
              33: athrow
              34: dload           x
              36: dneg
              37: dstore          x
              39: lload           i
              41: lconst_1
              42: ladd
              43: dload           x
              45: dload           pi
              47: dload           x
              49: ldc2_w          2
              52: l2i
              53: lload           i
              55: l2i
              56: imul
              57: iconst_1
              58: isub
              59: i2d
              60: ddiv
              61: dadd
              62: dstore          pi
              64: dstore          x
              66: lstore          i
              68: goto            17
              71: dreturn

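It also avoids all method calls and operates directly on primitives.

Still, for all that manipulation, it performs no better than your solution.

By my benchmarks, both solutions perform the same as the Java version.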
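
As a rough Java analogue of the long-indexed variant (calc-pi-leibniz3), here is a sketch that keeps a long loop counter and narrows it to int only for the denominator, next to a version that does the denominator in long arithmetic. The class and method names are hypothetical, and this is an illustration of where the conversions sit, not a benchmark.

```java
// Hypothetical sketch; names are assumptions, not from the thread's code.
final class LeibnizLongIndex {
    // Long counter, denominator computed in int and widened once.
    static double calcPiNarrowed(long rounds) {
        long end = 2L + rounds;
        double x = 1.0;
        double pi = 1.0;
        for (long i = 2L; i != end; i++) {
            x = -x;
            // Explicit narrowing then widening, in the same order as the
            // unchecked-int / unchecked-multiply-int / double calls.
            // Caution: 2 * (int) i overflows once i exceeds about 1.07e9.
            pi += x / (double) (2 * (int) i - 1);
        }
        return 4.0 * pi;
    }

    // Same loop with the denominator in long arithmetic, widened by javac.
    static double calcPiWide(long rounds) {
        long end = 2L + rounds;
        double x = 1.0;
        double pi = 1.0;
        for (long i = 2L; i != end; i++) {
            x = -x;
            pi += x / (2L * i - 1L);
        }
        return 4.0 * pi;
    }

    public static void main(String[] args) {
        System.out.println(calcPiNarrowed(1_000_000L));
        System.out.println(calcPiWide(1_000_000L));
    }
}
```

In both methods every operation is a primitive instruction chosen at compile time, which is the property the Clojure versions are trying to reach by hand.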

commented Jan 10, 2023 by Tom

commented Jan 12, 2023 by Chris Nuernberger

1 Answer

alexmiller · Answer 1 · 2023-01-10T18:46:08+0000

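I won't have time to look at this for a while, but it's easy to end up boxing across the loop/recur boundary, which can cause a big slowdown. The easiest way to tell is to look at the bytecode.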
