#:g1: SBCLにcaseのジャンプテーブル最適化が入ったので試してみる

Posted 2019-11-27 20:02:15 GMT

昨日リリースされたSBCL 1.5.9にcaseのジャンプテーブル最適化が入ったようなので早速どんなものか試してみたいと思います。

とりあえず若干わざとらしいものを試してみます。
caseのキーに0から511までの数値をシャッフルしたものを指定して分岐し、さらに二段目のcaseで元に戻すのを5回繰り返すのを1000繰り返してみます。

(defconstant nbranch 512)

;; alexandria (defun shuffle (sequence &key (start 0) end) "Returns a random permutation of SEQUENCE bounded by START and END. Original sequece may be destructively modified, and share storage with the original one. Signals an error if SEQUENCE is not a proper sequence." (declare (type fixnum start) (type (or fixnum null) end)) (etypecase sequence (list (let* ((end (or end (length sequence))) (n (- end start))) (do ((tail (nthcdr start sequence) (cdr tail))) ((zerop n)) (rotatef (car tail) (car (nthcdr (random n) tail))) (decf n)))) (vector (let ((end (or end (length sequence)))) (loop for i from start below end do (rotatef (aref sequence i) (aref sequence (+ i (random (- end i)))))))) (sequence (let ((end (or end (length sequence)))) (loop for i from (- end 1) downto start do (rotatef (elt sequence i) (elt sequence (+ i (random (- end i))))))))) sequence)

(defmacro casetabletest (x) (let ((xy (loop :for x :across (shuffle (let ((vec (make-sequence 'vector nbranch))) (dotimes (i nbranch vec) (setf (elt vec i) i)))) :for i :from 0 :collect (list i x)))) `(case (case ,x ,@xy (otherwise -1)) ,@(mapcar #'reverse xy) (otherwise -1))))

(defun casetest (&aux (n 0)) (dotimes (i nbranch n) (incf n (casetabletest (casetabletest (casetabletest (casetabletest (casetabletest i))))))))

(compile 'casetest)

(time (dotimes (i 1000) (casetest)))

SBCL 1.5.8

t% /l/sbcl/1.5.8/bin/sbcl --no-sysinit --no-userinit --load /tmp/case.lisp --quit 
This is SBCL 1.5.8, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty. It is mostly in the public domain; some portions are provided under BSD-style licenses. See the CREDITS and COPYING files in the distribution for more information. Evaluation took: 1.986 seconds of real time 1.990000 seconds of total run time (1.990000 user, 0.000000 system) 100.20% CPU 6,537,459,720 processor cycles 0 bytes consed

SBCL 1.5.9

t% /l/sbcl/1.5.9/bin/sbcl --no-sysinit --no-userinit --load /tmp/case.lisp --quit 
This is SBCL 1.5.9, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty. It is mostly in the public domain; some portions are provided under BSD-style licenses. See the CREDITS and COPYING files in the distribution for more information. Evaluation took: 0.056 seconds of real time 0.060000 seconds of total run time (0.060000 user, 0.000000 system) 107.14% CPU 184,341,012 processor cycles 0 bytes consed

この極端な例では35倍も速くなっています。
まあこんなことはそうそうないですが!

ちなみに類似の最適化を実施するClozure CLでも同じ位のスピードが出るようです。

t% /l/ccl/1.11.5/lx86cl64 -n -l /tmp/case.lisp -e '(quit)'
(DOTIMES (I 1000) (CASETEST))
took 55,783 microseconds (0.055783 seconds) to run.
During that period, and with 8 available CPU cores,
     60,000 microseconds (0.060000 seconds) were spent in user mode
          0 microseconds (0.000000 seconds) were spent in system mode

発動ルールを探る

上記の例では最適化が発動しましたが、caseのジャンプテーブル化ではそんなに大きなテーブルは作らないことがほとんどなので、SBCLではどういう縛りがあるか確認してみます。

発動ルールは、src/compiler/ir2opt.lispshould-use-jump-table-pの中に記述されているようで、

のようです。

(defun should-use-jump-table-p (chain &aux (choices (car chain)))
  ;; Dup keys could exist. REMOVE-DUPLICATES from-end can handle that:
  ;;  "the one occurring earlier in sequence is discarded, unless from-end
  ;;   is true, in which case the one later in sequence is discarded."
  (let ((choices (remove-duplicates choices :key #'car :from-end t))) 
    ;; Convert to multiway only if at least 4 key comparisons would be needed.
    (unless (>= (length choices) 4)
      (return-from should-use-jump-table-p nil))
    (let ((values (mapcar #'car choices)))
      (cond ((every #'fixnump values)) ; ok
            ((every #'characterp values)   
             (setq values (mapcar #'sb-xc:char-code values)))
            (t
             (return-from should-use-jump-table-p nil)))
      (let* ((min (reduce #'min values))
             (max (reduce #'max values))
             (table-size (1+ (- max min )))
             (size-limit (* (length values) 2)))
        ;; Don't waste too much space, e.g. {5,6,10,20} would require 16 words
        ;; for 4 entries, which is excessive.
        (when (and (<= table-size size-limit)
                   (can-encode-jump-table-p min max))
          ;; Return the new choices
          (cons choices (cdr chain)))))))

上記ルールからすると、一つ置きで配置された整数のキーは最適化されますが、二つ置きだとルールから外れるので最適化されないことが分かります。
一応試してみましょう。

(defun foo2 (x)
  (declare (type fixnum x))
  #.`(case x
       ,@(loop :for i :from 0 :by 2 :repeat 10
               :collect (list i i))
       (otherwise -1)))

; disassembly for FOO2
; Size: 110 bytes. Origin: #x52DF52DA                         ; FOO2
; 2DA:       498B4510         MOV RAX, [R13+16]               ; thread.binding-stack-pointer
; 2DE:       488945F8         MOV [RBP-8], RAX
; 2E2:       4C8BDB           MOV R11, RBX
; 2E5:       4983FB24         CMP R11, 36
; 2E9:       774E             JNBE L10
; 2EB:       488D0526FFFFFF   LEA RAX, [RIP-218]              ; = #x52DF5218
; 2F2:       42FF2498         JMP QWORD PTR [RAX+R11*4]
; 2F6: L0:   BA04000000       MOV EDX, 4
; 2FB: L1:   488BE5           MOV RSP, RBP
; 2FE:       F8               CLC
; 2FF:       5D               POP RBP
; 300:       C3               RET
; 301: L2:   BA08000000       MOV EDX, #x8                    ; is_lisp_thread
; 306:       EBF3             JMP L1
; 308: L3:   BA0C000000       MOV EDX, 12
; 30D:       EBEC             JMP L1
; 30F: L4:   BA10000000       MOV EDX, 16
; 314:       EBE5             JMP L1
; 316: L5:   BA14000000       MOV EDX, 20
; 31B:       EBDE             JMP L1
; 31D: L6:   BA18000000       MOV EDX, 24
; 322:       EBD7             JMP L1
; 324: L7:   BA1C000000       MOV EDX, 28
; 329:       EBD0             JMP L1
; 32B: L8:   BA20000000       MOV EDX, 32
; 330:       EBC9             JMP L1
; 332: L9:   BA24000000       MOV EDX, 36
; 337:       EBC2             JMP L1
; 339: L10:  48C7C2FEFFFFFF   MOV RDX, -2
; 340:       EBB9             JMP L1
; 342: L11:  31D2             XOR EDX, EDX
; 344:       EBB5             JMP L1
; 346:       CC10             INT3 16                         ; Invalid argument count trap

(defun foo3 (x)
  (declare (type fixnum x))
  #.`(case x
       ,@(loop :for i :from 0 :by 3 :repeat 10
               :collect (list i i))
       (otherwise -1)))

; disassembly for FOO3
; Size: 154 bytes. Origin: #x52DF53CE                         ; FOO3
; 3CE:       498B5D10         MOV RBX, [R13+16]               ; thread.binding-stack-pointer
; 3D2:       48895DF8         MOV [RBP-8], RBX
; 3D6:       4885C0           TEST RAX, RAX
; 3D9:       0F8483000000     JEQ L9
; 3DF:       4883F806         CMP RAX, 6
; 3E3:       750B             JNE L1
; 3E5:       BA06000000       MOV EDX, 6
; 3EA: L0:   488BE5           MOV RSP, RBP
; 3ED:       F8               CLC
; 3EE:       5D               POP RBP
; 3EF:       C3               RET
; 3F0: L1:   4883F80C         CMP RAX, 12
; 3F4:       7507             JNE L2
; 3F6:       BA0C000000       MOV EDX, 12
; 3FB:       EBED             JMP L0
; 3FD: L2:   4883F812         CMP RAX, 18
; 401:       7507             JNE L3
; 403:       BA12000000       MOV EDX, 18
; 408:       EBE0             JMP L0
; 40A: L3:   4883F818         CMP RAX, 24
; 40E:       7507             JNE L4
; 410:       BA18000000       MOV EDX, 24
; 415:       EBD3             JMP L0
; 417: L4:   4883F81E         CMP RAX, 30
; 41B:       7507             JNE L5
; 41D:       BA1E000000       MOV EDX, 30
; 422:       EBC6             JMP L0
; 424: L5:   4883F824         CMP RAX, 36
; 428:       7507             JNE L6
; 42A:       BA24000000       MOV EDX, 36
; 42F:       EBB9             JMP L0
; 431: L6:   4883F82A         CMP RAX, 42
; 435:       7507             JNE L7
; 437:       BA2A000000       MOV EDX, 42
; 43C:       EBAC             JMP L0
; 43E: L7:   4883F830         CMP RAX, 48
; 442:       7507             JNE L8
; 444:       BA30000000       MOV EDX, 48
; 449:       EB9F             JMP L0
; 44B: L8:   4883F836         CMP RAX, 54
; 44F:       48C7C2FEFFFFFF   MOV RDX, -2
; 456:       41BB36000000     MOV R11D, 54
; 45C:       490F44D3         CMOVEQ RDX, R11
; 460:       EB88             JMP L0
; 462: L9:   31D2             XOR EDX, EDX
; 464:       EB84             JMP L0
; 466:       CC10             INT3 16                         ; Invalid argument count trap

まとめ

SBCLのcaseのジャンプテーブル化は、キーをそこそこ密に配置する必要がある様子。
ちなみに、caseの最適化と本記事では書いてきましたが、Clozure CLと同じく、コンパイラが最適化で実施するので、Lispのレベルではifの組み合わせが最適化のルールに合致していれば発動します。

SBCLには最近細かい最適化が入ってきていますが今後も地味に速くなって行きそうです。

関連記事


HTML generated by 3bmd in LispWorks 7.0.0

comments powered by Disqus