Groups > comp.sys.apple2.programmer > #378

Re: Fastest method to copy/process a range of bytes?

From	Egan Ford <datajerk@gmail.com>
Newsgroups	comp.sys.apple2.programmer
Subject	Re: Fastest method to copy/process a range of bytes?
Date	2012-08-13 11:18 -0600
Organization	XMission http://xmission.com/
Message-ID	<k0bcue$8ei$1@news.xmission.com> (permalink)
References	<jvs69d$qa1$1@news.xmission.com> <mdGdnWkVEqOuvb7NnZ2dnUVZ_vednZ2d@earthlink.com> <k08g78$tkd$1@news.xmission.com> <9sudnYzk1OuAH7XNnZ2dnUVZ_q2dnZ2d@earthlink.com>

Show all headers | View raw

On 8/12/12 10:27 PM, Anton Treuenfels wrote:
> Ah, so. What comes of coding off the top of my head. Alas I don't
> believe your fix works in all cases. Consider the case where length is
> zero; the INX is meant to account for this. Failing to do that will give
> you a whole lot of moves you don't want. Although I suppose if you can
> guarantee length will never be zero it wouldn't be a problem and your
> fix will work.

Length will never be zero.  The smallest it can be with this program is $2C.

On 8/12/12 10:27 PM, Anton Treuenfels wrote:
 > As mentioned earlier by MJM, you can also dispense with the X register
 > altogether if you're willing to simply INC and DEC 'length+1' directly.
 > It's only a 3-cycle penalty once per page

Yep, that is exactly what I did in cases where I needed x in an inner 
loop (see div16_mp below).

Here is an example of a reverse loop starting at arraylen-1 (arrayend is 
precomputed) and ending at ptr, unrolled 4x with Duff:

add_mp:
         sta     ptr             ; store ptr lo from A
         tya
         clc
         adc     arrayend+1      ; add number of pages since we have to
         sta     ptr+1           ;   go backwards for add/sub/asl

         lda     ptr_mp+1
         clc
         adc     arrayend+1      ; add number of pages since we have to
         sta     ptr_mp+1        ;   go backwards for add/sub/asl

         ldx     arrayend+1      ; full pages
         ldy     arrayend        ; partial
         clc
         tya
         and     #3
         cmp     #3
         beq     @3
         cmp     #2
         beq     @2
         cmp     #1
         beq     @1
         bne     @0
:       dex
         dec     ptr+1           ; previous page of 256
         dec     ptr_mp+1        ; previous page of 256
:       dey
@3:     lda     (ptr),y
         adc     (ptr_mp),y
         sta     (ptr),y
         dey
@2:     lda     (ptr),y
         adc     (ptr_mp),y
         sta     (ptr),y
         dey
@1:     lda     (ptr),y
         adc     (ptr_mp),y
         sta     (ptr),y
         dey
@0:     lda     (ptr),y
         adc     (ptr_mp),y
         sta     (ptr),y
         tya
         bne     :-
         txa
         bne     :--
         rts

My last challenge is the tya/bne code.  txa/bne can be fixed with inx 
before the loop.  tya is costing my 2 cycles.  The size of the array is 
~420 in my testing, so 840 cycles, but I call this ~1000x.  That is 
almost a second! :-)  But the 4x unroll reduces this to less than .25 sec.

I need to move ptr back one and y up one.  But the code is unpleasant to 
look at.  I could really use a "branch if #$FF".

However, my real challenge is the time taken in mp_div and mp_div16:

             calls  avg.cyl/call total time(s)
             -----  ------------ -------------
counter 19:     2          4063      0.007963  mp_set
counter 20:   931          6284      5.733878  mp_copy
counter 21:     4          8501      0.033321  mp_asl
counter 22:   463          8361      3.793781  mp_add
counter 23:   465          8467      3.858206  mp_sub
counter 24:  1396         39871     54.543258  mp_div
counter 25:   675         70268     46.479285  mp_div16


Anton, thanks again for the pointers.


(the aforementioned div16_mp):

.macro  div16_fast
         ldy     #16             ;index for 16 bits
:       asl     dividend+3      ;dividend/2, clear quotient bit
         rol     dividend+2
         rol     dividend+1
         rol     dividend+0
         sec
         lda     dividend+1      ;try subtracting divisor
         sbc     divisor+1
         tax
         lda     dividend+0
         sbc     divisor+0
         bcc     :+              ;too small, qbit=0
         stx     dividend+1      ;okay, store remainder
         sta     dividend+0
         inc     dividend+3      ;set quotient bit = 1
:       dey                     ;next step
         bne     :--
.endmacro

div16_mp:
         ;;setup and skip leading zeros
         sta     ptr             ; store ptr lo from A
         sty     ptr+1           ; store ptr hi from Y
         ldx     arraylen+1      ; full pages
         ldy     arraylen        ; partial pages
         beq     @0              ;   if not process full pages only
         inx                     ; bump up x +1 for the partial page
         clc                     ; adjust to point below source start
         tya                     ; - ex: ptr = $xx00, arraylen = 1
         adc     ptr             ; we want ptr = $(xx-1)01, y = $ff (255)
         sta     ptr
         bcs     :+
         dec     ptr+1
:       tya                     ; adjust y index value, y = -y
         eor     #$ff            ; one's complement
         tay
         iny                     ; two's complement
         lsr                     ; odd or even count, this is a form of
                                 ;     Duff's Device
         bcs     @0              ; b: even (checking A, not Y here)
         bcc     @1              ; b: odd (forced)
@2:     inc     ptr+1           ; next page
@0:     lda     (ptr),y         ; check even byte (0, 2, ..., 254)
         bne     @4
         iny
@1:     lda     (ptr),y         ; check odd byte (1, 3, ..., 255)
         bne     @4
         iny                     ; page done ?
         bne     @0              ; b: no
         dex                     ; full page left ?
         bne     @2              ; b:yes
         sec                     ; got here?  all zeros
         rts
@4:     ;; done skipping zeros, y and x set to continue
         tya                     ; y must be even
         lsr
         bcc     :+              ; y even
         dey
         cpy     #$FF
         bne     :+              ; y didn't roll back to FF
         dex
:       lda     #0
         sta     dividend        ; MSB
         sta     dividend+1      ; MSB-1
         stx     xreg            ; div16_fast/safe needs x reg
         beq     @6
@7:     inc     ptr+1           ; next page
@6:     lda     (ptr),y         ;
         sta     dividend+2
         sty     yreg
         iny
         lda     (ptr),y         ;
         sta     dividend+3
         div16_fast              ; need x and y reg
         ldy     yreg
         lda     quotient
         sta     (ptr),y
         iny
         lda     quotient+1
         sta     (ptr),y
         iny                     ; page done ?
         bne     @6
         dec     xreg            ; (dex) full page left ?
         bne     @7
         clc                     ; non zero result
         rts

Back to comp.sys.apple2.programmer | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Fastest method to copy/process a range of bytes? Egan Ford <datajerk@gmail.com> - 2012-08-07 16:53 -0600
  Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-07 18:57 -0700
    Re: Fastest method to copy/process a range of bytes? Daniel Kruszyna <dan@krue.net> - 2012-08-08 15:28 +0000
  Re: Fastest method to copy/process a range of bytes? aiiadict@gmail.com - 2012-08-07 22:07 -0700
  Re: Fastest method to copy/process a range of bytes? "Michael J. Mahon" <mjmahon@aol.com> - 2012-08-08 10:49 -0700
    Re: Fastest method to copy/process a range of bytes? "Michael J. Mahon" <mjmahon@aol.com> - 2012-08-08 10:59 -0700
  Re: Fastest method to copy/process a range of bytes? "Anton Treuenfels" <teamtempest@yahoo.com> - 2012-08-08 21:27 -0500
    Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-08 22:12 -0700
      Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-08 22:23 -0700
        Re: Fastest method to copy/process a range of bytes? "Anton Treuenfels" <teamtempest@yahoo.com> - 2012-08-09 18:35 -0500
          Re: Fastest method to copy/process a range of bytes? Jerry <awanderin@yahoo.ca> - 2012-08-11 01:03 -0600
            Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-11 11:33 -0700
    Re: Fastest method to copy/process a range of bytes?  mmphosis <mmphosis@macgui.com> - 2012-08-09 06:09 +0000
      Re: Fastest method to copy/process a range of bytes?  mmphosis <mmphosis@macgui.com> - 2012-08-09 09:40 +0000
      Re: Fastest method to copy/process a range of bytes? "Anton Treuenfels" <teamtempest@yahoo.com> - 2012-08-09 18:54 -0500
        Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-09 17:48 -0700
        Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-09 17:46 -0700
          Re: Fastest method to copy/process a range of bytes? Michael J. Mahon <mjmahon@aol.com> - 2012-08-10 15:25 -0500
            Re: Fastest method to copy/process a range of bytes? Antoine Vignau <antoine.vignau@laposte.net> - 2012-08-10 14:23 -0700
    Re: Fastest method to copy/process a range of bytes? Egan Ford <datajerk@gmail.com> - 2012-08-12 08:56 -0600
      Re: Fastest method to copy/process a range of bytes? "Anton Treuenfels" <teamtempest@yahoo.com> - 2012-08-12 23:27 -0500
        Re: Fastest method to copy/process a range of bytes? Egan Ford <datajerk@gmail.com> - 2012-08-13 11:18 -0600
  Re: Fastest method to copy/process a range of bytes? Egan Ford <datajerk@gmail.com> - 2012-08-11 15:58 -0600
    Re: Fastest method to copy/process a range of bytes? Egan Ford <datajerk@gmail.com> - 2012-08-11 16:16 -0600

csiph-web