September/October 1981

DTACK GROUNDED, The Journal of Simple 68000 Systems

Issue # 3 September/October 1981 Copyright Digital Acoustics, Inc

Can anyone identify the large integrated circuit on the right hand side of the circuit board below? Hint: it is NOT an Intel 4004!

We have encountered a fair amount of skepticism among Apple II and Pet/CBM users with regard to our ability to produce working hardware and software. This is prudent, considering the history of the personal computer industry.

We present herewith a photograph of one of the production prototypes for the Pet/CBM version of our board. We have completed the task of integrating our 68000 floating point package into the CBM BASIC ROM (a total of 27 bytes are changed in ROM $CXXX). On the next page, we provide timing information on calculations performed with and without the 68000.

Because of publishing leadtimes, by the time you read this we will very likely have production Pet/CBM compatible 68000 boards available for delivery off the shelf, and be testing the production prototype of the Apple II version of our board. The next (fourth) DTACK GROUNDED newsletter will feature a photograph of the Apple II production prototype. We trust that the photograph below and the timing data on the following pages will reduce your "skepticism index"!

A REAL, LIVE, WORKING SIMPLE 68000 ATTACHED PROCESSOR

CBM 8032/ 68000 TIMING DATA:

The data presented below was taken using two 2532 EPROMs. One replaces the 4K byte masked ROM in socket UD9. The other is in socket UD11, which is a user ROM at address $AXXX. The EPROM in UD9 is identical to the ROM it replaces with the exception of 27 bytes. These 27 bytes are 9 jump commands to the user ROM.

The user ROM has some utility routines which facilitate communication with the 68000, plus some other routines which allow the CBM 8032 to readily handle hexadecimal data, both for POKEing and PRINTing. Entry to these utilities is through jump tables at the start of the user ROM. Future modifications and additions to these utilities will not require the EPROM in UD9 to be changed.

The lower 1.4K bytes of the user ROM contains these utilities. The upper 1.5K of the user ROM contains the 68000 floating point code, assembled to run in the minimum configuration (4K byte) DTACK GROUNDED board. The following BASIC code is needed to initialize the 68000 and transfer the code from the user ROM into the 68000 RAM:

100 P=40960:SYSP:Q=P+102:F=36847
110 SYSQ:REM070010F4AA000600
120 POKEF,0

Exactly what this code does will be explained in detail later in this newsletter. For the time being, line 100 resets the 68000, line 110 transfers the 68000 floating point code from the user ROM to 68000 RAM and line 120 tells the CBM 8032 to let the 68000 do the floating point arithmetic. Once this code is executed, other BASIC programs can be loaded and run WITHOUT MODIFICATION; the arithmetic will be performed in the 68000, very quickly.
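The effect of line 120 can be sketched in modern terms. This minimal Python model is purely our own illustration (the real patch is nine JMPs from the $CXXX ROM into the user ROM, not anything resembling Python); the function names are invented, but the flag address is the one POKEd above (36847 = $8FEF, inside the 2K private RAM):

```python
# Illustrative sketch of the patched ROM's dispatch: one flag byte selects
# whether floating point work is done by the 6502 ROM routines or handed
# to the 68000 board. All names here are hypothetical.

FLAG_ADDR = 36847            # the address F POKEd in line 120 ($8FEF)
memory = {FLAG_ADDR: 1}      # nonzero after reset: the 8032 does its own math

def fp_add_68000(a, b):
    # stand-in for the real handshake: send 13 bytes, read 8 bytes back
    return a + b

def fp_add(a, b):
    if memory[FLAG_ADDR] == 0:
        return fp_add_68000(a, b)    # let the attached processor do it
    return a + b                     # fall back to the native ROM routine

memory[FLAG_ADDR] = 0                # the BASIC "POKE F,0" step
print(fp_add(2.5, 0.25))
```

Once the flag is cleared, every arithmetic call is routed to the coprocessor without the BASIC program knowing anything changed, which is why unmodified programs simply run faster.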

Apple II users should know that the CBM 8032 spends 15% of its time scanning the keyboard and updating a time variable; your times will be slightly faster.

TEST PROGRAMS                                        8032        68000
-------------                                        ----        -----
130 B=SQR(3):FORI=1TO5000:A=B+I:NEXTI              20.8 SEC    20.0 SEC
140 B=SQR(3):FORI=1TO5000:A=B*I:NEXTI              31.4 SEC    20.1 SEC
150 B=SQR(3):FORI=1TO5000:A=B/I:NEXTI              31.9 SEC    20.3 SEC
160 FORI=1TO5000:A=LOG(I):NEXTI                   130.4 SEC    23.5 SEC
170 B=SQR(3):C=1234:FORI=1TO5000:A=B*C/I+I:NEXTI   52.3 SEC    34.8 SEC
180 FORI=0TO5STEP.001:A=SIN(I):NEXTI              151.6 SEC    54.9 SEC

By examining the 68000 run time for the first three test programs, it is obvious that there is a certain amount of overhead associated with the BASIC interpreter independent of the mathematical process itself. Please accept our word for now that this overhead is on the very close order of 17.5 seconds for the lines 130, 140 and 150, slightly less for lines 160 and 180, but more for line 170.

This interpretive overhead is what compilers are designed to reduce. On the next page, the times will be restated based on removing this overhead. Although no BASIC compiler can remove ALL of the interpretive overhead, this will give you some idea of how effectively our board can (potentially) work with a compiler.

The following table separates the execution time of the interpretive portion of the BASIC test programs on the previous page from the time required to perform the F.P. (floating point) arithmetic. The first line of the table refers to the test program line #. Then comes the total execution time of the test program using the CBM 8032 only. The next two lines are the interpretive overhead and the F.P. computation time of the test program. The sum of these two equals the original CBM 8032 execution time. By subtracting the interpretive overhead we can determine the 8032/68000 F.P. computation time, which is given on the next line.

We then estimate that a compiler could execute the interpretive portion of the program 10 times faster. By adding this time to the 8032/68000 F.P. computation times, we can see the potential speed improvement which could be obtained by simply linking the compiler to a minimum configuration (4K) DTACK GROUNDED 68000 board.

The time required for the F.P. add at each "NEXT I" has been moved to the F.P. side of the ledger. Thus, line 130 actually involves 10,000 F.P. adds, not 5,000. All times are in seconds.

LINE #                     130   140   150    160   170    180
------                     ---   ---   ---    ---   ---    ---
original time:            20.8  31.4  31.9  130.4  52.3  151.6
interpretive overhead:    17.4  17.4  17.4   15.4  29.2      ?
8032 computation:          3.4  14.0  14.5  115.0  23.1      ?
8032 - 68K computation:    2.6   2.7   2.9    8.1   5.6      ?
compiler time (est):       1.7   1.7   1.7    1.5   2.9      ?
net compiler + 68K time:   4.3   4.4   4.6    9.6   8.5      ?
SPEED RATIO:               4.8   7.1   6.9   13.5   6.2      ?
------------               ---   ---   ---   ----   ---    ---
(estimated compiler + 68K vs. original CBM 8032)
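The derived rows of the table are pure arithmetic on the measured values, so they can be checked mechanically. This Python sketch reproduces them (nothing here is new data: just the subtraction, the assumed 10-to-1 compiler estimate, and the final ratio; results match the table to within rounding):

```python
# Reproduce the table's derived rows from its measured inputs.
# For each test line: 8032 computation = original - overhead;
# compiler time (est) = overhead / 10; net = compiler est + 68K computation;
# SPEED RATIO = original / net. Line 180's overhead was not measured ("?").
rows = {
    # line: (original time, interpretive overhead, measured 8032/68K comp.)
    130: (20.8, 17.4, 2.6),
    140: (31.4, 17.4, 2.7),
    150: (31.9, 17.4, 2.9),
    160: (130.4, 15.4, 8.1),
    170: (52.3, 29.2, 5.6),
}
for line, (orig, ovh, comp68k) in rows.items():
    comp8032 = orig - ovh          # F.P. work done by the 8032 alone
    compiler = ovh / 10            # assume a compiler cuts overhead 10x
    net = compiler + comp68k       # estimated compiler + 68K time
    print(line, round(comp8032, 1), round(compiler, 1),
          round(net, 1), round(orig / net, 1))
```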

By adding the estimated compiler time to the 8032 computation time we can get an estimated performance improvement when using the compiler WITHOUT the 68000:

LINE #                              130   140   150    160   170   180
------                              ---   ---   ---    ---   ---   ---
compiler + 8032 computation time:   5.1  15.7  16.2  116.5  26.0     ?
SPEED RATIO:                        4.1   2.0   2.0    1.1   2.0     ?
------------                        ---   ---   ---    ---   ---   ---
(estimated compiler without 68000)

Although the figures listed above are based on the ESTIMATED performance of a BASIC compiler (which supports floating point operations), it should be noted that the estimated speed improvement of 10% for the test program of line 160 is very close to the MEASURED speed improvement of 8% for an Apple II with the Hayden compiler as reported in issue #2 of this newsletter.

We will continue this discussion on the next page.

We would like to emphasize that the discussion on the previous page is based on MEASUREMENTS made on EXISTING 68000 hardware using EXISTING 68000 software. Estimated compiler performance was used partly because sufficient data was not available for a particular compiler and also because we wanted to discuss compiler performance in general, not that of one specific compiler. Although it is necessary to couple the compiler to the floating point operations, this is AUTOMATICALLY done for any compiler which calls the CBM 8032 BASIC ROM floating point routines.

POTENTIAL FURTHER IMPROVEMENTS:

The performance improvements reported up to now are based on existing software and a minimum configuration 68000 board. Let us assume that we have as much memory on the 68000 board as is needed to keep the F.P. variables in 68000 rather than 6502 memory space. Since our basic 68000 board has up to 92K bytes available, this is no problem (other than cost). However, we have to assume that the BASIC compiler has been modified to allow this (CAUTION: while we had our feet firmly planted on solid ground up to now, this last assumption takes us into blue sky country with attendant treacherous footing).

There is an important distinction between the "8032 - 68000 computation" time reported on the last page and the computation time of a 68000 alone. Consider the test program in line 130 two pages back. The 8032 - 68000 computation time is 2.6 seconds for 5000 loops. This involves 10,000 F.P. adds, as reported earlier. Each add involves transferring 13 bytes (two operands) TO the 68000 and receiving 8 bytes FROM the 68000 (an error status byte, sign, exponent, four mantissa bytes and the guard byte). That's 21 bytes to be transferred for each add. This data is transferred in 210 microseconds. Eliminating this data transfer time by keeping the variables in the 68000 would speed up the calculation time from 2.6 seconds to 0.5 seconds!
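The byte-count arithmetic in that paragraph can be laid out explicitly. This is just a check of the quoted figures, not a new measurement:

```python
# Check the data-transfer arithmetic: 13 bytes out plus 8 bytes back per
# add, with the quoted 210 microseconds per 21-byte transfer, accounts for
# most of the 2.6 second measured 8032/68000 computation time.
adds = 10_000                 # line 130 performs 10,000 F.P. adds
bytes_per_add = 13 + 8        # operands out; status/sign/exp/mantissa/guard back
transfer_per_add = 210e-6     # seconds, as quoted for the 21 bytes
transfer_total = adds * transfer_per_add
measured = 2.6                # seconds, measured 8032/68000 computation time
print(round(transfer_total, 3))            # time spent just moving bytes
print(round(measured - transfer_total, 3)) # time left for the arithmetic itself
```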

While the data transfer problem could not be completely eliminated, modifying the compiler to keep the data in the 68000 would very likely result in some of the non-arithmetic manipulations being performed in the 68000, with resulting speed benefit. Therefore, we feel justified in listing the performance of a MODIFIED compiler and a non-minimum configuration 68000 board:

LINE #                          130   140   150   160   170   180
------                          ---   ---   ---   ---   ---   ---
modified compiler + 68K time:   2.2   2.3   2.5   7.8   4.3     ?
SPEED RATIO:                    9.4  13.7  12.8  16.7  12.2     ?
------------                    ---  ----  ----  ----  ----   ---
(modified compiler with 68000)

HOW RELIABLE IS ALL THIS DATA? It's very reliable. Using an oscilloscope, we carefully measured the execution time IN THE 68000 of each of the test programs listed two pages back. This execution time, using our standard 68000 floating point package, included the data transfer time. Since all, repeat ALL, of the floating point operations were being performed in the 68000, the time left over is by definition the overhead of the BASIC interpreter not counting arithmetic computations.

Next, we modified the floating point routines one at a time so that the entrance and exit points were in a different RAM chip (in the 68000). This allowed us to determine the execution time of each routine by using an oscilloscope to look at the chip select lines. What was left over from the total 68000 execution time was the data transfer time.

On the next page we will present the results of these measurements.

Those who are familiar with floating point operations will know that the execution time is not constant. The add routine in particular is subject to variation based on the number of shifts required to align the two mantissas. Also, a subtract takes almost exactly the same time as the add. The times given here are averages based on the test programs listed previously, with spot checks made with "randomish" operands to see if there were any glitches in the timing (none were found). Since the multiply and divide routines make use of the hardware multiply and divide facilities in the 68000, there is very little variation in execution speed for these routines. Approximate CBM 8032 execution times are included for comparison.

OPERATION:       CBM 8032     WITH DATA XFR   WITHOUT XFR
----------       --------     -------------   -----------
F.P. ADD          340 uSEC        260 uSEC        50 uSEC
F.P. MULTIPLY    2460 uSEC        284 uSEC        85 uSEC
F.P. DIVIDE      2560 uSEC        325 uSEC       130 uSEC
LOGARITHM       22660 uSEC       1350 uSEC      1220 uSEC

The data above seems to indicate that the data transfer time for multiply and divide is slightly less than for the add. We have no explanation for this. The data transfer time for the log routine is ACTUALLY less since it is a monadic function (only one operand needs to be passed to the 68000 for computation).
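The implied transfer times being discussed fall out of the table by simple subtraction; a quick Python check, with the values copied from the table above:

```python
# Implied per-call data transfer time = (time with transfer) - (time without),
# in microseconds. This is what the surrounding paragraph is commenting on.
timings = {  # operation: (with data transfer, without transfer)
    "ADD":      (260, 50),
    "MULTIPLY": (284, 85),
    "DIVIDE":   (325, 130),
    "LOG":      (1350, 1220),   # monadic: only one operand is passed
}
for op, (with_xfr, without_xfr) in timings.items():
    print(op, with_xfr - without_xfr)
```

The subtraction shows ADD at 210 microseconds, the dyadic MULTIPLY and DIVIDE slightly under that, and the monadic LOG well under, consistent with the one-operand explanation given above.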

The above information will be of interest to dedicated number crunchers and real time programmers, but most of us are interested in performance at the system level. In other words, how much faster is my computer going to run?

If the program(s) you are primarily interested in are I/O limited rather than computationally limited, none of this will be useful to you. Read on if you are waiting 30 seconds between printouts with your data base management program!

Using the data on the preceding pages, we can now summarize the SYSTEM LEVEL performance improvements which can be obtained using our 68000 board without a compiler, a compiler without our board, and a compiler with our board. Finally, we can ESTIMATE the improvement available when the variables disappear inside the 68000 memory space.

The numbers in the next table are not times but are rather SYSTEM SPEED RATIOS referenced to standard interpretive BASIC:

LINE #                         130   140   150   160   170   180
------                         ---   ---   ---   ---   ---   ---
CBM 8032                       1.0   1.0   1.0   1.0   1.0   1.0
with compiler, no 68K          4.1   2.0   2.0   1.1   2.0     ?
with 68K board, no compiler    1.0   1.6   1.6   5.6   1.5   2.8
compiler with 68K board        4.8   7.1   6.9  13.5   6.2     ?
modified compiler w/68K        9.4  13.7  12.8  16.7  12.2     ?

We believe the data presented above proves the point discussed in detail in newsletter #2: a BASIC compiler works best with a high speed math processor, and a high speed math processor works best with a compiler. The 68000 can do much more than simply perform high speed arithmetic, of course. But RIGHT NOW, that's the only software we have to offer. But we DO have this software (plus Pet/CBM production hardware by the time you read this) available NOW. Apple II hardware and software will follow about a month behind.

Let us restate the argument presented in our last newsletter: an "average" BASIC program spends half of its time performing floating point arithmetic and half of the time interpreting the program (we include under "interpreting" everything that does not include F.P. arithmetic). A BASIC interpreter will run only TWICE as fast if it is fitted with a device which will perform the arithmetic in zero time. Similarly, a BASIC compiler will run only TWICE as fast if the compiler completely eliminates the interpretive overhead. Only when we combine the compiler with high speed arithmetic do we achieve very large, significant speed advantages.
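This is the classic fixed-work speedup argument, and it can be stated as a one-line formula. In this sketch, the 50/50 split is the "average" program assumption from the text, while the 10x and 1000x gains are illustrative numbers of our own choosing:

```python
# Fixed-work speedup: new run time is the sum of the shrunken parts, so
# overall speedup = 1 / (interp_frac/interp_gain + fp_frac/fp_gain).
def speedup(interp_frac, fp_frac, interp_gain, fp_gain):
    return 1.0 / (interp_frac / interp_gain + fp_frac / fp_gain)

# "average" program: half interpreting, half floating point arithmetic
print(speedup(0.5, 0.5, 1, 1000))   # fast math alone: capped just under 2x
print(speedup(0.5, 0.5, 10, 1))     # compiler alone: also under 2x
print(speedup(0.5, 0.5, 10, 1000))  # both together: the large speedup
```

Either improvement alone is limited by the untouched half; only attacking both halves at once escapes the 2-to-1 ceiling, which is exactly the point of the table above.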

In the table at the bottom of the last page, lines 140 and 150 are fairly "average" programs. Note that neither the compiler nor the 68000 board alone can produce more than a 2-1 speed improvement. But the two combined in a simple way produce a 7-1 speed advantage. Now, that is a SIGNIFICANT improvement. Modifying the compiler to keep the variables in the 68000 memory space (which has not yet been done) nearly doubles that to about a 13-1 speed advantage.

Line 130 in that table represents a simple application with relatively little number crunching. The compiler speedup is good, the 68000 speedup almost nil. Line 160 represents heavy number crunching, the 68000 speedup is very substantial and the compiler advantage nil. Still, the combined compiler and 68000 produce a substantial speedup for both types of programs. The plain fact is that a compiler-68000 combination is ideally suited to ALL classes of BASIC programs.

The information presented in the preceding five and a half pages is completely true, UTTERLY FALSE, and completely true again. We concede this might be slightly confusing, so please let us explain:

All of the system timing data up to now has been taken using very short programs, typically two to four lines each. The test programs on page two were run one at a time, for instance. However, most of you already know that longer programs seem to run more slowly. We will now examine this aspect of system timing in detail, and discuss the two methods of COMPLETELY ELIMINATING this problem (like we said; true, false, then true again).

Consider the following program:

100 A=0:B=1:I=0:GOSUB120:PRINT"START"
110 FORI=1TO1000:A=A+B:NEXT:PRINT"DONE":END
120 A0=0:A1=0:.....A9=0
370 Z0=0:Z1=0:.....Z9=0
380 X=0:Y=1:RETURN

The above listing is incomplete, but we think you can figure out what lines 130 thru 360 should be. When this program is run on our CBM 8032, the time required is 3.5 seconds between "START" and "DONE". But if we change line 110 as follows:

110 FORI=1TO1000:X=X+Y:NEXT:PRINT"DONE":END

The program now takes over 29 seconds to run! That's more than 8 times slower!! Now consider another program:

100 FORI=1TO1000:GOSUB101:NEXT:PRINT"DONE":END
101 RETURN
102 PRINT
599 PRINT
600 RETURN

This listing is also incomplete; lines 103 thru 598 should also be "XXX PRINT". On the next page we will continue this discussion.

This program runs in 2.85 seconds. But if we change the "GOSUB101" in line 100 to "GOSUB600" the program requires about 31 seconds to run, over TEN times slower!!

We have now identified two fairly simple programs which can be shown to run about ONE ORDER OF MAGNITUDE more slowly in (artificially constructed) large programs. Since DTACK GROUNDED is dedicated to running FASTER rather than slower, let's figure out exactly what is happening and how to fix the problem!

The first program ran at two drastically differing speeds because the first version referenced variables A and B, which were at the start of the variable table. When we substituted X and Y, they were about the 263rd and 264th variables in the variable table. When Microsoft BASIC finds a variable name it searches the variable table, starting at the front of the table for a match to the name of the variable. The search finds A at the front of the table much more quickly than X at the end of the table!

In fact, we can quantify the search time using this program: there are three variable references in each of the 1000 loops, or 3000 references all told. The program runs 26 seconds slower when searching about 260 additional positions in the table 3000 times. Therefore, each additional variable in the variable table requires about 33.3 microseconds on average in the CBM 8032. The Apple II will be about 15% faster.
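The search behavior just described can be modeled in a few lines. This is an illustrative simulation of the idea, not Commodore's actual 6502 code; it only counts table entries examined:

```python
# Model Microsoft BASIC's linear variable-table search: every reference
# scans the table from the front until the name matches.
def lookup(table, name):
    steps = 0
    for entry in table:          # linear scan, like the interpreter's
        steps += 1
        if entry == name:
            return steps         # number of entries examined
    raise KeyError(name)

# variable table in creation order: A, B, I, then A0..Z9, then X, Y
table = ["A", "B", "I"]
table += [c + str(d) for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" for d in range(10)]
table += ["X", "Y"]

print(lookup(table, "A"))  # found at the front of the table
print(lookup(table, "X"))  # found only after scanning 263 other entries
```

Every single reference to X or Y pays that full scan, 3000 times over in the test loop, which is where the missing 26 seconds go.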

The second example program ran 31 seconds more slowly when calling a subroutine to the 501st line of the program than when calling a subroutine to the second line of the program. The reason is that Microsoft BASIC, when interpreting "GOSUB XXX", where XXX is the line number, starts at the beginning of the program and looks at the number of the first line. If that number doesn't match XXX it checks the next line, etc. (because of the way Microsoft BASIC links the lines, it does not matter what the length or content of the various lines is; the example given is therefore perfectly general).

Once again, we can readily determine that each line number to be searched requires about 62 microseconds.
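The same kind of counting model applies to the GOSUB search (again, this is our illustration, not the interpreter's actual code):

```python
# Model the GOSUB line-number search: the interpreter walks the linked
# list of lines from the top of the program until the target is found.
def find_line(line_numbers, target):
    for lines_scanned, number in enumerate(line_numbers, start=1):
        if number == target:
            return lines_scanned
    raise ValueError(target)

# the example program: lines 100, 101, 102 ... 599, 600
program = [100, 101] + list(range(102, 601))
print(find_line(program, 101))  # GOSUB101: 2 lines scanned
print(find_line(program, 600))  # GOSUB600: 501 lines scanned
```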

Folks, we have just determined why programs run more slowly as they get longer. Longer programs have more variables, so that it takes more time to look up the average variable. GOTOs or GOSUBs to the later lines of a long program require more time to locate the line number.

In fact, a VERY LONG program probably spends 90% of the time just performing these two search functions. This means that it runs 10 TIMES SLOWER than the sum of each of its parts! Now, how will the DTACK GROUNDED board lick this problem? It won't. The problem is not a hardware problem and the fix has nothing to do with hardware.

The problem is related to the implementation of the Microsoft, and some other, BASIC interpreters. As many of you know, Bill Gates wrote the BASIC interpreter for the original ALTAIR microcomputer, the one that started this whole mess. Now, the ALTAIR originally came with pretty small memory configurations (remember those 1K, repeat 1K, memory boards?). The problem that we have just identified occurs in large programs ONLY. There were no large programs (in those days, on the ALTAIR) because of the typically small, by today's standards, memory configurations. It can be fairly stated that this problem did not exist when this (Microsoft) BASIC interpreter was originally written.

It was in those early days of personal computing that a very funny (as in peculiar) thing happened:

Now, you probably aren't going to believe this, but remember, this happened a long time ago: a few people began to steal the BASIC interpreter that Bill Gates wrote for the ALTAIR. Bill got upset. MORE people stole his interpreter. Bill wrote an angry letter to the editor of a personal computer publication! Even more yet of those dirty ratfinks stole Bill's interpreter. Bill denounced these depraved criminals!!

In fact, so many people stole Bill's interpreter that it became the de facto industry standard and led to Microsoft's preeminent position in the software industry, making Bill Gates a wealthy man in the process. If there weren't so many dishonest personal computer users around, Bill would probably have a very nice job today doing applications programming for some software house (and have time to return his phone calls).

But as memory prices dropped and BASIC programs grew larger, Bill's BASIC interpreter got slower and slower. You may ask, don't ALL interpreters have this problem? The answer is, NO!

Let us introduce a company known familiarly as HP which once built very nice audio oscillators but now builds computers. HP originally got into the computer game at the large minicomputer level, where large memory configurations dominated. Therefore, HP's interpretive BASIC works very differently.

When HP BASIC first encounters a variable name, it assigns the variable name to the variable table, just like Microsoft. But it then replaces the variable NAME with the variable ADDRESS, right there in the BASIC program. When it first encounters a line number, the line number is replaced with the address of the line number. In other words, HP BASIC semicompiles itself at run time. When the program stops running, the program UNsemicompiles itself, so that the program can be listed or edited.
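The semicompiling trick can be sketched abstractly. This toy Python model is our own illustration of the idea described above (HP's actual internal representation is not documented here): on first use, a variable NAME in the program text is replaced by its table slot, so every later reference is a direct fetch instead of a search.

```python
# Toy model of run-time "semicompiling": names resolve once, then the
# program text carries addresses (here, integer slots) instead of names.
table = []    # variable table: names, in order of first reference
values = []   # storage, indexed by slot

def ref(token):
    """Resolve a program token: a str is a name, an int is already a slot."""
    if isinstance(token, int):
        return token                 # semicompiled: no search needed
    if token in table:
        slot = table.index(token)    # one-time linear search
    else:
        table.append(token)          # first encounter: create the variable
        values.append(0.0)
        slot = len(table) - 1
    return slot                      # caller writes this back into the program

program_token = "X"                  # first pass: the program holds a name
program_token = ref(program_token)   # "semicompile" the reference in place
print(ref(program_token))            # later passes skip the search entirely
```

UNsemicompiling for LIST or edit would simply walk the program and put the names back from the table, which is why the trick stays invisible to the user.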

As a result, very large programs written in HP BASIC run just as fast, proportionally, as small programs. Although this technique was originally developed on some fairly sizeable minicomputers, the technique was carried over into their 9830 and later small BASIC language machines. That's a good idea, you say? Well, large companies like HP occasionally do things correctly.

Microsoft now has so many versions of BASIC available that this "large program" problem may not exist on all versions. However, it is a severe problem on currently available Pet/CBM and Apple II machines.

The most obvious fix for this problem is the one which is currently available: the (various) BASIC compilers. Compiled BASIC does not have variable names, it has variable addresses. It does not have line numbers, it has addresses. In other words, compiled BASIC is in some respects very similar to HP BASIC. Most of the advertised speed increase available from compilers is due to this "fix" for the large program problem. BASIC compilers do not otherwise result in any large increase in speed - UNLESS coupled with a fast arithmetic processor.

In the July '81 issue of PRINTOUT (a British publication specializing in the Pet) and the July/August '81 issue of PEELINGS II there appeared reviews of several compilers. Both publications expressed puzzlement over the large discrepancy between the advertised performance claims and the results of their tests ... which involved very short programs.

We poked a little fun at HAYDEN over a short benchmark in our last issue, too. We still think that HAYDEN is remiss in advertising an unqualified 3 times (minimum!) speed improvement. It cannot deliver that performance for a very significant number of programs. We have at hand a copy of the Osborne/McGraw-Hill "Some Common Basic Programs". It is doubtful that the HAYDEN compiler would speed up even ONE of the 76 programs in this book by a factor of three.

Now, let us refer back to the table at the bottom of page 5: the reference speed of 1.0 was for the CBM 8032 WITHOUT the "large program problem". If you have a very large BASIC program which runs slowly, the combination of a compiler and the 68000 board can result in your very large program running not 7 times faster, but (up to) 70 times faster!

Also, the combination of a compiler and our 68000 board would result in about a 7 times speed improvement for most, but not all, of the programs in "Some Common Basic Programs".

We should be able to provide more specific information in future issues because we have already been contacted by a software house which markets a BASIC compiler about cooperating to get their compiler working with our board. Why would a software house be interested in this, when there are no boards at all out there at this time? Here are some reasons:

- There is a lot of interest in our boards. A lot of people, including some software houses, perceive that our program will be successful ... ESPECIALLY after the Apple Computer Co. introduces its $10,000 68000 machine.
- Beginning to work with the 68000 is a smart move at this time for a software house. Working with our board and with our already developed math package is a relatively painless way to start.
- It is obvious to any competent compiler writer (is that a redundancy?) that our assertions about the mutual need of the compiler and the math processor are essentially correct. Almost any compiler writer is by definition interested in running FAST!
- With all of the interest in the 68000, which will dramatically increase when you-know-what happens, it makes sense for a vendor of any serious software, including compilers, to be able to advertise that their product is ALREADY "68000 compatible", resulting in such-and-such a performance improvement when a 68000 supercharger is available for the Pet or Apple (not word one about DTACK GROUNDED, in case you didn't notice).

We said that a compiler is ONE solution which is currently available to solve the "large program problem". Another solution (which is NOT currently available) is to modify the Pet or Apple BASIC interpreter to work in the same manner as HP's. We think that this is do-able. We certainly intend to spend some time looking at this once the Apple II version of our board is finished (that is not the same as promising to do the job; HP may employ better programmers than we do).

In the last issue of this newsletter, we said that there were certain (relatively simple) hardware modifications that had to be performed to integrate our hardware and software with the CBM 8032 BASIC operating system. Since then we have looked fairly carefully at the Apple II, and discovered that no hardware modifications would be necessary if a device called a "language card" or its equivalent were available. This is a RAM card which overlays the memory area normally occupied by the BASIC ROM. Once the Applesoft BASIC interpreter is loaded into RAM, the "hooks" to the 68000 are trivially simple to add.

In our first newsletter we admitted that most of what we called the "first round" buyers of our board would be hardware hackers or would know a hardware hacker who could help them get started. If there were a Pet/CBM equivalent to the "language card", the installation of our 68000 system and its integration into the BASIC operating system would be greatly simplified. Since nobody else is making such a card (as far as we know), we will make one ourselves.

SOFTWARE STATUS REPORT: An improved version of the mini-monitor ROM has been completed and integrated into the production prototype 68000 board. The 6502 utility code has also been modified; see page two.

All four floating point functions have been completely integrated into the CBM BASIC ROM with an option: depending on the contents of ONE byte in the 2K private RAM ($8800-$8FF7), the math is optionally performed by either the 8032 or the 68000. Upon reset, the 8032 is the default number cruncher.

Once the 68000 has been loaded and the flag byte cleared, other BASIC programs can be loaded and run without any modification; the use of the 68000 to perform the math is completely transparent to the user and to the BASIC program.

HARDWARE STATUS REPORT: The production prototypes of the Pet/CBM versions are completely operational. The first full production versions (with solder mask and screen) have been ordered. Modification of the main board layout to the Apple II configuration is under way.

Since the 68000 board with minimum memory pulls about 0.7 Amperes, small memory configuration Apple II boards will draw power from the Apple II. Although this is within specification for the slot power, we have been told that some Apple II peripherals (cough!) pull more than their share. Therefore, provision will be made to disconnect the 68000 board from the Apple power bus so that a separate 5 volt regulated supply can be used. This will be NECESSARY for large memory configurations. A "language card" or equivalent MUST be available.

APPLE II VERSION PRICING: For small memory configuration versions, add $50 to the price of the equivalent Pet/CBM version as listed in the last newsletter. We arrived at this as follows: the Apple II version requires an extra parallel interface board; add $70. But the Apple doesn't need the private 2K RAM, subtract $20. Large memory configurations will require the aforementioned regulated supply, which is NOT included in the price (we'd rather not build it, frankly). The warranty and terms of sale are the same, of course.

TRUE CONFESSIONS DEPT: The first issue of this newsletter was completed on 25 July. We put a July date on it because we thought the first few issues of this newsletter could be produced at three week intervals. HA! So this is the Sept/Oct issue (our subscription is on a per-issue basis, anyhow).

Also, it appears that the explanation of our improved mini-monitor isn't going to make it in this issue, as was promised on page 2. You see, this is a two postage stamp 1st class newsletter. That's 7 pieces of paper, printed both sides, in a large envelope.

ACKNOWLEDGEMENTS: You probably already know that Apple and Apple II are trademarks of the Apple Computer Co. and that Pet and CBM are trademarks of Commodore Business Machines.

SUBSCRIPTIONS: $15/6 issues U.S. and Canada, $25 U.K. Payment should be made to DTACK GROUNDED. The subscription will start with the first issue unless otherwise specified. The address is:

DTACK GROUNDED

1415 E. McFadden, St. F

SANTA ANA CA 92705

If you received this newsletter as a free sample or via photocopy machine discount, this is the end. For subscribers, the next four pages continue the discussion of the use of the 16 bit hardware divide AND MULTIPLY to do extended division which started in the last issue. Toodles!!

WELCOME TO REDLANDS!

This issue continues the discussion of division. Having listed the code for a traditional 'shift and maybe subtract' algorithm for a 64 bit divide routine (result good to 62 or 63 bits), we have a means of checking the accuracy of our resulting code. The same algorithm could be used for the 32 bit divide needed for a Microsoft 9 digit compatible routine, but it is significantly slower than can be achieved using the hardware divide.

The Microsoft 8 digit floating point divide routine works like this: FPACC#1 is first rounded (the most significant bit of the guard byte is added to the mantissa) and then divided INTO FPACC#2. Therefore, we have no guard byte going INTO the divide routine, but we do have one when we EXIT the divide routine. For this reason, we want the accuracy of the result to be GREATER than 32 bits.

There are two possible errors which can occur upon dividing: there can be an attempt to divide by zero, or there can be an overflow (a result with a larger than permissible exponent). In either case, the execution of the BASIC program should stop with the appropriate error message displayed. When using the 68000, an error status byte is returned to the host. This byte is normally zero. If an error occurs, the host jumps to the appropriate error message as determined by the handshake software. Therefore, selection of the appropriate error code can be made arbitrarily by the programmer of the host/68000 handshake software. We use #1 to indicate overflow, #2 to indicate an attempted divide by zero.

Underflow can also be a result of the computation (resulting exponent too small) but this is treated as a zero result, NOT as an error.

The first code on page 14 is the error and zero result exit points for the divide routine. FPDIV and FPDIV1 are the two entrance points depending on whether FPACC#2 already contains an operand or not. Beginning at FPDIV1, we fetch the 32 bit mantissa #1 (MANT1) plus the two exponents to the data registers for easy manipulation.

In Microsoft 9 digit BASIC, a zero F. P. number is indicated by a zero exponent. If the exponent for FPACC#1 (X1) is zero, we have a 'divide by zero' error. If the exponent for FPACC#2 is zero, the result is zero (ANYTHING other than zero, when divided into zero, results in zero). If neither exponent is zero, which is usually the case, we proceed. Notice that we cleared D4 and D6, 16 bits, so that the byte loads of the exponents leave bits B8 thru B15 of D4 and D6 still zeroed. Finally, we move the 32 bit (.L= 32 bits) mantissa #1 into the 32 bit data register D0. All of FPACC#1 except the guard byte and sign is now contained in the data registers.

Our next operation is to round FPACC#1. We shift the most significant bit of guard byte G1 into the carry. If the carry is zero, we skip to FPDIV2. Otherwise, we 'add quick' .L #1 to D0. If the result of this operation does not generate a carry, we skip to FPDIV2, and continue.

Only if MANT1 was $FFFFFFFF before adding one will a carry result from the add. In this case, D0 will contain $00000000, which is a decidedly un-normalized mantissa. So, we rotate the carry back into the most significant bit of D0, making it equal $80000000, and increment the exponent. If the 8 bit exponent wraps to zero as a result of this increment, FPACC#1 has overflowed its exponent range, and the result of the divide is taken as zero (the BEQ DRZER which follows the increment in the listing).
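The rounding step, including the $FFFFFFFF corner case, can be sketched like so (Python, hypothetical names — a model of the logic, not the 68000 code):

```python
def round_mantissa(mant, guard, exp):
    """Add the guard byte's MSB to the 32-bit mantissa; if the mantissa
    wraps past $FFFFFFFF, rotate the carry back into the top bit and
    increment the exponent."""
    if guard & 0x80:                 # MSB of guard byte -> carry set
        mant += 1
        if mant > 0xFFFFFFFF:        # only $FFFFFFFF + 1 carries out
            mant = 0x80000000        # un-normalized $00000000 -> $80000000
            exp += 1
    return mant, exp
```

For example, round_mantissa(0xFFFFFFFF, 0x80, 0x40) comes back with the mantissa re-normalized to $80000000 and the exponent bumped to $41.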

Eventually we arrive at FPDIV2 by whatever route. Now we calculate the exponent of the result, but again we do not test for overflow or underflow yet; the exponent may be modified as a result of the integer 32 bit divide. Then we calculate the sign of the result, and we are ready, finally, to start the integer divide.
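In this excess-128 format the bookkeeping is one subtraction, one bias correction and one exclusive-OR; a sketch (hypothetical Python, mirroring the SUB/ADD/EOR sequence in the listing):

```python
def result_exp_and_sign(x1, x2, s1, s2):
    """Quotient exponent and sign, BEFORE any overflow/underflow test:
    subtract the exponents, restore the excess-128 bias, XOR the signs.
    x1/s1 belong to the divisor, x2/s2 to the dividend."""
    return (x2 - x1 + 128), (s1 ^ s2)
```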

We begin by clearing D5, which is going to serve as a 'flag' byte for us. Then we move the 32 bit mantissa #2 (M2) to D7. Since both M1 and M2 lie in the range 0.5 to 0.999... the result of the divide can range from 0.5 to almost 2. M2 is the numerator, M1 the denominator. If M2 is greater than M1, the result will be greater than one. If M2 is equal to M1, the result is one, exactly. Next we compare D0 (M1) to D7 (M2). If the Z flag is set, the two mantissas are equal and we take the BEQ branch to the code at 'ONE'. If the carry bit is one, we take the branch to 'D0GTD7'.

If neither of these branches is taken, M2 is greater than M1 and the result will be greater than 1. In this case, we add #3 to our flag byte D5 and subtract D0 from D7 (we subtract the denominator from the numerator). Since both mantissas are normalized, M2 is less than twice M1, so a single subtraction guarantees that the numerator is now LESS than the denominator. That's a good thing, because we have arrived at the label 'D0GTD7'. D7 is now an unnormalized mantissa, but that is O.K. for the succeeding operations.

Assuming the mantissas were not equal (the branch to 'ONE' was not taken) we now have what is needed to proceed further with the integer divide: a numerator which is smaller than the denominator.
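Sketched in Python (a hypothetical helper), the pre-conditioning looks like this; because both mantissas are normalized (at least 0.5), one subtraction is always enough:

```python
def precondition(m1, m2):
    """m1 = denominator mantissa, m2 = numerator mantissa, both with the
    top bit set. Returns (flag, numerator) with numerator < m1, or None
    when the mantissas are equal (the result is exactly one)."""
    if m2 == m1:
        return None                  # branch to 'ONE'
    flag = 0
    if m2 > m1:                      # the quotient would be >= 1
        flag = 3                     # note it in the flag byte ...
        m2 -= m1                     # ... and subtract the denominator
    return flag, m2                  # m2 < m1 is now guaranteed
```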

(At this point it is important to review the 'INTRODUCTION TO THE 68000 HARDWARE DIVIDE' at the bottom of page 14 of newsletter #2, or some of what follows won't make sense.)

Now we temporarily store D5 in A1 and call the subroutine 'DIVSUB', whose name is not overly original. This subroutine is printed on page 18, and divides the most significant 16 bits of the denominator into a 32 bit numerator. When we first call this subroutine, we can be SURE that the 16 most significant bits of the numerator are less than the 16 most significant bits of the denominator, right? WRONG!! Consider these operands:

numerator   = $8000 0000
denominator = $8000 FFFF
result      = $FFFE 0005 ; remainder = $FFF0 (1st 16 bits)

The condition that the numerator is smaller than the denominator is met, but an overflow is going to be generated when the upper 16 bits of the denominator are divided into the 32 bit numerator. In other words, the hardware divide will not divide at all, but will set the V flag, indicating "oops!"

The subroutine begins by placing a copy of D7 (which is M2 the first time the subroutine is called) in D6. Then we copy the upper 16 bits of D0 (M1) into the lower 16 bits of D1, leaving the upper 16 bits of D1 zero (MOVE .L D0, D1; CLR .W D1; SWAP D1). Now, at long last, we invoke the hardware divide command DIVU. If the V flag is clear after this instruction, the division took place and we branch to 'DIVOK'. If the V flag is set, the instruction tells us that the result is one or greater. However, our previous code assured us that the result will be LESS than one for the complete 32 bit divide. Therefore, we know for a fact that the result of the 16 bit divide is misleading us; the highest possible result is $FFFF. So, we move $FFFF into the lower 16 bits of D7 as a trial result. The remainder in the upper 16 bits of D7 is left alone, since all we have is a trial result anyway.

Incidentally, this trial result is STILL too large for the example operands listed above; the most significant 16 bits will in fact be $FFFE. We should mention here that the trial result may not be accurate even if the divide executes properly. The trial result will never be too low, but it may be as much as two bits too high.
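Both behaviors — the forced $FFFF on overflow, and the up-to-two-bit overestimate — are easy to demonstrate (Python sketch, hypothetical helper name):

```python
def trial_quotient(num32, den_hi16):
    """First 16-bit estimate: divide the denominator's upper 16 bits
    into the 32-bit numerator, substituting $FFFF when the hardware
    divide would overflow (V flag set)."""
    q = num32 // den_hi16
    return q if q <= 0xFFFF else 0xFFFF

# $8000 into $8000 0000 overflows -> forced to $FFFF (true bits: $FFFE)
# $8001 into $8000 0000 gives $FFFE (true first bits: $FFFC)
```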

For example, consider these operands:

numerator      = $8000 0000
denominator    = $8001 FFFF
correct result = $FFFC 0011 ; remainder = $FFB0 (1st 16 bits)

The result of the first hardware divide is $FFFE with a remainder of $0002. Even though the hardware divide ($8001 into $8000 0000) operated correctly, the most significant 16 bits do not accurately represent the eventual correct result. However, we ARE within two least bits (at most about .006%) of the correct result. Since we KNOW that the result of the hardware divide is potentially wrong, we need a strategy to check and correct it. We do this by multiplying the 16 bit result (or trial result) times the 32 bit denominator:

$FFFE times $8001 FFFF equals $8000 FFFB 0002

We do this by moving the lower 16 bits of D7 to D2 and D3, then multiplying D0 times D2 ($FFFF times $FFFE) and D1 times D7 ($8001 times $FFFE). This is a 16 bit times 32 bit multiply with a 48 bit result. We store the least 16 bits in D5 and then clear the lower half of D2 by CLR .W D2. The upper 16 bits are left intact. Then we swap the upper and lower half of D2 and add all 32 bits to D7. The result in D7 is $8000 FFFB. The upper 32 bits of the 48 bit result are in D7, the lower 16 bits in D5.
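The 16 bit times 32 bit multiply with its 48 bit result can be sketched with two 16 x 16 multiplies, just as the MULU pair does it (Python, hypothetical names):

```python
def mul_16x32(q16, den32):
    """16 x 32 multiply via two 16 x 16 MULUs; returns the 48-bit
    product as (upper 32 bits, lower 16 bits)."""
    lo = q16 * (den32 & 0xFFFF)      # MULU with the low denominator word
    hi = q16 * (den32 >> 16)         # MULU with the high denominator word
    low16 = lo & 0xFFFF              # the saved least 16 bits
    upper32 = (hi + (lo >> 16)) & 0xFFFFFFFF
    return upper32, low16
```

For these operands, mul_16x32(0xFFFE, 0x8001FFFF) returns ($8000 FFFB, $0002), the 48 bit product $8000 FFFB 0002.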

Now we subtract the 48 bit result from the 32 bit numerator. First we 'extend' the numerator to 48 bits by clearing D2 again. Now we subtract D5 from D2 and then subtract D7 from D6 with extend (SUB .W D5, D2; SUBX .L D7, D6). For our example, the result will be $FFFF 0004 FFFE with the N flag set. We do not take the branch at 'BPL SHIFTW' and instead proceed with the code at label 'NEG'. We subtract one from our trial result, making it $FFFD.

Next we add the denominator, shifted 16 bits to the right, back to the (48 bit) numerator. We do this by adding the lower 16 bits of D0 to D2 and adding with extend D1 (the upper 16 bits of the denominator) to D6. The first time we do this we get $FFFF 8006 FFFD with the N flag set. So we take the backward branch at 'BMI NEG' and repeat. The trial result is now $FFFC, which is the correct first 16 bits of the result. This time we get the 48 bit numerator $0000 0008 FFFC with the N flag clear.
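The whole correct-the-trial-result dance reduces to: subtract trial times denominator from the (shifted) numerator, then, while the difference is negative, back the trial off by one and add the denominator back. A sketch with Python's big integers standing in for the 48 bit register pairs (hypothetical helper):

```python
def correct_trial(trial, num32, den32):
    """Returns the exact 16-bit partial quotient and the remainder,
    starting from a trial result that may be up to two too high."""
    rem = (num32 << 16) - trial * den32   # 48-bit signed difference
    while rem < 0:                        # the 'BMI NEG' loop
        trial -= 1                        # SUBQ.W #1,D3
        rem += den32                      # add the denominator back
    return trial, rem
```

For the example operands, correct_trial(0xFFFE, 0x80000000, 0x8001FFFF) lands on $FFFC with remainder $0008 FFFC.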

Then we swap D6, making it $0008 0000 and then move .W D2 to D6, making it $0008 FFFC. This has the effect of shifting D2 and D6 left 16 bit positions. Now we return to the code at $1404 and swap D3. This places the most significant 16 bits of the result in the upper half of D3. We move D6 to D7 to serve as the next 32 bit numerator and call subroutine 'DIVSUB' again. This generates the next 16 bits of the result in the lower half of D3, so that D3 contains the most significant 32 bits of the result. Then we divide D1 into D6 to determine the guard byte. The result is $FFFC 0011 FF(B2).
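Putting the two DIVSUB passes together (Python sketch, hypothetical names; the guard byte generation is omitted) reproduces the 32 most significant bits of the quotient:

```python
def divide_mantissas(m1, m2):
    """Two 16-bits-at-a-time passes: m2/m1 with m2 < m1, both 32-bit
    mantissas. Returns the 32-bit quotient plus the final remainder."""
    den_hi = m1 >> 16
    result, num = 0, m2
    for _ in range(2):
        trial = min(num // den_hi, 0xFFFF)   # DIVU, clamped on overflow
        rem = (num << 16) - trial * m1       # 48-bit difference
        while rem < 0:                       # correction loop
            trial -= 1
            rem += m1
        result = (result << 16) | trial
        num = rem                            # remainder becomes numerator
    return result, num
```

divide_mantissas(0x8001FFFF, 0x80000000) yields $FFFC 0011, matching the worked example.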

The (B2) is in parentheses because Microsoft BASIC only uses a guard BYTE, $FF in this case. Note, however, that there is a two bit error in the guard WORD compared to the correct result as determined by the 64 bit divide routine published in last issue's REDLANDS. Since this corresponds to 2/65536 of one least bit of the mantissa, we can ignore it. You DID know that there is in general no such thing as a completely accurate result of a floating point computation, didn't you? What we try to do is reduce the errors as much as possible given the precision of the F. P. package.

Not incidentally, the Microsoft 9 digit package carries a full guard BYTE during add, subtract and multiply calculations but only a BIT after the divide calculation. Therefore, the result of the 68000 divide calculation is in general significantly more accurate than the Microsoft algorithm.

WHY WOULD ANYBODY IN HIS RIGHT MIND GO TO ALL THAT EFFORT? Especially since a nice simple 32 bit 'shift and maybe subtract' algorithm would be even simpler than the 64 bit version printed last month? Because this (admittedly complex) algorithm runs about two and a half times faster, that's why!

Next issue we will discuss the possible normalization and adjustment of the exponent after the integer divide, plus the similar operations for the F. P. multiply to go with the integer multiply of the first issue.

If we continue to publish REDLANDS in half size format, we will also be able to discuss the mini-monitor we have written for the 68000 and methods of transferring data between the 68000 and its host.

We now (Sept 28 '81) have enough subscribers to form a pretty good cross section of those people who own Pets or Apples and are interested in the 68000. We need some feedback from YOU, the fully paid-up reader of REDLANDS. What questions do you have about the 68000 or its relationship to a particular host? What ideas do you have for future issues of DTACK GROUNDED? Do you think we ripped off your $15? What do you think of printing REDLANDS half size to get more info per issue?

If you will include an SASE we promise to answer any reasonable questions you may have even if we don't publish them ("How does the instruction set of the 68000 work" is an UNREASONABLE question). Changing this monologue to a dialog will work to everyone's benefit. Who knows, we might even make sense some day!

                         OPT P=68000,BRS,FRS
0013A6                   ORG $0013A6

0013A6 7001       OVFL   MOVEQ.L #1,D0       1= OVERFLOW
0013A8 6602              BNE ERR2A
0013AA 7002       DIVBY0 MOVEQ.L #2,D0       2= DIV BY ZERO
0013AC 4EF8 1392  ERR2A  JMP ERROR           (REPORT FATAL ERROR)
0013B0 4EF8 127E  DRZER  JMP RZER            (THE RESULT IS ZERO)

********************************************
*FETCH A 6 BYTE F.P. NUMBER TO FPACC#2, ROUND
*FPACC#1, THEN DIVIDE FPACC#1 INTO FPACC#2.
*THE RESULT IS STORED IN FPACC#1 WITH GUARD.
********************************************

0013B4 31D8 190A  FPDIV  MOVE.W (A0)+,S2.W
0013B8 21D8 190C         MOVE.L (A0)+,M2.W

********************************************
* FETCH MANT1, BOTH EXPONENTS TO DATA REGS *
********************************************

0013BC 4244       FPDIV1 CLR.W D4
0013BE 4246              CLR.W D6
0013C0 1C38 1903         MOVE.B X1.W,D6      (FETCH EXP1 TO D6)
0013C4 67E4              BEQ DIVBY0          CAN'T DIV BY 0
0013C6 1838 190B         MOVE.B X2.W,D4      (FETCH EXP2 TO D4)
0013CA 67E4              BEQ DRZER           ZERO IF EXP2= 0
0013CC 2038 1904         MOVE.L M1.W,D0      (FETCH MANT1 TO D0)

********************************************
*ROUND X1, M1, G1 (X1, M1 ARE IN DATA REGS)
********************************************

0013D0 E1F8 1908         ASL G1.W            (MSB OF GUARD BYTE TO CY)
0013D4 640A              BCC FPDIV2          SKIP IF NO CY
0013D6 5280              ADDQ.L #1,D0        INCR MANT1
0013D8 6406              BCC FPDIV2          SKIP IF NO CY

********************************************
* SET MANT1= $80000000 AND INCREMENT EXP1 *
********************************************

0013DA E290              ROXR.L #1,D0
0013DC 5206              ADDQ.B #1,D6

********************************************
* THE RESULT IS ZERO IF EXP1 OVERFLOWS *
********************************************

0013DE 67D0              BEQ DRZER           RZER ON X1 OVFL

********************************************
* CALCULATE THE EXPONENT OF THE RESULT *
********************************************

0013E0 9846       FPDIV2 SUB.W D6,D4         16 BIT SUBTR
0013E2 D87C 0080         ADD.W #128,D4       (CORRECT EXP)
                  * (DO NOT TEST FOR OV/UNFL NOW)

********************************************
*CALCULATE AND STORE THE SIGN OF THE RESULT
********************************************

0013E6 1C38 190A         MOVE.B S2.W,D6
0013EA BD38 1902         EOR.B D6,S1.W

********************************************
* PERFORM 32 BIT INTEGER DIVISION *
********************************************

0013EE 4245       DIV    CLR.W D5
0013F0 2E38 190C         MOVE.L M2.W,D7
0013F4 BE80              CMP.L D0,D7
0013F6 673A              BEQ ONE             RESULT ONE IF EQL
0013F8 6504              BCS D0GTD7
0013FA 5645              ADDQ.W #3,D5
0013FC 9E80              SUB.L D0,D7
0013FE 3245       D0GTD7 MOVEA.W D5,A1
001400 4EB8 1440         JSR DIVSUB
001404 4843              SWAP.W D3
001406 2E06              MOVE.L D6,D7
001408 4EB8 1440         JSR DIVSUB

********************************************
*REMAINDER IN D6, BITS 5 THRU 8 (GUARD BYTE)
********************************************

00140C 8CC1              DIVU.W D1,D6

********************************************
*THE INTEGER DIVISION IS DONE; NOW NORMALIZE
********************************************

00140E 3A09              MOVE.W A1,D5
001410 6708              BEQ DIVX

*********************************************
*THE RESULT IS EQUAL TO OR GREATER THAN ONE;
*NORMALIZE RIGHT & INCREMENT EXP1
*********************************************

001412 E24D              LSR.W #1,D5
001414 E293              ROXR.L #1,D3
001416 E256              ROXR.W #1,D6
001418 5244              ADDQ.W #1,D4

*********************************************
* TEST EXPONENT FOR UNDERFLOW OR OVERFLOW
*********************************************

00141A 3004       DIVX   MOVE.W D4,D0
00141C 6B92              BMI DRZER
00141E 0240 FF00         ANDI.W #$FF00,D0
001422 6682              BNE OVFL

*********************************************
*RETURN EXP1, MANT1, GUARD BYTE TO FPACC#1
*********************************************

001424 11C4 1903         MOVE.B D4,X1.W
001428 21C3 1904         MOVE.L D3,M1.W
00142C 31C6 1908         MOVE.W D6,G1.W
001430 4E75              RTS

*********************************************
*THE TWO MANTISSAS ARE EQUAL; THE RESULT IS #1
*********************************************

001432 5244       ONE    ADDQ.W #1,D4
001434 263C 80000000     MOVE.L #$80000000,D3
00143A 4246              CLR.W D6
00143C 4EF8 141A         JMP DIVX.W          *(RETURN, STORING X1, M1, GUARD)

*********************************************
* SUBROUTINE: DIVIDE 16 BITS INTO 32
*********************************************

001440 2C07       DIVSUB MOVE.L D7,D6
001442 2200              MOVE.L D0,D1
001444 4241              CLR.W D1
001446 4841              SWAP.W D1
001448 8EC1              DIVU.W D1,D7
00144A 6804              BVC DIVOK
00144C 3E3C FFFF         MOVE.W #$FFFF,D7
001450 3407       DIVOK  MOVE.W D7,D2
001452 3607              MOVE.W D7,D3
001454 C4C0              MULU.W D0,D2
001456 CEC1              MULU.W D1,D7
001458 3A02              MOVE.W D2,D5
00145A 4242              CLR.W D2
00145C 4842              SWAP.W D2
00145E DE82              ADD.L D2,D7
001460 4242              CLR.W D2
001462 9445              SUB.W D5,D2
001464 9D87              SUBX.L D7,D6
001466 6A08              BPL SHIFTW
001468 5343       NEG    SUBQ.W #1,D3
00146A D440              ADD.W D0,D2
00146C DD81              ADDX.L D1,D6
00146E 6BF8              BMI NEG
001470 4846       SHIFTW SWAP.W D6
001472 3C02              MOVE.W D2,D6
001474 4E75              RTS

                  S1     EQU $001902
                  X1     EQU $001903
                  M1     EQU $001904
                  G1     EQU $001908
                  S2     EQU $00190A
                  X2     EQU $00190B
                  M2     EQU $00190C
                  RZER   EQU $00127E
                  ERROR  EQU $001392