134 lines
		
	
	
		
			5.1 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			134 lines
		
	
	
		
			5.1 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
# Multiplication Routines
 | 
						|
 | 
						|
This directory includes two implementations of byte by byte multiplication
 | 
						|
in native code. It's pretty fast, but comes at the cost of a very large 
 | 
						|
(two pages) lookup table.
 | 
						|
 | 
						|
The algorithm used is [Quarter square multiplication](https://en.wikipedia.org/wiki/Multiplication_algorithm#Quarter_square_multiplication)
 | 
						|
with some caveats: multiplying two 8-bit values would require a four page
 | 
						|
lookup table, and I couldn't find a way to implement that.
 | 
						|
However it was possible to implement a two-page table sufficient for
 | 
						|
multiplying 7-bit numbers.
 | 
						|
This is then extended by other means.
 | 
						|
 | 
						|
This code has not been tested on real hardware,
 | 
						|
and it uses instructions that do not seem to be used elsewhere
 | 
						|
(bne ac and bge ac). gtemu.c and asm.py seem happy with them,
 | 
						|
so I do expect them to work.
 | 
						|
 | 
						|
The purpose of this code is threefold:
 | 
						|
 | 
						|
* Just for fun. I read about the algorithm and wanted to see how hard it was to implement, and if it turned out to be fast.
 | 
						|
* As a demonstration of an approach to native Gigatron programming. I've tried to make the code relatively easy to read, more below.
 | 
						|
* As a testbed and demonstration of py-gtemu
 | 
						|
 | 
						|
## What's here
 | 
						|
 | 
						|
There are two implementations, and a Mandelbrot demo.
 | 
						|
 | 
						|
### The proof of concept
 | 
						|
 | 
						|
The first implementation,
 | 
						|
and the one you should read if you want to understand the code is src/multiplication.asm.py.
 | 
						|
 | 
						|
This is a file that contains nothing but the multiplication routine.
 | 
						|
It was where I developed the code.
 | 
						|
If loaded on a Gigatron it would do nothing at all,
 | 
						|
it is exercised only by the tests in test/test_multiplication.asm.py
 | 
						|
 | 
						|
If you want to understand how it works,
 | 
						|
I suggest reading the commits that affected those files,
 | 
						|
rather than reading it all in one go.
 | 
						|
It should break the implementation into logical chunks:
 | 
						|
first looking up the low-byte of a quarter square,
 | 
						|
then looking up the high-byte,
 | 
						|
then implementing the full routine for 7-bit numbers,
 | 
						|
then extending it to 8-bit numbers.
 | 
						|
As the code got bigger the whole file became harder to understand,
 | 
						|
but I tried to keep the diffs readable.
 | 
						|
 | 
						|
### The SYS function
 | 
						|
 | 
						|
The second implementation is the same thing, but with some optimisations,
 | 
						|
and implemented as a SYS function.
 | 
						|
This is built in to a modified version of the ROMv5a.asm.py script. 
 | 
						|
It can be found in sys/ROM.asm.py.
 | 
						|
Again, the diffs are probably the best way to view the code,
 | 
						|
as there is naturally a lot of unrelated code in there.
 | 
						|
The reason that this had to be written in, 
 | 
						|
and not a seperate .py file, 
 | 
						|
is that I need code in the page after the shift table.
 | 
						|
 | 
						|
The major differences between the two versions are that in the SYS function:
 | 
						|
 | 
						|
* I neglected comments
 | 
						|
* I store the lookup table inverted, because
 | 
						|
* I do the subtraction step before the addition. This makes things a bit faster
 | 
						|
* I do the 7-bit multiplication first instead of last.
 | 
						|
  This means that the control flow splits but does not merge again,
 | 
						|
  which is better for SYS functions.
 | 
						|
* The cycle counting is done "Marcel style" with each instruction being numbered.
 | 
						|
 | 
						|
The sys function takes 120 cycles,
 | 
						|
or 92 if both bytes are known to be <128, 
 | 
						|
i.e. they have the top bit clear.
 | 
						|
I haven't used a slot to allow for the code moving, 
 | 
						|
but just started in the middle of the lookup table.
 | 
						|
This saves some cycles, and this is just a demo.
 | 
						|
 | 
						|
The SYS function takes the two bytes in sysArgs0 and sysArgs1,
 | 
						|
and returns a 16-bit product in vAC.
 | 
						|
Neither parameter is modified,
 | 
						|
to make it easy to reuse them on subsequent calls, 
 | 
						|
but sysArgs2-5 are trampled.
 | 
						|
 | 
						|
### Mandelbrot
 | 
						|
 | 
						|
In order to see how performant the SYS function is,
 | 
						|
I modified Mandelbrot.gcl to make use of it.
 | 
						|
 | 
						|
The answer is quite a lot faster than the vCPU shift-and-add implementation,
 | 
						|
but not as much as I might have hoped.
 | 
						|
In the default graphics mode it completes in under 8 minutes to render the first view,
 | 
						|
going by the clock on the screen.
 | 
						|
This takes something like 17 minutes using Marcel's code.
 | 
						|
 | 
						|
I suspect that there is scope for improvement,
 | 
						|
either by avoiding calling the SYS function with zero as a parameter,
 | 
						|
or rearranging code to minimise stalls.
 | 
						|
I'd like to do some profiling to see where time is being spent.
 | 
						|
 | 
						|
## How to run it
 | 
						|
 | 
						|
### To run the Mandelbrot demo
 | 
						|
Assuming you are in this directory,
 | 
						|
you can add ../../../Core to your Python module search path (e.g. set PYTHONPATH) and run
 | 
						|
 | 
						|
    python3 ./sys/ROM.asm.py "Reset=../../../Core/Reset_v5.gcl" "Main=mandelbrot/Mandelbrot.gcl" "Boot=../../../Apps/CardTest/CardBoot.gcl"
 | 
						|
 | 
						|
(convert slashes as appropriate on your platform)
 | 
						|
 | 
						|
This should generate a ROM.rom and ROM.lst file, which you can run in your preferred emulator
 | 
						|
 | 
						|
### To run the tests
 | 
						|
As above you need the Core directory on the search path, and you will also need:
 | 
						|
 | 
						|
* py-gtemu (follow the instructions in Contrib/psr/py-gtemu)
 | 
						|
* pytest
 | 
						|
* hypothesis
 | 
						|
 | 
						|
With these available you can just run pytest from this directory
 | 
						|
 | 
						|
You can see my approach to this
 | 
						|
(creating a virtualenv, with a .pth file for Core, and installing dependencies)
 | 
						|
in psakefile.ps1.
 | 
						|
 | 
						|
## Future work
 | 
						|
 | 
						|
I'd like to do the following
 | 
						|
 | 
						|
* Do more acurate benchmarking of Mandelbrot
 | 
						|
* Move the code for loading Mandelbrot.gcl in the test to py-gtemu
 | 
						|
* Do some profiling of MulShift8, and see if it can be sped up
 | 
						|
* See if using PyPy helps make any of the above faster
 |