Have there been any parallel blitter implementations?

Question

I read about blitters - history and technical implementation (what I could understand), for example in Amiga and Bally Astrocade.

The principle is that there is a fast copying of memory from one place to another, faster than the processor could do.

Most of all, a blitter is needed to speed up the output of 2D graphics. And here I have a question.

Let's assume that the pixel format is RGB332 = 8 bits per pixel, 256 colors in one palette.

If you have bitmaps in RAM, then the best case for copying speed is 1 pixel per cycle. And most likely you will need to use DMA. Or, without DMA, dedicate one memory chip to the blitter alone.

But the copying process can be parallelized. There are 32 KB of video memory for bitmaps. These 32 KB are divided into 4 blocks of 8 KB. And here the best case is 4 pixels per 1 cycle, since reading is done from 4 chips simultaneously. If there are 8 chips of 4 KB, then the best case is 8 pixels per cycle. In addition, it is necessary to make it so that 4 or 8 pixels per cycle can be written to the frame buffer. We do this in a similar way - split the frame buffer into several parallel chips.

That is, we read 4 or 8 pixels from the texture memory in parallel, process them (bit mask, change color channels, ...) and write in parallel to the frame buffer.
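To make the idea concrete, here is a minimal sketch of the scheme in Python. It is only a software model under my own assumptions, not a description of any real machine: the names `split_into_banks`, `banked_copy` and `join_banks` are hypothetical, pixel i is assumed to live in bank i mod N, and source and destination are assumed to start on the same bank so that no data has to be shifted between banks.

```python
def split_into_banks(pixels, n_banks):
    """Interleave pixels so that pixel i lives in bank i % n_banks."""
    return [bytearray(pixels[b::n_banks]) for b in range(n_banks)]


def banked_copy(src_banks, dst_banks, n_pixels, transform=lambda p: p):
    """Copy n_pixels, using each bank at most once per cycle.

    Returns the number of cycles, assuming one read and one write
    per bank fit into a single cycle (an idealised best case).
    """
    n_banks = len(src_banks)
    cycles = 0
    for offset in range((n_pixels + n_banks - 1) // n_banks):
        # In hardware these accesses would happen simultaneously,
        # one per bank; the loop only models them.
        for b in range(n_banks):
            if offset * n_banks + b < n_pixels:
                dst_banks[b][offset] = transform(src_banks[b][offset])
        cycles += 1
    return cycles


def join_banks(banks, n_pixels):
    """Reassemble the linear pixel order from the interleaved banks."""
    n_banks = len(banks)
    return bytes(banks[i % n_banks][i // n_banks] for i in range(n_pixels))


texture = bytes(range(32))                     # 32 RGB332 pixels
src = split_into_banks(texture, 4)
dst = [bytearray(8) for _ in range(4)]
print(banked_copy(src, dst, 32))               # 8 cycles with 4 banks
print(join_banks(dst, 32) == texture)          # True
```

With 8 banks the same 32 pixels take 4 cycles in this model. A copy whose source and destination start on different banks would additionally need some way to route data between banks, which the sketch leaves out.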

This method is more suitable for a blitter, if it has a separate memory for storing textures.

It is impossible to combine this method of organizing texture storage and storing them in the common RAM of the processor.

If I understood correctly, Amiga and Bally Astrocade used a method with one memory chip, without parallelization (the best case is 1 pixel per 1 cycle via DMA). Textures were stored either in RAM (one chip) or in special blitter memory (a separate chip, but also one).

Were there any blitter implementations based on this parallel principle? Why wasn't this method used in early blitters (Amiga, Bally Astrocade, Xerox Alto)?

I can assume that it's a matter of chip price - one 32kb chip could cost less than 4 8kb chips, or 8 4kb chips.

I also doubt that the Amiga blitter is really serial: I read that it can copy data from 3 DMA sources. But I don't understand whether it reads from 3 chips in parallel at once.

Wiki Blitter
Wiki Bally Astrocade
Amiga Blitter Info

Are the values to be read over the same memory bus that the CPU uses, or do you suggest an extra bus for this? — Thorbjørn Ravn Andersen, Commented Dec 5, 2024 at 8:08
When you place all of the video memory (frame buffer, texture memory, sprites) outside of the CPU's address space, you can have fully independent access (and you end up with TI's VDP architecture) — tofro, Commented Dec 5, 2024 at 8:11
The best case for copying is one word per cycle, however many pixels that may be. Almost always more than one for the example machines. — Tommy, Commented Dec 5, 2024 at 13:45
@Tommy, yes, many of those using blitters (Amiga, Atari STE...) had 16-bit busses, and many graphics modes were palette-based, so 1-6 bits per pixel on Amiga, 1-4 bits per pixel on Atari STE, which means anywhere between 2 and 16 pixels per cycle. On the other hand memory (bus) accesses took several cycles (2 on the Atari STE). — jcaron, Commented Dec 5, 2024 at 18:06

John Dallman · Accepted Answer · 2024-12-06 23:58:05Z

Several misconceptions about computers of the 80s using a blitter are leading you to wrong conclusions:

A blitter as a custom chip was not a cheap design; it was extremely complex and expensive to do right (only the Amiga managed it in consumer hardware in the first half of the 80s).
A blitter is usually memory (DMA) and frequency bound, so there is a natural limit on how many blitter accesses you can squeeze into the CPU's timeframe. The Amiga could use all DMA channels during video output, leading to CPU starvation. As the frequency of the system is limited to single-digit MHz, there is also no room for more DMA channels.
The Amiga and Atari ST were bitplane-based, not chunky-pixel-based. So, in a monochrome display, the blitter could move 16 pixels per cycle (you need to use the 3 input channels for the graphic data, the background, and a pixel mask: depending on the pixel mask, you use either the pixel of the graphic data or the background). You need to blit multiple bitplanes for more colors (up to 6 on Amiga; up to 4 on Atari ST).
Consumer hardware usually had a shared memory layout where the CPU and custom chips shared the same memory, since this was cheaper to produce.
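The masked combination of three inputs mentioned above can be sketched in a few lines of Python. This is only an illustration of the logic on 16-bit words, one bitplane at a time; the function names `blit_word` and `blit_planes` are made up for the example and do not correspond to any hardware register or API.

```python
WORD_MASK = 0xFFFF  # one 16-bit word holds 16 pixels of a single bitplane


def blit_word(mask, source, background):
    """Take the source bit where the mask bit is 1, else the background bit."""
    return ((mask & source) | (~mask & background)) & WORD_MASK


def blit_planes(mask_words, source_planes, background_planes):
    """Apply the same mask to every bitplane; more colors mean more passes."""
    return [
        [blit_word(m, s, b) for m, s, b in zip(mask_words, src, bg)]
        for src, bg in zip(source_planes, background_planes)
    ]


# One word, one plane: the upper 8 pixels come from the source,
# the lower 8 pixels from the background.
print(hex(blit_word(0xFF00, 0xAAAA, 0x5555)))  # 0xaa55
```

Each call to `blit_word` handles 16 pixels of one plane at once, which is the sense in which a bitplane blitter is already parallel across pixels; the cost of more colors shows up as more planes to process.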

To summarize: a second blitter would strain the small number of DMA channels available, it would be prohibitively expensive as a second chip, and the pixel format does not allow your access scheme to process different pixels in parallel. Also, a dedicated memory architecture was too costly for the 80s generation of 16-bit machines.

"Consumer HW had usually a shared memory layout where CPU and custom chips were sharing the same memory"... including the video chip, which could easily use half of the available bandwidth, though not necessarily all the time. — jcaron, Commented Dec 5, 2024 at 18:08
I would think the biggest opportunity for improvement given the Amiga's overall memory architecture would be to treat the memory as being divided into 32-bit chunks, and allow aligned 32-bit loads and stores to be performed using a single /RAS transaction. Instead of each 16-bit access requiring two system clock cycles, pairs of accesses would take three. That would cut video bus bandwidth usage by almost a third, and cut the number of bus cycles required to perform large blitter operations by a third as well. When using e.g. 32-color halfbright mode, ... — supercat, Commented Dec 5, 2024 at 21:36
...the number of cycles available per line would increase from about 205 to 265, while the number of cycles per pair of transfers would drop from 4 to 3. — supercat, Commented Dec 5, 2024 at 21:39
It doesn't have a Blitter but the Acorn Archimedes is heavily oriented around page-mode accesses — as well as the CPU natively signalling sequential versus random accesses (and being much more likely to perform sequential ones per its RISC load/store nature), video and audio always take the bus for long enough to grab several words at a time and then proceed from there via their own buffers. The thing is really optimised for bandwidth versus RAM cost. — Tommy, Commented Dec 5, 2024 at 21:56
@CraigEstey: I found it: amigadev.elowar.com/read/ADCD_2.1/Hardware_Manual_guide/… "If you specify four high resolution bitplanes (640 pixels wide), bitplane DMA needs all of the available memory time slots during the display time just to fetch the 40 data words for each line of the four bitplanes (40 * 4 = 160 time slots). This effectively locks out the 68000 (as well as the blitter or Copper) from any memory access during the display, except during horizontal and vertical blanking." A similar scenario can occur on displaying 6 BitPlanes and using the blitter — Peter Parker, Commented Dec 9, 2024 at 10:19

Zac67 · Accepted Answer · 2024-12-06 22:22:43Z

the best case for copying speed is 1 pixel per 1 cycle

With a 16-bit bus, yes. First clock: read two pixels, second clock: write two pixels => two pixels copied in two clocks.

With a 32-bit bus, you could double that throughput (which is what AGA did for Lisa and CPU DMA). Quadruple for 64 bit, and so on. Modern high-end GPUs use 256+ bit busses for a reason.
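The arithmetic behind these figures can be written out as a small Python helper. It is a back-of-the-envelope sketch under stated assumptions: chunky pixels, one bus access per clock unless told otherwise, and a copy that needs one read access plus one write access that cannot overlap. The function name `copy_pixels_per_clock` is invented for the example.

```python
from fractions import Fraction


def copy_pixels_per_clock(bus_bits, bits_per_pixel, clocks_per_access=1):
    """Best-case copy rate when each word is read once and written once."""
    pixels_per_access = Fraction(bus_bits, bits_per_pixel)
    clocks_per_copied_word = 2 * clocks_per_access  # one read + one write
    return pixels_per_access / clocks_per_copied_word


for bus_bits in (16, 32, 64):
    print(bus_bits, copy_pixels_per_clock(bus_bits, 8))
# 16 1
# 32 2
# 64 4
```

Passing `clocks_per_access=2` models a bus where every access takes two clocks, which halves each of these rates.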

But the copying process can be parallelized.

Yes, exactly - just a matter of bus width.

These 32 KB are divided into 4 blocks of 8 KB.

That doesn't work unless each block has its own, independent bus. Four 16-bit busses equal one 64-bit bus effectively. Each bus can only run a single operation at any one time.

Were there any blitter implementations based on this parallel principle?

Your "parallel principle" doesn't exist. Generally, it's all about bus width and clock speed. The systems you describe did work "in parallel" (in contrast to contemporary DSPs which mostly worked in serial).

(Of course, four 16-bit busses are not the exact same as a single 64-bit bus, but their maximum throughput is. Four busses allow for up to four independent operations which a single bus doesn't.)

Jerry Coffin · Accepted Answer · 2024-12-06 07:40:35Z

IBM EGA card, read mode 1/write mode 1.

The EGA card arranged its memory into 4 planes. All four planes were at the same range of addresses (normally starting from 0xA0000).

In the card's read mode 1, a read from an address in the card's range would read the byte at that address from each of the 4 planes into 4 latches on the card. In write mode 1, a write to an address in the card's range would write a byte from each of the four latches into the specified address in each of the four planes.

Note that in this mode, the data returned to the CPU from the read is useless, and the data coming from the CPU for the write is ignored--only the read and write addresses are actually used. The data involved is read from memory to latches on the EGA card, and written from the latches to the memory.

In most modes, the EGA supported 16 colors, so a pixel was composed of one bit from each plane. It did also support monochrome mode, which used only 1 bit per pixel (so a byte from each plane was 32 pixels).

With the EGA in read mode 1 and write mode 1, you could do bit blitting, 8 or 32 pixels at a time (depending on the display mode).
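The latch mechanism described above can be modelled in Python as follows. This is a toy simulation for illustration only, not code that programs a real EGA: the class `PlanarLatchMemory` and the function `latch_copy` are hypothetical names, and the value handed back to the CPU on a read is just a placeholder, since it is not used for the copy.

```python
class PlanarLatchMemory:
    """Toy model: several planes share one address range and one latch set."""

    def __init__(self, size, n_planes=4):
        self.planes = [bytearray(size) for _ in range(n_planes)]
        self.latches = [0] * n_planes

    def read(self, addr):
        # A single CPU read loads one byte from every plane into the latches.
        self.latches = [plane[addr] for plane in self.planes]
        return 0  # placeholder: the CPU-visible value is useless for copying

    def write_latched(self, addr, cpu_data=0):
        # A single CPU write stores every latch into its own plane;
        # cpu_data is ignored, only the address matters.
        for plane, latch in zip(self.planes, self.latches):
            plane[addr] = latch


def latch_copy(mem, src, dst, count):
    """Copy count addresses; each read/write pair moves one byte per plane."""
    for i in range(count):
        mem.read(src + i)
        mem.write_latched(dst + i)


mem = PlanarLatchMemory(16)
for p, plane in enumerate(mem.planes):
    plane[0:4] = bytes([p * 16 + i for i in range(4)])  # distinct data per plane
latch_copy(mem, 0, 8, 4)
print([bytes(plane[8:12]).hex() for plane in mem.planes])
```

In this model every read/write pair moves four bytes, one per plane, even though the CPU only ever issues byte-wide accesses.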

Usually, "Blitter" in this context means something that can work with DMA independently of CPU reads and writes, and then you run into the bandwidth issues described in the other answer. — dirkt, Commented Dec 6, 2024 at 14:04
"Blitter" comes from "bit blit", which means performing some logical operations on the destination according to the input, for example an XOR for fast undoing, or some recombining of different sources for masking. It is a verbatim copy only in the simplest cases. — Peter Parker, Commented Dec 6, 2024 at 15:23
@PeterParker: The EGA/VGA accommodate some of the more complex cases as well (in mode 2 and 3). But the question only mentioned copying, so that's all I discussed in the answer. — Jerry Coffin, Commented Dec 6, 2024 at 15:27
'Blit' in my opinion comes from the PDP-6 'Block Transfer' instruction, opcode 251 octal, mnemonic BLT, pronounced 'blit'. This operated within addressable memory, so was not specifically graphics-oriented, nor did it do anything other than transfer. But it is likely an ancestor of the 'bit blit'. — dave, Commented Dec 6, 2024 at 18:27
@JerryCoffin But I've never heard of anybody actually using the DMA controller on a PC for this, while on the Amiga and on more modern PC cards the Blitter transfers were all CPU independent. Wikipedia says "and in parallel with the CPU, while freeing up the CPU's more complex capabilities for other operations", and the Xerox Alto Blitter was indeed parallel to the CPU execution. So at least for me, the EGA latch "trick" wouldn't count as a "blitter". — dirkt, Commented Dec 6, 2024 at 21:06
