
In Postgres, it used to be quite common to use a 4-byte integer auto field for primary keys, until it started becoming somewhat common to run into the 2147483647 limit of 4-byte integers. Now it's become somewhat common to start with 8-byte integer primary keys, just to be safe, because the small additional cost helps avoid potentially major issues down the road. But who will ever have a database with 9223372036854775807 (about 9 billion billion) rows?

This made me wonder why 6-byte integers weren't really a thing. A signed 6-byte integer could hold something like 140 trillion positive values (right?), and would be highly practical for uses like this (database primary keys).

I found this old post, dating back to 2004, asking about 6-byte integers in C. I also found a number of specific technical questions on StackOverflow related to 6-byte integers, but I couldn't find anything well documented or standardized that either A) suggested that 6-byte integers are a common thing, B) suggested they should or might become a thing, or C) explained why they can't, for some technical reason, become a thing.

Is there any reason why it wouldn't be a good idea, from a performance, practicality, etc. standpoint to have 6-byte integers? Feel free to think about this question in the limited context of C and/or C++, if it helps it avoid being too abstract.

My gut tells me that wide-spread adoption of 6-byte integers would present some kind of big performance boost / RAM savings for a lot of projects, especially in the ML/AI space, as well as some small performance / RAM / space savings in the database world.

  • Why are you asking about 6 bytes, and not 5? Why not 7?
    – Flater
    Commented Nov 15, 2024 at 10:53
  • @Flater 48 bits (6 bytes) is halfway between two common types, namely 32-bit and 64-bit integers, providing half the differential benefits and detriments of each.
    – orokusaki
    Commented Jan 12 at 16:51
  • Careful with your definition of "half". The benefit here is the size of the number you can store. A data type that can store half of what a 64 bit integer can store is a 63 bit integer, not a 48 bit one. But more to the point, my question wasn't asking for you to explain that 6 is the midpoint, but rather to imply that 6 seems to be an arbitrary number with no real basis in binary (it's not a power of two) - making it no less arbitrary than 5 or 7.
    – Flater
    Commented Jan 12 at 23:21
  • @Flater - "half of the differential benefits and detriments", not half the number. For example, 8 bytes is the largest in the series we are discussing and has the most benefit, while 4 bytes is the smallest and has the least benefit, in terms of possible uses. 6 is the halfway point.
    – orokusaki
    Commented Jan 14 at 2:09
  • The benefits do not increase linearly, they increase exponentially. "Half" is only in the exact middle for a linear increase. Expressing the benefits of the size of a number type as a linear function makes no sense - doubling the amount of bytes does not double the benefit. What you're doing is the same as arguing that inbetween one thousand (4 digits) and one billion (10 digits), the halfway point is one million (7 digits) because it has 7 digits. But that's not relevant when discussing the value that these numbers express.
    – Flater
    Commented Jan 14 at 2:10

4 Answers


6-byte integers present a couple of issues related to alignment. A 6-byte integer needs to have an alignment of 2 or less; an alignment of 4 or 8 would require 2 padding bytes per element in arrays, negating any density advantage over 8-byte math in arrays.

This presents a problem with cache lines. Conventionally, the alignment is at least as large as the value's size, to ensure that the value cannot cross a cache line boundary. Crossing a cache line boundary significantly increases the complexity of the hardware handling the read (it has to split the read into two transactions, one of 2 bytes and one of 4, each of which could fail) and doubles the latency, as that now needs two cache bus/memory bus transactions rather than one.

That cost likely more than outweighs the 25% improvement in memory density you get from packing in additional values. And in situations where this is worthwhile to reduce main memory bandwidth, the cost of unpacking manually and then doing the math in 64 bits is unlikely to be a bottleneck.

And for situations where you aren't dealing with arrays, it's rare to be dealing with enough copies that 6 vs 8 bytes is significant, and latency from single accesses would likely be dominated by the misalignment costs.

  • @freakish Because addresses are in binary. Which means you can address aligned cache lines by dropping the bottom n bits of the address bus when doing a memory access. And if cache lines are aligned to a power-of-2 boundary, then everything else needs to align to that boundary.
    Commented Nov 14, 2024 at 13:00
  • @freakish What's important is that the cache lines/words are a power of two of the minimum addressable unit. Most modern machines have bytes as the minimum addressable unit, and so have power-of-two-sized cache lines/words. Mainframes like the ICL 1900 were word addressable, and so they had one 24-bit addressable unit per word. And 1 is a trivial power of 2. So if you are willing to have your base addressable unit be a multiple of 3 bits, it is doable, but that comes with a whole lot of bigger compromises when it comes to data interchangeability.
    Commented Nov 14, 2024 at 14:47
  • It's the bus geometry. For an m-bit address you need m lines. For an m-bit address aligned to the nth power of the base, you just take the top m-n lines, and that's the bus address. Any other scheme for calculating bus addresses is more complex and requires more lines.
    Commented Nov 14, 2024 at 16:39
  • @freakish Simplicity is absolutely a goal of modern architectures. They are designed to be the simplest possible design that meets all the performance and compatibility requirements, because complexity makes performance harder and reduces yields. If there is a solution that is simple, zero power, and zero latency? You'd better have a good reason for doing something else.
    Commented Nov 14, 2024 at 16:50
  • @freakish Base 3 is annoyingly difficult to design for, because it just doesn't match how transistors work: you only get 2 saturation regions. Base 4 gets some interest, because you can work with a pair of transistors, but tends to only be seen in places where the density of signals massively outweighs the cost of processing hardware (storage, external high-bandwidth interfaces, etc.).
    Commented Nov 14, 2024 at 23:27

There are two separate issues here:

  • Memory usage
  • Processor support for 6 byte integer math.

These issues are somewhat independent: a lot of the time, numbers are just shuffled around between variables and data structures (no arithmetic operations are performed on them), so even without processor support a language could be designed to store integers in only 6 bytes inside structures. For one-off variables you could still use 8 bytes and 8-byte arithmetic functions (although you might need extra code for overflow conditions).

If you add 6-byte math support to the processor, it would likely have a significant impact on its design: additional instructions would need to be added, and this may have downstream effects on other parts of the processor (for example, cache lines). I doubt there is enough ROI to justify the extra complexity (extra cost of both design and manufacturing), given the potential for only slightly faster execution; even assuming a perfect implementation, you are only going to gain at most a 33% throughput improvement (6 vs 8 bytes moved per value), and only in the cases that can benefit from it.

I would also question how much we as programmers want to think about different sizes of integers (the premature-optimization argument): I usually just use int (32-bit)** even if byte or short might work, and I typically only switch to long for really big numbers, or to byte if I am processing binary data.

** I would consider short or byte if I am storing a lot of them (i.e. allocating a large array of short).

  • Thanks for all the info - I ultimately picked the other answer, because it gave a more concrete set of reasons for the lack of standard support, but this is also a terrific answer. I wish I could choose both.
    – orokusaki
    Commented Nov 15, 2024 at 14:02

The other answers about the convenience of power-of-two sizes being simpler for addressing are undoubtedly correct, but it's perhaps worth pointing out that there are 6-byte integers in common use, for some value of "common".

Specifically, IEEE 802 MAC addresses are 6 bytes, which means 6-byte values are used in every Ethernet and Wi-Fi packet in the world.

Non-standard (i.e., non-power-of-two) sizes are always possible. It's just that they're preferred in situations where space is more important than simplicity. It's also relevant that although these really are integer values, they are mostly not used for integer arithmetic.

  • They're also not unusual to see in DSP architectures, where processing 32-bit samples without loss of precision in intermediate operations is needed, but large memory spaces and the ability to program in higher-level languages are not considered important.
    Commented Nov 17, 2024 at 0:23

It's historical. When hardware manufacturers could improve on 32-bit numbers, they concluded that 64 bits was more of a long-term solution and easier to implement. For example, array indexing with 8-byte elements can be done by shifting the index left by three bit positions, instead of using a multiplication by 6 as 6-byte elements would require. And 8-byte values can be simulated on 32-bit hardware using 2x4 bytes.

There's not much difference, but some, and you don't want both 48-bit and 64-bit hardware. Which means 48 bits makes you fall behind the competition.

  • From a hardware point of view, a shift by 3 is a lot cheaper than a multiplication by 6. A shift by 3 can be done trivially by arranging the address line interconnects if you know it's a fixed shift. Multiplying by 6, however, requires a bunch of logic to calculate ((x << 2) + (x << 1)), and division by 6 requires a divider.
    Commented Nov 14, 2024 at 13:04
