SIMD Within A Register

How can we use SIMD instructions to operate on integer data that appears in sizes the instructions don't understand. There are potential applications for this outside of just Parabix. SWAR could enable using the smallest possible bit-width to encode data with much less hassle. E.g. DNA base pairs can be encoded as a 2 bit number.

Considerations must be made for arithmetic operations because the size of their output can differ from that of their input. In the case of addition for example, care must be taken since overflows will corrupt other data in the register. Similar considerations must be made for other operations.

In its current state, LLVM treats unsupported vector widths atrociously.

Introductory Example

Let's examine the simplest problem: bit vector addition.

What we have:

define <128 x i1> @addi1 (<128 x i1> %v1, <128 x i1> %v2) {
entry:
    %out = add <128 x i1> %v1, %v2
    ret <128 x i1> %out
}

What we get:

53KB of ASM

What we want:

define <128 x i1> @addi1 (<128 x i1> %v1, <128 x i1> %v2) {
entry:
    %v3 = reinterpret_cast <128 x i1> %v1 to <16 x i8>
    %v4 = reinterpret_cast <128 x i1> %v2 to <16 x i8>
    %v5 = xor <16 x i8> %v3, <16 x i8> %v4
    %out = reinterpret_cast <16 x i8> %v5 to <128 x i1>
    ret <128 x i1> %out
}

End goal

Achieve similar results for all common operations (arithmetic and logic) on duos and nybbles.

Further Reading

Fisher, 2003: Randall Fisher's dissertation is the SWAR bible. Coins the term SWAR and provides some optimizations. It was written 15 years ago, but the essence of the topic still lies within.

SWAR and SIMD Techniques: A general introduction to SIMD and SWAR. Includes a list of other publications on the subject of SWAR and SIMD code generation. Apparently, chess engines use SWAR methods to vectorize operations on nybbles and duos.