
In my recent post on optimising zlib decompression for the Apple M1, I used a loop that refilled a bit-buffer and decoded a huffman code each iteration, based on variant 4 from Fabian Giesen's Reading bits in far too many ways (part 2). Since then, I figured out a (possibly novel) strategy for a ~20% faster loop, by taking the refill off the critical path entirely, translating to ~1.06x DEFLATE decompression speed.

I'd spent some time staring at the old loop, trying to save a cycle before writing that post:

```c
// 2-cycle refill:
bitbuf |= read64LE(bitptr) << bitcount;
bitptr += 7;
bitptr -= ((bitcount >> 3) & 7);
bitcount |= 56;
```

This had a 6-cycle decode, plus only 2 cycles of refill latency on the critical path – much better than the naive approach. Surely 7-cycle latency was possible?

After that time, the idea was surprisingly simple: don't let the bit-buffer get empty. If we refill while the bit-buffer still has enough bits for the following table-lookup, that table-lookup needn't wait for the refill. It can instead use the data from the old bit-buffer. By the time the table-lookup is complete, the refill will have completed, and the refilled bit-buffer can be used in later operations, such as the shift. We can have more unrolled 6c decodes as usual:

```c
pending = bitbuf | (read64LE(bitptr) << bitcount);
bitptr += 7;
bitptr -= ((bitcount >> 3) & 7);
bitcount |= 56;
```

In theory, this should have 6-cycle latency. Rough real-world measurements give ~6.77 cycles, down from ~8.33 cycles. Low enough that I'm convinced that there isn't a 7-cycle latency chain I've overlooked, but high enough that it's still clearly beneficial to unroll to avoid refilling.

In my unstable, experimental zlib fork, this change translates to ~1.06x speed, requiring some extra logic to make sure we don't run out of bits. (If we have less than 10 bits after decoding a length/distance pair, I actually do a full refill before the optimised refill, which doesn't make much sense, but it's rare enough that that's what happened to work.)

In a branch-free decode loop using this strategy, we can read at most 56 bits before refilling. This is because we refill such that 56 <= bitcount <= 63, so reading 57 bits would cause this to wrap around. However, we actually fill bitbuf with 64 bits of data each refill, and we can rely on those extra bits when we're using last-iteration's bit-buffer, because we aren't using last-iteration's bitcount. So for unconditional-refill you get 8 bits of slack (as we must assume bitcount could be as low as 56), but for code that branches on bitcount you only get 1 bit of slack (as bitcount after refill may be as high as 63, reflecting all but one bit in bitbuf).

Specifically, this could work nicely for a simpler DEFLATE decoder that unconditionally refills after each literal or length/distance pair (e.g. …). This could read a worst-case 48-bit length/distance pair, while still having 16 bits left in bitbuf, for use by the first table-lookup of the next iteration. I believe this is a new discovery, but please let me know if there's prior art I should credit.

With this change (and a smaller, probably Apple-CPU-specific chunkcopy change), my experimental zlib fork has improved ~8% since the last post, requiring me to rescale my graph. This looks a bit less impressive, as libdeflate is not only included in the graph, but has also improved significantly since my last post. My code reaches ~5.28x speed compared to zlib-madler, ~2.30x compared to zlib-apple, ~1.64x compared to zlib-cloudflare, and ~1.10x compared to libdeflate (for now).

There's a lot still to be explored, like putting pairs of literals with short codes in the decode table (huge win for some data, as we can read two literals for the price of one), and Fabian's suggested improvements to table building (much nicer, but I'm still getting my head around the transpose, and need to decide how it'll work with multi-level tables).

The zlib-dougallj code is on GitHub, under the zlib license, but has not been thoroughly tested.

