I'd start with this though:
PokeMon wrote:
Anyway you connect /CE (or Clkinhibit) together with CLK of the shift register which is very strange and may cause some side effect as well.
So when the clock is rising it should move the shift register while the clk is inhibit as soon as it is high.
You never know whats happening inside.

Well, actually you do know what's happening inside: the datasheet shows that both clock and clock inhibit are thrown into a NOR gate, and thus are interchangeable.

The issue here stems from the '165 used: its parallel load operation is asynchronous. That is: the 8 internal bits can be updated at any time, regardless of where in the clock period you are.
Better would be to use a '166 here. Its load operation is synchronous, that is: the 8 internal bits are always updated at a clock edge. When the "load" input is active, they're updated with the parallel input data. If not, the already-present bits are shifted. Either way, the serial output is updated at a fixed, regular interval.
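To make the "always updated at a clock edge" point concrete, here's a rough behavioural sketch in Python. It's not a model of the real part: the class and function names are made up for illustration, it assumes bit 7 comes out first (as in the bit numbering used in this post), and it ignores setup/hold times entirely.

```python
class SyncLoadShiftRegister:
    """Sketch of a '166-style register: state changes only on a clock edge."""

    def __init__(self):
        self.bits = 0

    @property
    def serial_out(self):
        # Assumption: bit 7 is the one currently on the serial output.
        return (self.bits >> 7) & 1

    def clock_edge(self, load, parallel_in=0, serial_in=0):
        if load:
            # Load enable active at the edge: take the parallel data.
            self.bits = parallel_in & 0xFF
        else:
            # Otherwise shift the bits that are already there.
            self.bits = ((self.bits << 1) | (serial_in & 1)) & 0xFF


def stream(chars):
    """One output bit per clock edge; load on every 8th edge, shift otherwise."""
    reg = SyncLoadShiftRegister()
    out = []
    for ch in chars:
        for n in range(8):
            reg.clock_edge(load=(n == 0), parallel_in=ch)
            out.append(reg.serial_out)
    return out


print(stream([0xA5, 0x3C]))
```

Every edge produces exactly one new output bit, whether that edge was a load or a shift, so there's nothing in the output stream to tell the two apart.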
You have one constraint with either the '165 or the '166: the input data must be present around the time that the last bit of a character (0) 'runs out' and the first bit of a new character (7) is needed. Since you have a time window in the Z80 "execute NOP" cycle, there is some freedom in choosing where in that window the parallel load occurs. Read: at some edge of the pixel clock where the ROM access time has been fulfilled and the data bus still has the pixel data on it.
With the '166, all that's needed next is making sure that "load enable" is active leading up to the point where the bit 0->7 clock edge occurs, and perhaps for a short while after ("data hold time"). It's not that hard to meet that condition, and the result is a perfectly regular stream of bits. No "re-synchronisation" needed.
With the '165 however, you also need to make sure that the "parallel load" signal occurs at exactly the same time as that clock edge arrives. That is a much harder condition to meet, imho. Too early, or too late, and you'll have pixels that are stretched or chopped up - which is what you're seeing here. A tiny bit too early or late, and the problem can be small - but you'll still have it.
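For what it's worth, the stretching/chopping is easy to reproduce in a toy simulation. The sketch below is only a behavioural guess at a '165-style register, not taken from a datasheet timing spec: it assumes 4 time ticks per pixel clock (an arbitrary number), that the register takes the parallel data for as long as the load pulse is active, that no shift happens while it's active, and that the serial input is tied low. The names `simulate_165` and `pixel_widths` are hypothetical.

```python
from itertools import groupby

TICKS_PER_PIXEL = 4                     # assumed sub-clock time resolution
TICKS_PER_CHAR = 8 * TICKS_PER_PIXEL


def simulate_165(chars, load_offset, load_width=1):
    """Sample the serial output once per tick.

    load_offset: ticks between the ideal clock edge and the start of the
    load pulse (negative = too early, positive = too late).
    """
    reg = chars[0] & 0xFF               # first character assumed preloaded
    samples = []
    for t in range(len(chars) * TICKS_PER_CHAR):
        k, phase = divmod(t - load_offset, TICKS_PER_CHAR)
        loading = 1 <= k < len(chars) and phase < load_width
        if loading:
            # Asynchronous load: takes effect right away, wherever we are
            # in the clock period; shifting is assumed blocked meanwhile.
            reg = chars[k] & 0xFF
        elif t > 0 and t % TICKS_PER_PIXEL == 0:
            reg = (reg << 1) & 0xFF     # rising clock edge: shift, serial in = 0
        samples.append((reg >> 7) & 1)
    return samples


def pixel_widths(samples):
    """Run lengths of the output; with alternating bits, one run = one pixel."""
    return [len(list(group)) for _, group in groupby(samples)]


pattern = [0xAA, 0xAA]                  # alternating bits make every pixel visible
print(pixel_widths(simulate_165(pattern, load_offset=0)))
print(pixel_widths(simulate_165(pattern, load_offset=-1, load_width=2)))
print(pixel_widths(simulate_165(pattern, load_offset=1)))
```

With the load exactly on the edge, every pixel comes out 4 ticks wide. One tick early gives a 3-tick pixel followed by a 5-tick one at the character boundary; one tick late gives 5 followed by 3. Everything else in the line is fine, which matches the "small, but still there" kind of glitch.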
Personally I would go & draw a timing diagram here, with all clocks & relevant signals in it, datasheets of the logic ICs in hand, to get a good view of the sequence of events. And then re-organise that shift register load mechanism as needed (going with a '166 if it doesn't require too much rework).