Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲A Tiny Compiler for Data-Parallel Kernels (healeycodes.com)

50 points by healeycodes 2 days ago | 5 comments

RossBencina 21 hours ago [-]

> Can different iterations of this loop run independently?

I'm not a compiler guy, but vectorisation algorithms typically analyze loop-carried dependencies and can vectorise loops that are not trivially data parallel as is the case in the post. Allen & Kennedy (MK, 2002) discusses the classical methods.

Here's an example, I'm not sure whether the post's algorithm would handle it:

  phase = 0.0
  inc = 0.12345678899

  for i in [0, n):
    out[i] = table[phase % TABLE_LEN]
    phase = phase + inc

Assuming n%4 == 0, this can be trivially 4 lane vectorised as:

  phasev = [phase, phase+inc, phase+inc+inc, phase+inc+inc+inc]
  incv = [inc, inc, inc, inc]

  for i in [0, n) step 4:
    out[i..i+3] = table[phasev % TABLE_LEN]
    phasev = phasev + incv

When I read Allan and Kennedy my impression was that vectorising arbitrary imperative code is a much harder problem than designing a language that only allows for vectorisable constructs to be expressed in the first place. For example maybe it's better to express trivially data parallel kernels as pure functions over buffers and buffer indices. That's how shader programs work isn't it? In my example that would produce different code, requiring a multiplication:

  lambda i: table[(phase0 + i*inc) % TABLE_LEN]

healeycodes 17 hours ago [-]

You're right that my post's algo does not handle this case (it's more of a toy to explain the basic cases).

> When I read Allan and Kennedy my impression was that vectorising arbitrary imperative code is a much harder problem than designing a language that only allows for vectorisable constructs to be expressed in the first place.

Yeah, I buy this.

> In my example that would produce different code, requiring a multiplication:

  lambda i: table[(phase0 + i*inc) % TABLE_LEN]

I think some compilers may lower this without the multiplication.. (e.g. turning it back into an induction variable) but with floats they may not be allowed to, since repeated addition and "phase0 + i*inc" are not strictly equal.

rapatel0 21 hours ago [-]

Look into the Eigen library. They use template meta programming to chain linear algebra operations in a way that the compiler should be able to optimize memory layout and kernels for vector instructions. Might give you some ideas.

https://libeigen.gitlab.io/

Though you can expect very verbose compiler output. (I had 35 pages of compiler output output for a single type error once). Probably Nbd with llms.

tovej 14 hours ago [-]

I've also had terrible performance with Eigen when compared to LAPACK. Would not recommend unless you want to spend time profiling performance issues.

mgaunard 20 hours ago [-]

essentially this is a mini-ISPC?

Rendered at 21:42:03 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.