May 13 2015

Fast legoization

Optimizing "legoizer" web worker application to run 10 times faster.

I saw legoizer application at ScotlandJs 2015 in Edinburgh. The presentation by Amy Wibowo @sailorhg and Matt Baker @reissbaker showed a hackathon project they implemented: drop any image into the website to convert into a valid Lego combination of flat pieces.

Starry Night original

Starry Night lego

You can try the conversion yourself at the original website http://sailorhg.github.io/legoizer/ using this image.

After the conversion is done, the app even gives you the list of Lego pieces to buy. Not surprisingly, the "Starry Night" will need a lot of blue pieces (this is just the start of the list)

Starry Night shopping list start

I thought the conversion was really cool but a little bit slow. Even for a small image (the above "Starry Night" image is only 600 by 500 pixels) the website pauses for a second or two. Why can't the conversion process spit the result quickly? I decided to investigate in what I call "drive-by optimization".

The original code

The original code is not on github yet, but the unminified browserified code is available at the original website and can be used to investigate the performance issues. The code contains two main bundles: index.js and worker.js. Both were built from the same source files, thus they have common source parts. For our purpose the algorithm only has several steps.

Send the image to a web worker
- The webworker will figure out for each pixel the closest Lego color (out of 24 widely available colors you can buy in bulk.)
The web worker sends the converted pixels back to the main thread.
The main thread figures out how to position blocks to cover the specific colors.

The standard 24 colors you can buy in bulk are simple:

var palette = {
    red: {R: 196, G: 40, B: 27 },
    green: {R: 40, G: 127, B: 70 },
    blue: {R: 13, G: 105, B: 171},
    black: {R: 27, G: 42, B: 52},
    ...
    darkAzure: {R: 20, G: 152, B: 215},
    coolYellow: {R: 253, G: 234, B: 140}
};

In the first step of the algorithm, we need to find for each pixel from the given image the closest Lego color. The team used perceptual difference metric CIEDE2000. This metric finds a Lego color that "looks" close to any given color. While accurate, the difference formula is very computationally expensive, involving multiple trigonometric functions and square roots. For example, the following code forms the last few lines of the implementation

last lines of the CIEDE2000 JavaScript implementation

var T = 1-0.17*cos(radians(a_hp-30))+0.24*cos(radians(2*a_hp))+
  0.32*cos(radians(3*a_hp+6))-0.20*cos(radians(4*a_hp-63));
var d_ro = 30 * exp(-(pow((a_hp-275)/25,2)));
var RC = sqrt((pow(a_Cp, 7.0)) / (pow(a_Cp, 7.0) + pow(25.0, 7.0)));
var SL = 1 + ((0.015 * pow(a_L - 50, 2)) /
              sqrt(20 + pow(a_L - 50, 2.0)));
var SC = 1 + 0.045 * a_Cp;
var SH = 1 + 0.015 * a_Cp * T;
var RT = -2 * RC * sin(radians(2 * d_ro));
var dE = sqrt(pow(dLp /(SL * kL), 2) + pow(dCp /(SC * kC), 2) +
              pow(dHp /(SH * kH), 2) + RT * (dCp /(SC * kC)) *
              (dHp / (SH * kH)));

I believe the team took the color difference code from markusn/color-diff repo.

To me, this part looked most expensive, but I needed to prove it by profiling the code first.

Profiling code

I saved the entire legoizer to a local drive and started hacking. First I added timestamps and profile collection to the index.js file and measured end to end color conversion.

index.js

function processImage(err, image) {
  if(err) return;
  console.time('total'); // start timer
  var canvas = canvasFromImage({
        image: image,
        maxSize: MAX_SIZE
      }),
      context = canvas.getContext('2d');
  var bmp = bitmapFromContext(context, canvas.width, canvas.height);
  ...
  WORKER_EMITTER.once('end', function(data) {
    progressView.end(function() {
      instructionsView.render({ // see render function below
        canvas: canvas,
        context: context,
        inBmp: bmp,
        outBmp: Bitmap.fromJson(data)
      });
      done();
    });
    WORKER_EMITTER.removeListener('progress', progressView.update);
  });
}
exports.render = function(opts) {
  console.timeEnd('total');
  ...
};

I placed the console.time and console.timeEnd around all image processing steps, including converting from / to the JSON format. I used Chrome Canary 44.0.2399.0 to measure how long the the code runs in a clean environment without any additional extensions running. The initial code took 2361ms to execute end to end. I also added timestamps to the worker.js to measure how long the image conversion takes without the image transfer costs between the main thread and the web worker (and back)

worker.js

function convert(e) {
  var input = Bitmap.fromJson(e.data);
  console.time('pixelMap');
  var output = pixelMap(input, convertToPalette, 5, 6);
  console.timeEnd('pixelMap');
  postMessage({
    type: 'end',
    data: output.toJson()
  });
}

The timestamp showed that the function pixelMap ran for 1342ms - 50% of the total! This seemed like a good candidate for further profiling. I added console.profile and console.profileEnd next to console.time calls. The same run generated a detailed CPU profile chart. The longest running functions are on the top of the list

The function that calls most of the longest running utilities is the following

module.exports = function(bitmap, mapper, xUnit, yUnit) {
  var r, c,
      aspect = yUnit / xUnit,
      outFragment = new Fragment,
      rows = (bitmap.height / (yUnit / xUnit)) | 0,
      cols = bitmap.width,
      out = new Bitmap(cols, rows, bitmap.channels);
  for(r = 0; r < rows; r++) {
    for(c = 0; c < cols; c++) {
      mapper(bitmap, c, r, aspect, outFragment);
      out.setFragment(c, r, outFragment);
    }
    postMessage({
      type: 'progress',
      data: r / (rows - 1)
    });
  }
  return out;
};

This literally goes through each pixel, calling mapper function and saving the result. The top 3 bottlenecks were ciede2000, xyz_to_lab and rgb_to_xyz functions, taking roughly the same amount of time. Usually I advise to concentrate on the top bottleneck only, but in this case any one would do.

Eliminate some conversions

At some point the algorithm needed to compare two colors. Turns out the code to find the difference was really weird

color.closest = function(target, relative) {
    var key = color.palette_map_key(target);
    var result = color.map_palette([target], relative, 'closest');
    return result[key];
};

The target was a single RGB pixel, while relative is an array of 24 RGB Lego colors. The weird part was making an array with single color [target] and then iterating over two arrays to find closest pairings inside color.map_palette function

color.map_palette

function map_palette(a, b, type)
{
  var c = {};
  type = type || 'closest';
  for (var idx1 = 0; idx1 < a.length; idx1 += 1){
    var color1 = a[idx1];
    var best_color      = undefined;
    var best_color_diff = undefined;
    for (var idx2 = 0; idx2 < b.length; idx2 += 1)
    {
      var color2 = b[idx2];
      var current_color_diff = diff(color1,color2);
      if((best_color == undefined) || ((type === 'closest') && (current_color_diff < best_color_diff)))
      {
        best_color      = color2;
        best_color_diff = current_color_diff;
        continue;
      }
      if (type === 'farthest') {
        ...
      }
    }
    c[palette_map_key(color1)] = best_color;
  }
  return c;
}

Since we always find closest color for a single RGB value, we can eliminate one iteration easily. Also, we are interested in the closest color, thus we can remove the if (type === 'farthest') branch. I made the code changes and ran the end to end timing again. The numbers changed to the following, showing slight improvement

step                |  pixel map function (ms) |  total (ms)
--------------------------------------------------------
initial             |          1342            |    2361
removed inner loop  |          1236            |    2280

Ok, we are only getting started!

Precompute 24 Lego Lab colors

We are using CIEDE2000 difference algorithm that operates on Lab colors. Thus every RGB triple must be converted to Lab color triple. This is an expensive operation, as you saw in the first profile. It involves first converting from RGB to a special XYZ color space and then to Lab color space.

function rgb_to_lab(c)
{
  return xyz_to_lab(rgb_to_xyz(c))
}
function rgb_to_xyz(c)
{
  // Based on http://www.easyrgb.com/index.php?X=MATH&H=02
  var R = ( c.R / 255 );
  var G = ( c.G / 255 );
  var B = ( c.B / 255 );
  if ( R > 0.04045 ) R = pow(( ( R + 0.055 ) / 1.055 ),2.4);
  else               R = R / 12.92;
  if ( G > 0.04045 ) G = pow(( ( G + 0.055 ) / 1.055 ),2.4);
  else               G = G / 12.92;
  if ( B > 0.04045 ) B = pow(( ( B + 0.055 ) / 1.055 ), 2.4);
  else               B = B / 12.92;
  ...
}
function xyz_to_lab(c)
{
  // Based on http://www.easyrgb.com/index.php?X=MATH&H=07
  var ref_Y = 100.000;
  var ref_Z = 108.883;
  var ref_X = 95.047; // Observer= 2°, Illuminant= D65
  var Y = c.Y / ref_Y;
  var Z = c.Z / ref_Z;
  var X = c.X / ref_X;
  if ( X > 0.008856 ) X = pow(X, 1/3);
  else                X = ( 7.787 * X ) + ( 16 / 116 );
  if ( Y > 0.008856 ) Y = pow(Y, 1/3);
  else                Y = ( 7.787 * Y ) + ( 16 / 116 );
  if ( Z > 0.008856 ) Z = pow(Z, 1/3);
  else                Z = ( 7.787 * Z ) + ( 16 / 116 );
  var L = ( 116 * Y ) - 16;
  var a = 500 * ( X - Y );
  var b = 200 * ( Y - Z );
  return {'L' : L , 'a' : a, 'b' : b};
}

Any time you do powers or in this case cubic roots, there is a huge performance cost. Can we convert from RGB to Lab fewer colors? Turns out yes, and quite easily. I have noticed that we convert the 24 Lego RGB colors in every pixel comparison.

function findClosest(rgb, legoColors) {
    var color = rgb_to_lab(rgb);
    var legoLab = legoColors.map(rgb_to_lab);
    // find closest color's index betwee color and legoLab array
    return legoColors[closestIndex];
}

Thus I precomputed the 24 Lego Lab colors right away and just passed them around. Only an individual RGB pixel under the investigation was converted. The performance has indeed improved

step                        |  pixel map function (ms) |  total (ms)
--------------------------------------------------------------------
initial                     |          1342            |    2361
removed inner loop          |          1236            |    2280
precompute Lego Lab colors  |           950            |    1938

Caching computed colors

We convert a lot of colors when comparing RGB pixels to Lab colors. For my next optimization I have added caching such conversions. This cache is similar to a cache that already existed in the program (storing closest picked Lego color).

function rgb_to_lab(c)
{
  var cached = rgbToLabCache[c.R][c.G];
  if(cached[c.B]) return cached[c.B];
  var result = xyz_to_lab(rgb_to_xyz(c));
  cached[c.B] = result;
  return result;
}

The impact was very noticeable (but a little suspicious). Unfortunately I did not have time to investigate - it seems the two caches should be pretty aligned, and only one would actually give cache hit. In any case, the color conversion cache has improved the algorithm's speed

step                        |  pixel map function (ms) |  total (ms)
--------------------------------------------------------------------
initial                     |          1342            |    2361
removed inner loop          |          1236            |    2280
precompute Lego Lab colors  |           950            |    1938
cache RGB -> Lab            |           674            |    1679

Passing data to the web worker

Converting an image raster to a JSON string to pass from the main script to a web worker is an expensive operation. We also convert the result when passing the lego colors back from the web worker. One can get a sense how expensive this is by noticing that the total times are around 2x the time it takes to convert the image.

Luckily, modern browsers support Transferable objects. One thread can pass a reference to an ArrayBuffer (or its subclass) directly. The key observation is that this buffer is not a shared memory buffer. The caller releases the buffer and does not have access to it anymore. This might not work for other cases, but works very well for ours. We can get an instance of UInt8ClampedArray directly from the canvas object holding the input image.

sending pixel array buffer to web worker

var canvas = canvasFromImage({
      image: image,
      maxSize: MAX_SIZE
    }),
    context = canvas.getContext('2d');
var pixelData = context.getImageData(0, 0, canvas.width, canvas.height);
worker.postMessage({
    pixels: pixelData.data.buffer,
    width: canvas.width,
    height: canvas.height,
    channels: 4
}, [pixelData.data.buffer]);

Note that we list the transferable objects, but they could be properties in the data object to be sent

// single transferable object pixelData.data.buffer
worker.postMessage({
    pixels: pixelData.data.buffer,
    ... // other properties
}, [pixelData.data.buffer]);

Similarly, the web worker can send a transferable object back to avoid the expensive marshalling

from the web worker

var bytes = new Uint8ClampedArray(output.pixels);
self.postMessage({
  type: 'end',
  data: bytes
}, [bytes.buffer]);

The above approach eliminated huge object transfer overhead, aligning the total end to end time with the image color conversion time.

step                        |  pixel map function (ms) |  total (ms)
--------------------------------------------------------------------
initial                     |          1342            |    2361
remove inner loop           |          1236            |    2280
precompute Lego Lab colors  |           950            |    1938
cache RGB -> Lab            |           674            |    1679
transferable objects        |           647            |     699

Splitting work across multiple web workers

The pixels from the input image are processed independently of each other. This points to the last performance optimization: split the entire image conversion to multiple web workers executing in parallel. I chose to split the image horizontally into N regions. For example, here are color coded regions if N = 4.

4 image regions computed by 4 web workers

The code send just a region of the original image to the web worker is very straight forward

var N = 4;
var h = Math.ceil(canvas.height / N);
// while there are rows remaining
for (row = 0; row < canvas.height; row += h) {
  var pieceHeight = Math.min(h, canvas.height - row);
  var pixelData = context.getImageData(0, row, canvas.width, pieceHeight);
  var bytes = pixelData.data.buffer;
  // send bytes to the web worker
}

Each web worker sends the mapped pixels back and we can draw them back into the original canvas

// keeps track of finished workers
var pieces = [];
function isDone(x) {
  return x;
}
// when every worker finishes it sends its chunk back
WORKER_EMITTER.on('end', function(data, workerIndex) {
  pieces[workerIndex] = true;
  console.log('worker', workerIndex, 'has finished');
  // handle last irregular number of rows
  pieceHeight = data.length / canvas.width / 4;
  // put converted pixels back into the large image
  var finishedPart = new ImageData(data, canvas.width, pieceHeight);
  context.putImageData(finishedPart, 0, workerIndex * h,
    0, 0, canvas.width, pieceHeight);
  if (pieces.every(isDone)) {
    // render converted image as Lego pieces
    var finishedData = context.getImageData(0, 0, canvas.width, canvas.height);
    var results = bitmapFromContext(context, canvas.width, canvas.height);
    instructionsView.render({
      canvas: canvas,
      context: context,
      outBmp: results
    });
  }
});

The performance gains when splitting into 4 regions were 3x (there is always some overhead when running multiple tasks in parallel)

step                        |  pixel map function (ms) |  total (ms)
--------------------------------------------------------------------
initial                     |          1342            |    2361
remove inner loop           |          1236            |    2280
precompute Lego Lab colors  |           950            |    1938
cache RGB -> Lab            |           674            |    1679
transferable objects        |           647            |     699
4 web workers               |           225            |     250

I tried varying the number of web workers N, but the impact was minimal. Seems there is no point in splitting the work to more than the number of CPU cores (MacBook Pro with 4 cores).

Conclusion

The optimizations I used in this example are very common techniques and are not domain specific

do not compute same information multiple times (cache results)
use modern data passing techniques to avoid expensive data marshalling (transferable objects)
perform independent chunks of work in separate threads (multiple web workers)

The legoizer speed up, while slower than real time 60 frames per second, is almost 10x compared to the original code. Further optimization could be achieved by scaling down the image even more (the original code scales any dropped image to the maximum 200px dimension).

In my opinion even the current speed up almost allows two interesting applications built upon the original idea

convert a gif / movie into Lego-style movie
real time video Lego filter for WebRTC communication

For me personally, the drive by optimizations are an excellent training ground. The lessons learned here will be applied to my day to day work.

I will make my git repo public as soon as the original code is made public (aside from the finished web app).

Optimizing Nth is another example of a drive by optimization of 3rd party library
my other performance blog posts

Better world by better software

Gleb Bahmutov PhD

Our planet 🌏 is in danger

Act today: what you can do