Joyent Manta Storage Service: Image Manipulation and Publishing Part 3 - Fast Resizing with Manta Compute Jobs

October 08, 2013 - by Christopher Hogue, Ph.D.

This post covers how to use the ImageMagick resize commands on the Joyent Manta Storage Service. Converting 0.25 Megapixel images into thumbnails runs 4.7x faster on Manta than on my SSD-equipped notebook. Resizing the giant Getty Open image originals down to 0.25 Megapixels runs 72x faster on Manta.

In Part 1 I introduced the Getty Open Image set we are computing on in this how-to. If you followed along with Part 2, you should be all set for this installment. If not, I have provided a catch-up script for the demo of thumbnail creation on Manta, so you can forge ahead!

Here you will:

  1. Create thumbnails on your local machine using the Unix find -exec construct.
  2. Create thumbnails on Manta using the mfind mjob mpipe pattern.
  3. Learn how to share your Manta job example with mjob share.
  4. Learn how to resize the very large Getty originals and preserve color like a pro.

Catch up!

If you skipped Part 2, let's get you on track first. Have a look at Trevor O's blog for quick Manta install information and very simple image resize examples. You will need the Node.js-based Manta Command Line Utilities installed on your local Unix/Linux system.

Then you can run this to populate the files you need on Manta and on your local Unix system to try the next steps:

Making Thumbnails Locally

On your local machine, you can make thumbnails out of the images individually using the convert command. If you did Part 2, you probably left your files in ~/var/tmp/500x500_webp/getty; otherwise, if you used the script above, it reports at the end where the files are.
Change to that directory and try one convert process:

convert 00000201.jpg -thumbnail 10000@ -strip -quality 95 PNG8:00000201.png

Here we are using the ImageMagick convert command to resize images to a constant area. I use the constant area argument 10000@ because, as we saw in Part 2, the Getty Open image collection comes with a number of different aspect ratios.

To convert the whole batch into thumbnails on your local machine, you can execute the convert command within the Unix find command, as I explained for the One-Machine Unix Map in Part 2.

Here I prefix the run with the Unix time command to report the overall time for creating thumbnails from all 4,596 JPEG images on my notebook. When you try this, expect your CPU to be saturated, because each convert process spawns multiple threads:

real    5m13.196s
user    4m51.707s
sys     1m18.814s

And you can see the output:

$ ls -al *.* | more
-rw-r--r--  1 cwvhogue  staff   25720 10 Sep 15:14 00000201.jpg
-rw-r--r--  1 cwvhogue  staff    5066 19 Sep 12:05 00000201.png
-rw-r--r--  1 cwvhogue  staff   31520 10 Sep 15:14 00000301.jpg
-rw-r--r--  1 cwvhogue  staff    5623 19 Sep 12:05 00000301.png

So it takes a little over 5 minutes on my Mac notebook (with an SSD drive!) to produce the entire set of thumbnails. But this is a tiny example.
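A local batch run of this kind can be sketched as follows. This is a minimal sketch, not necessarily the exact command behind the timing above: the function name make_thumbnails is hypothetical, and it assumes the single-file convert arguments shown earlier, a POSIX sh, and ImageMagick on your PATH.

```shell
# Sketch: write a PNG8 thumbnail next to every JPEG under a directory.
# make_thumbnails is a hypothetical helper name, not a command from this post.
make_thumbnails() {
    dir="${1:-.}"
    # find runs one small sh snippet per file, so the output name can be
    # derived from the input name with ${1%.jpg}.png.
    find "$dir" -type f -name '*.jpg' -exec sh -c '
        convert "$1" -thumbnail 10000@ -strip -quality 95 "PNG8:${1%.jpg}.png"
    ' sh {} \;
}
```

In bash, where time is a shell keyword, you can time the whole batch with: time make_thumbnails .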

Thumbnails on Manta - A Map Process

Thumbnail creation is, in MapReduce parlance, only a map step.

To create the thumbnails on Manta, you will need the Manta directory full of JPEG files as set up in Part 2 or by the script above.

So the Manta thumbnail creation command for this set of images is:

The convert command in this case puts a file named simply out.png into the POSIX read/write filesystem you have access to in /var/tmp during the compute job.

At this stage, the out.png file is not on the Manta Object store. To get it there, I use the && mpipe construct. The mpipe command is a Manta version of pipe, and here I use it to direct the POSIX file output of the convert command /var/tmp/out.png to a Manta Object file named the same as the input *.jpg file but with a bit of Unix shell to replace the .jpg with .png.

Here ${MANTA_INPUT_OBJECT%.*} returns the original object file name with the .jpg part removed, and the .png appends the correct file name extension.
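Putting the pieces described above together, a job of this shape can be sketched as follows. Treat it as a sketch under assumptions rather than the exact command used for the timings in this post: manta_thumbnails is a hypothetical helper name, the mfind filters (-t o for objects, -n for a name pattern) are my choice, and feeding mpipe through stdin redirection is one way to hand it the file.

```shell
# Sketch: run the thumbnail conversion as a Manta map job over a directory.
# manta_thumbnails is a hypothetical helper; pass your own Manta directory.
manta_thumbnails() {
    dir="$1"    # for example /$MANTA_USER/public/getty
    # mfind lists the .jpg objects. mjob create reads those names on stdin,
    # runs the quoted map command once per object, and waits (-w) for the job.
    # Single quotes keep the variables unexpanded until the job runs on Manta.
    mfind -t o -n '\.jpg$' "$dir" |
        mjob create -w -m 'convert $MANTA_INPUT_FILE -thumbnail 10000@ -strip -quality 95 PNG8:/var/tmp/out.png && mpipe ${MANTA_INPUT_OBJECT%.*}.png < /var/tmp/out.png'
}
```

You would call it as manta_thumbnails /$MANTA_USER/public/getty, prefixed with time in bash if you want to compare against the local run.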

If you find these bits of shell commands confusing, try them out first like this:

$ export MIO=/foo/bar/12345.jpg
$ echo ${MIO%.*}.png
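A few more expansions of the same kind are sketched below, using a made-up path in place of the real $MANTA_INPUT_OBJECT value. The % form strips the shortest matching suffix and the ## form strips the longest matching prefix; both are standard POSIX shell parameter expansions.

```shell
# Hypothetical object path, standing in for $MANTA_INPUT_OBJECT inside a job.
MIO=/foo/bar/12345.jpg

echo "${MIO%.*}.png"     # drop the extension, add .png  -> /foo/bar/12345.png
echo "${MIO%.*}_s.jpg"   # same stem, different suffix   -> /foo/bar/12345_s.jpg
echo "${MIO##*/}"        # drop the directory part       -> 12345.jpg
```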

This is the timed session for the thumbnail creation job run on Manta, starting with the 0.25 Megapixel/50% quality JPEG versions of the Getty Open image set:

added 1000 inputs to 6bfce965-aa26-e7c8-8350-8101d41e995f
added 1000 inputs to 6bfce965-aa26-e7c8-8350-8101d41e995f
added 1000 inputs to 6bfce965-aa26-e7c8-8350-8101d41e995f
added 1000 inputs to 6bfce965-aa26-e7c8-8350-8101d41e995f
added 596 inputs to 6bfce965-aa26-e7c8-8350-8101d41e995f

real    1m5.881s
user    0m1.746s
sys     0m0.222s

$ mls -l /cwvhogue/public/getty | head
-rwxr-xr-x 1 cwvhogue         25720 Sep 20 10:09 00000201.jpg
-rwxr-xr-x 1 cwvhogue          4109 Sep 25 16:36 00000201.png
-rwxr-xr-x 1 cwvhogue         31520 Sep 20 10:09 00000301.jpg
-rwxr-xr-x 1 cwvhogue          4668 Sep 25 16:36 00000301.png

Using the real time values above, the run time drops from 313 seconds to 66 seconds. The user time for the Manta job reflects how much time my notebook spent on the job: 1.75 seconds.
While this isn't a huge case, Manta comes out 4.74x faster than my SSD-equipped notebook.
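The arithmetic behind that figure can be checked with a couple of small helpers. This is only a sketch: to_seconds and speedup are hypothetical names, and it assumes time values in the XmY.YYYs form shown above.

```shell
# Convert a time-style value such as 5m13.196s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two durations given in seconds, to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

to_seconds 5m13.196s   # 313.196 (local run)
to_seconds 1m5.881s    # 65.881  (Manta run)
speedup 313 66         # 4.74
```

The same helpers apply to any other pair of real times you collect with time.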

Sharing your Manta mjob with the world

There is a cool Gist-like feature of mjob: the mjob share view. The mjob share feature creates an HTML page that lists the mjob JSON equivalent, a sample of input and output files, and any errors that occurred. To share your mjob, simply substitute your job ID in the command below, and it will make the HTML file in your /public/jobshares directory on the Manta Storage Service. Then you can tweet it!

mjob share 6bfce965-aa26-e7c8-8350-8101d41e995f

creates this page:

The Big Resize Job - The Getty Open Content Originals

So let's look at the huge case, where I had to resize all the Getty originals: over 100GB of images, the same in number but much larger in size.

With the full-size Getty image set (the files are roughly 20MB each, but span sizes from 8MB to over 300MB, see Part 1), the ImageMagick convert command to make a single 0.25 Megapixel JPEG image from an original is this:

Here is the test file for the above example:

Preserving Color

Note that the -resize operation is surrounded by two -colorspace operations. This construct applies -resize in RGB space, then converts the result back to sRGB space. This preserves the image colors, which would otherwise be distorted if the resize were done in sRGB space. You can see on the right that dark areas are expanded and exaggerated by the resize operation without the -colorspace RGB transformation.

On the left is a close-up of the largest Getty Image resized after the -colorspace RGB transformation:

convert 00099001.jpg -colorspace RGB -resize 250000@ -colorspace sRGB -quality 80 00099001_RGB.jpg

On the right is the same image resized without altering the colorspace settings:

convert 00099001.jpg -resize 250000@  -quality 80 00099001_sRGB.jpg
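To see why the colorspace matters, consider what a resize does at its simplest: it averages neighboring pixels. The sketch below averages one black and one white pixel two ways. It assumes the standard sRGB transfer function and that ImageMagick's RGB colorspace is linear, which holds for recent ImageMagick 6 releases. The helper names avg_encoded and avg_linear are hypothetical, and a real resize filter is more elaborate than a two-pixel average.

```shell
# Average two sRGB-encoded values (0..1) directly on the encoded numbers,
# which is what resizing without the colorspace switch amounts to.
avg_encoded() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (a + b) / 2 }'
}

# Average the same two values in linear light: decode, average, re-encode.
avg_linear() {
    awk -v a="$1" -v b="$2" '
        function decode(c) { return (c <= 0.04045) ? c / 12.92 : ((c + 0.055) / 1.055) ^ 2.4 }
        function encode(l) { return (l <= 0.0031308) ? 12.92 * l : 1.055 * l ^ (1 / 2.4) - 0.055 }
        BEGIN { printf "%.3f\n", encode((decode(a) + decode(b)) / 2) }'
}

avg_encoded 0 1   # 0.500
avg_linear 0 1    # 0.735
```

On an 8-bit scale that is roughly 128 versus 188, so averaging the encoded values produces a noticeably darker pixel, consistent with the exaggerated dark areas in the image on the right.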

Running Hot With Local Resize of the Getty Originals

To process all the originals, I had to download them to my notebook, then run this command in the directory where I put them, again using the Unix find -exec construct.

So in the case of the originals, convert is run on each file, and the find construct processes the list of 4,596 images one at a time. This takes almost four hours to complete on my 2.66 GHz Intel Core i7 Mac notebook. I was not using the SSD drive this time because it did not have enough space. The job uses all four cores of the processor because convert is multithreaded. convert also reports errors on some of the image files.

Here is the time it took to run this resize job on 101.6GB of the original Getty Open Content image data on my notebook:

real    228m23.300s
user    541m11.777s
sys     20m11.780s

Here are the results:

-rw-r--r--     1 cwvhogue  admin   23987315 24 Aug 17:23 00000201.jpg
-rw-r--r--     1 cwvhogue  admin      48161 19 Sep 13:42 00000201_250.jpg
-rw-r--r--     1 cwvhogue  admin   22151707 22 Aug 16:06 00000301.jpg
-rw-r--r--     1 cwvhogue  admin      56954 19 Sep 13:42 00000301_250.jpg
-rw-r--r--     1 cwvhogue  admin   19408035 22 Aug 16:06 00000401.jpg
-rw-r--r--     1 cwvhogue  admin      44910 19 Sep 13:42 00000401_250.jpg
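A local run that produces files named like the ones listed above can be sketched as follows. The function name resize_originals is hypothetical, the convert arguments are taken from the color-preserving example earlier, and the exclusion of *_250.jpg is my addition so that a second run does not resize the resized copies.

```shell
# Sketch: resize every original JPEG under a directory to about 0.25 Megapixel.
# resize_originals is a hypothetical helper; the _250 suffix follows the
# file names in the listing above.
resize_originals() {
    dir="${1:-.}"
    # Skip earlier outputs, then run one convert per remaining file.
    find "$dir" -type f -name '*.jpg' ! -name '*_250.jpg' -exec sh -c '
        convert "$1" -colorspace RGB -resize 250000@ -colorspace sRGB \
            -quality 80 "${1%.jpg}_250.jpg"
    ' sh {} \;
}
```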

Feel like reproducing this yourself? Here is a Gist for downloading the Getty Open originals.

Resize of the Getty Originals on Joyent Manta

The Manta version of the ImageMagick resize command operating on the originals uses the same mfind mjob mpipe pattern as the thumbnail method above. The mjob runs the ImageMagick convert command with the color preservation arguments. Note that here I used --memory 2048 to raise the memory of the Manta Storage Service compute job to 2GB.

added 1000 inputs to 47193818-1aab-c64b-bdd6-9440c2c039c1
added 1000 inputs to 47193818-1aab-c64b-bdd6-9440c2c039c1
added 1000 inputs to 47193818-1aab-c64b-bdd6-9440c2c039c1
added 1000 inputs to 47193818-1aab-c64b-bdd6-9440c2c039c1
added 596 inputs to 47193818-1aab-c64b-bdd6-9440c2c039c1

real    3m11.674s
user    0m3.096s
sys     0m0.357s
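A job of the shape described above can be sketched like this. It is a sketch under assumptions, not necessarily the exact command behind these timings: manta_resize_originals is a hypothetical helper name, the _s.jpg suffix follows the output names in my art directory, and the all-digits name pattern is my way of keeping earlier _s.jpg outputs out of the input list.

```shell
# Sketch: resize the originals as a Manta map job with 2GB of memory per task.
# manta_resize_originals is a hypothetical helper; pass your own directory.
manta_resize_originals() {
    dir="$1"    # for example /$MANTA_USER/public/art
    # Only names made of digits plus .jpg are selected, so resized *_s.jpg
    # objects from an earlier run are not fed back into the job.
    mfind -t o -n '^[0-9]+\.jpg$' "$dir" |
        mjob create -w --memory 2048 -m 'convert $MANTA_INPUT_FILE -colorspace RGB -resize 250000@ -colorspace sRGB -quality 80 /var/tmp/out.jpg && mpipe ${MANTA_INPUT_OBJECT%.*}_s.jpg < /var/tmp/out.jpg'
}
```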

Again, I can share the results with you with mjob share:

mjob share 47193818-1aab-c64b-bdd6-9440c2c039c1

creates this page:

So the resize job on my Mac takes 13703 seconds and on Manta it takes 191 seconds.
Manta is 72x faster than my notebook at resizing 100GB of image data.

$ mls -l /cwvhogue/public/art | head
-rwxr-xr-x 1 cwvhogue      23987315 Aug 23 20:27 00000201.jpg
-rwxr-xr-x 1 cwvhogue         48157 Sep 25 17:11 00000201_s.jpg
-rwxr-xr-x 1 cwvhogue      22151707 Aug 23 20:27 00000301.jpg
-rwxr-xr-x 1 cwvhogue         56967 Sep 25 17:11 00000301_s.jpg
-rwxr-xr-x 1 cwvhogue      19408035 Aug 23 20:27 00000401.jpg
-rwxr-xr-x 1 cwvhogue         44910 Sep 25 17:11 00000401_s.jpg

Want to try to reproduce this yourself? This Gist will move some (default 20) or all of the Getty Originals into your own Manta account.

That concludes Part 3. In Part 4 I will roll out a parameterized shell script with a loop that does small, medium and large resize operations and directs the output into different Manta subdirectories. The script will also extract the XML metadata buried in the JPEG files. A MapReduce mjob will then use that metadata to craft a file of one-line annotations, one per image in the set, that you can search with grep.