Joyent Manta Storage Service: Image Manipulation and Publishing Part 2 - Analytics with MapReduce

September 24, 2013 - by Christopher Hogue, Ph.D.

In Part 1 of this series I introduced the Getty Open Content image collection.

In this blog, I will explain how you can use the Manta Storage Service to inspect and validate the 4,596 images in the set and extract the image pixel dimensions from within the JPEG files. And I will show you how I found this - the widest image in the Getty Open Content collection.


[Image: 10941201.jpg - the widest image in the Getty Open Content collection]

Figures Walking in a Parkland; Louis Carrogis de Carmontelle, French, 1717 - 1806; France, Europe; 1783 - 1800; Watercolor and gouache with traces of black chalk underdrawing on translucent Whatman paper; 47.3 x 377 cm (18 5/8 x 148 7/16 in.); 96.GC.20

This piece of artwork is one of the two extreme points on the graph of the dimensions of all the Getty Open Content images, after they have been area-normalized to 0.25 megapixels with ImageMagick.

[Figure: aspect ratio plot of the area-normalized image dimensions]

Deploying a Unix pipeline that computes on the stored objects you have put on the Manta Storage Service is straightforward. There is a mapping from a pipeline you construct on your notebook to the style of compute jobs Manta can process. In this blog we will look at how Manta's MapReduce features work by setting up an example on your local machine first.

This blog contains a How-To in four stages:

  1. Set up software on your local machine to mimic the Manta tools you will use.
  2. Walk through the steps of a simple MapReduce analysis on the Getty Open Content images on your local machine.
  3. Use the same MapReduce example on the Joyent Manta Storage Service.
  4. Retrieve the results, graph image size distribution, and find the widest and tallest image.

Installation List

For this blog you will need a Linux or Unix based computer with:

  • A Joyent account with your public/private keys installed
  • The Node.js based Manta Command Line Utilities
  • ImageMagick
  • cURL
  • R

You will need to set up a Joyent account for the Manta Storage Service.

If you are new to Joyent, there is a free trial link at the top right of this page that provides enough credit to work through the examples in this blog, and more.

The image manipulation tools are from the ImageMagick suite of command-line programs. The two key command line tools are programs called convert and identify. These are already installed on Manta, but for the first part, you will need them installed on your own machine.

Mac OS X Local Install

Manta's Command Line Utilities are Node.js programs, delivered and installed by the Node Package Manager npm.

Some of the Manta code requires Xcode to compile, so Xcode must be installed. You can check if Xcode is installed in a Terminal window with the command:

$ sudo xcode-select --version
xcode-select version 2311.

Not installed? Head over to the Mac OS X App Store and download the free Xcode developer tools.

Install and start Xcode and click on "Create a New Xcode Project" at the welcome screen, then cancel and quit Xcode. Run the above check again and it should be ready to go.

For a new Mac OS X install including Node.js, npm, node-manta and its dependencies, use the complete package here.

If you already have Node.js installed, follow the instructions here.

Mac OS X users should go to CRAN to get the R package.

To install ImageMagick and cURL you can use MacPorts or Joyent's pkgsrc package management system. Pkgsrc is a command line installation utility that Joyent uses for deploying packages to SmartOS and Mac OS X.

Instructions for installing pkgsrc are found here.

To install ImageMagick and curl with pkgin:

sudo pkgin install ImageMagick curl

Linux Local Install

Node.js for Linux can be installed from http://nodejs.org

The Manta Command Line Utilities are installed with:

npm install -g manta

Linux users can use their package management tools to install ImageMagick, cURL and R (e.g. apt-get on Ubuntu, yum on RedHat/Fedora/CentOS, zypper on SuSE).

Windows Local Install

You can install Node.js from http://nodejs.org and the Manta Command Line Utilities with:

npm install -g manta

Go to CRAN to get the R package. ImageMagick is here. cURL is here.

While the How-To is Unix based in its instruction syntax, Windows PowerShell supports pipes, and its ForEach-Object construct is similar to the find and xargs combinations in Unix. Windows users can follow along for the most part, and use Manta directly from the Windows command shell with the Manta Command Line Utilities.

Manta Environment Variables

The command line Manta tools need three environment variables. One is the fingerprint of your public SSH key, used for secure access. You can find these on the Joyent Dashboard page; they look like this:

export MANTA_URL=https://us-east.manta.joyent.com 
export MANTA_USER=your_user_name
export MANTA_KEY_ID=d0:4a:88:2a:f1:b3:2f:9b:57:09:c4:4b:83:1d:29:a7

Copy and paste these into the .bashrc file in your home directory using a text editor and save the file. Then source the .bashrc file:

source ~/.bashrc 

to set the environment variables. Check that they are set with:

$ echo $MANTA_USER 
your_user_name

Ready to go?

Once that is done, you can check to see that everything is installed by checking the versions. Note that you may not have exactly the same versions I show here.

$ convert --version
Version: ImageMagick 6.8.5-8 2013-07-12 Q16 http://www.imagemagick.org

$ curl --version
curl 7.31.0 (i386-apple-darwin10) libcurl/7.31.0 OpenSSL/1.0.1e zlib/1.2.8 libidn/1.27

$ R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"

$ node --version
v0.10.16

$ which mlogin
/usr/local/bin/mlogin

Download the Small WebP form of the Getty Images

This demo uses shrunken versions of the original Getty Open images, which total a little over 100GB of JPEG data and are a bit impractical for a short demonstration.

Once you have cURL installed, you can use it to download the tar archive of the small 0.25 megapixel Getty Open Content images:
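
The URL below follows the naming of my public Manta directory; substitute the archive link from Part 1 if it differs:

curl -O https://us-east.manta.joyent.com/cwvhogue/public/getty_webp.tar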

Then extract the images with the tar command:

tar -xf getty_webp.tar 

Using ImageMagick on your local system

Assuming you extracted the archive into your home directory, you can find the files by changing to this directory:

$ cd ~/var/tmp/500x500_webp
$ ls -al | head
total 346920
drwxr-xr-x  4599 cwvhogue  staff  156366 17 Sep 17:10 .
drwxr-xr-x     3 cwvhogue  staff     102 17 Sep 17:09 ..
-rw-r--r--     1 cwvhogue  staff   25720 10 Sep 15:14 00000201.webp
-rw-r--r--     1 cwvhogue  staff   31520 10 Sep 15:14 00000301.webp
-rw-r--r--     1 cwvhogue  staff   26567 10 Sep 15:14 00000401.webp
...

Including the README.txt file, there are a total of 4,597 files in the directory.

Validating Image Data

Now we can use ImageMagick identify to check the image data format and report the dimensions and file size of each image:

identify *.webp > local_identify.txt

This outputs a file with one line per image like this:

$ head local_identify.txt 
00000201.webp JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.009
00000301.webp[1] JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
00000401.webp[2] JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
00000501.webp[3] JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.000
00000601.webp[4] JPEG 518x483 518x483+0+0 8-bit sRGB 28.5KB 0.000u 0:00.000
... 

The identify program interrogates the graphic format by looking inside the image. In this example, the image files have the .webp extension, but the internal encoding is reported as JPEG.

Uh oh, these files are NOT webp formatted.

They are JPEG files with a .webp extension, and not what I intended to make and post in the previous blog. These files were created with the ImageMagick command called like this:
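
The command was of roughly this form (the exact invocation appeared in Part 1; the filename here is illustrative):

# Resize to 0.25 megapixels at quality 50; the .webp output name
# was meant to select WebP encoding
convert 10941201.jpg -resize 250000@ -quality 50 10941201.webp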

ImageMagick in this case was not linked to the library it requires for WebP conversion. Yes, I will have to go back and fix Part 1.

The convert program blissfully ignored the requested WebP output format and simply carried out the conversion, making a JPEG file reduced to 0.25 megapixels with a quality setting of 50.

ImageMagick supports many graphical file image types and does not check that the internal encoding matches the filename extension. So it happily and silently spits out a file that is JPEG encoded internally with a .webp extension. The files will open in Chrome from the web link, but not from Firefox or Safari, because the .webp extension is what signals the image type to the browser. Rename the files to .jpg and they will open.

So let's correct this problem and rename the files to *.jpg on our local system.

Wild Cards

Simple wild-cards do not work for renaming on Unix. There is no built-in command to rename *.webp to *.jpg. Wildcard expansion also has limits: on older Unix systems the argument list the shell could pass to a command held only so many file names. I have hit this wall while making hundreds of thousands of files in a Mac OS X directory.

The one-liner to rename these files combines the Unix find and mv commands. It extracts each file name with the dirname and basename utilities, which are resolved first inside the backtick quotations.

The find command has an -exec capability. That is, it can execute a shell running any Unix command specified with -c. Here it executes mv on every file it finds with the .webp extension.
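
A one-liner of roughly this form does it:

# Rename each *.webp file to *.jpg in place
find . -name '*.webp' -exec sh -c 'mv "$1" "`dirname $1`/`basename $1 .webp`.jpg"' sh {} \;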

This type of command pattern is similar to what you will find when using the mfind command, which finds objects in the Manta directory hierarchy.

Let's go back to the identify command for a moment and I will show you a simple way to do a one-machine MapReduce construct on your notebook. This will help you understand how things work on Manta, and show how you can mock up a job on Unix and move it onto the Manta Storage Service for computing.

The One Machine Unix Map
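
A sketch of the map phase, built from the same find pattern as the rename above:

# Map: run identify on each image, writing one small .id file per image
find . -name '*.jpg' -exec sh -c 'identify "$1" > "`basename $1 .jpg`.id"' sh {} \;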

This runs identify separately on each image and outputs one file per image containing the one-line output of identify:

$ ls -al *.id | head
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000201.id
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000301.id
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000401.id
-rw-r--r--  1 cwvhogue  staff  74 20 Sep 08:49 00000501.id
...

$ more 00000201.id
./00000201.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.010u 0:00.000

So now you have all the outputs.

The One Machine Unix Reduce

We can do a reduce phase like so:
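
A sketch of the reduce:

# Reduce: gather every one-line .id file into a single output file
find . -name '*.id' | xargs cat > mr_identify.txt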

This finds all 4,596 *.id files we made in the Map phase and passes their names to the Unix xargs command. The xargs command doles out the filenames as arguments to the cat command, which concatenates their contents into a single output file.

Voila, reduced! All of those small one line outputs from identify are collected into a single file.

$ head mr_identify.txt
./00000201.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.010u 0:00.000
./00000301.jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.009
./00000401.jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
...

Now you have a file that looks a lot like the one previously made. Recall the wild card version we did earlier? Run it again over the name-corrected JPEG files with:

identify *.jpg > local_identify.txt

And now compare the two:

$ cat local_identify.txt | wc -l
    4596
$ cat mr_identify.txt | wc -l
    4596
$ head local_identify.txt 
00000201.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.009
00000301.jpg[1] JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
00000401.jpg[2] JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
00000501.jpg[3] JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.000
...

$ head mr_identify.txt 
./00000201.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.010u 0:00.000
./00000301.jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.009
./00000401.jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.000u 0:00.000
./00000501.jpg JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.009
...

So if you are with me so far, you can now see how MapReduce works on a single machine using simple Unix commands, and get an idea of how the Unix find command stands in for wildcard expansion.

Certainly the wildcard version is faster on your notebook than the MapReduce construct. But this is not a Big Data example, and Big Data is where the MapReduce approach begins to matter.

The Manta MapReduce Version - Computing on the Object Store

So not only is Manta an object store, it is also a massive distributed computer!

Let me show you how to take what you just did on your notebook and run it on the Manta Storage Service.

First, let's put the data up onto Manta, starting with a tar archive of the JPEG images with the proper .jpg filenames you just made above, using these commands:

mkdir getty
mv *.jpg getty
tar -cvf getty.tar getty
mput -f getty.tar /$MANTA_USER/stor

Depending on your upload bandwidth this may take a while. This is from my home internet provider:

$ mput -f getty.tar /$MANTA_USER/stor
...ue/stor/getty.tar [=======================>] 100% 163.95MB  55.27KB/s 50m37s

While you are waiting, let me cover some of the basics about Manta's default directories.

Manta provides a number of Command Line Utilities that resemble familiar Unix commands, including mls, mfind and muntar.

The Manta directories that come with your account are as follows:

/manta-username/stor
/manta-username/public
/manta-username/reports
/manta-username/jobs

Your private data goes into stor. Public data that is accessible to anyone on the web is published by simply putting it into public. The reports directory provides you with access logs and billing information.

The jobs directory holds information about compute jobs you run on Manta with the mjob command, including tracking information for every job, job outputs, and any error messages programs emit while they run.
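
You can list them with mls; the output should look like this:

$ mls /$MANTA_USER/
jobs/
public/
reports/
stor/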


Once the upload is complete, the mput command will have copied getty.tar from your computer into the stor directory of your Manta account.

$ mls /$MANTA_USER/stor
getty.tar

Now we will unpack the archive and create objects in /$MANTA_USER/public:
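
A sketch of the command, running the muntar utility server-side as a Manta job (this follows the pattern in the Manta docs; substitute your own user name for cwvhogue, and -o waits for the job and prints its outputs):

$ echo /$MANTA_USER/stor/getty.tar | \
    mjob create -o -m 'muntar -f $MANTA_INPUT_FILE /cwvhogue/public'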

added 1 input to d228bed1-a4f4-47c9-b7bf-44e614d654d8
...
/cwvhogue/public/getty/33426901.jpg
/cwvhogue/public/getty/33425101.jpg
/cwvhogue/public/getty/32345701.jpg
/cwvhogue/public/getty/32346001.jpg
/cwvhogue/public/getty/00574601.jpg

This creates the /$MANTA_USER/public/getty subdirectory and makes an object on the Manta Storage Service from each file stored in the getty.tar archive.

You can see them in any browser like this; they are now public and accessible to anyone:

https://us-east.manta.joyent.com/cwvhogue/public/getty/10941201.jpg

Importantly, you have just distributed copies of your objects onto more than one high-performance multicore server in the Joyent Manta Storage Service, where they are set up for fast computing without moving the stored objects.

Now we are ready to do the validation with ImageMagick identify on Manta.

Recall the One Machine MapReduce example - the identify command was run on each file on your notebook, making lots of one-line files.

Then cat was used to collect these all into one file. Both of these were initiated from the find command.

On Manta, the MapReduce process equivalent to the find and identify combination starts with the mfind command. The mjob command is like the -exec part of Unix find, but with distributed computing SUPERPOWERS.

Go ahead and try it:
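
A sketch of the one-liner (the map and reduce phases match the job detail shown further below; -w waits for the job to finish):

$ mfind -n "jpg$" /$MANTA_USER/public/getty | \
    mjob create -w -m 'identify $MANTA_INPUT_FILE' -r 'cat'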

b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 596 inputs to b1011a15-64a8-463d-8329-de1909c149fb
added 1000 inputs to b1011a15-64a8-463d-8329-de1909c149fb
mjob: error: job b1011a15-64a8-463d-8329-de1909c149fb had 1 error

Here mfind is the Manta version of the Unix find command. The pattern -n "jpg$" is a JavaScript regular expression, as mfind is a Node.js application. The results of mfind are returned to your notebook and piped via the Unix | pipe into the mjob command.

The mjob command takes this list of Manta objects and executes the identify command by distributing it across the Manta compute nodes. The MapReduce map phase command comes after the -m flag.

The reduce phase is again the cat command, just as in the One Machine Reduce example above, and it comes after the reduce phase flag, -r.

The job number in this case is b1011a15-64a8-463d-8329-de1909c149fb. You can retrieve the job details with (use your own job number):

$ mjob get  b1011a15-64a8-463d-8329-de1909c149fb
{
  "id": "b1011a15-64a8-463d-8329-de1909c149fb",
  "name": "",
  "state": "done",
  "cancelled": false,
  "inputDone": true,
  "stats": {
    "errors": 1,
    "outputs": 1,
    "retries": 0,
    "tasks": 4597,
    "tasksDone": 4597
  },
  "timeCreated": "2013-09-20T17:49:05.830Z",
  "timeDone": "2013-09-20T17:52:05.926Z",
  "timeArchiveStarted": "2013-09-20T17:55:48.277Z",
  "timeArchiveDone": "2013-09-20T17:52:14.489Z",
  "phases": [
    {
      "exec": "identify $MANTA_INPUT_FILE",
      "type": "map"
    },
    {
      "exec": "cat",
      "type": "reduce"
    }
  ],
  "options": {}
}

The output of cat is stored in your jobs directory.

You can retrieve the output object store key with the command:

$ mjob outputs  b1011a15-64a8-463d-8329-de1909c149fb
/cwvhogue/jobs/b1011a15-64a8-463d-8329-de1909c149fb/stor/reduce.1.124d6c06-07ba-4481-862a-31d68b530e6f

Then you can pull the output file down to your computer for inspection with mget, like so:

$ mget /cwvhogue/jobs/b1011a15-64a8-463d-8329-de1909c149fb/stor/reduce.1.124d6c06-07ba-4481-862a-31d68b530e6f > manta_identify.txt
...862a-31d68b530e6f [=======================>] 100% 452.30KB  

(Yes, there is one error in this job; ImageMagick has difficulties with 00082001.jpg - looking into it...)

Now you can see the results of the one-line MapReduce job you just performed on Manta.

$ head manta_identify.txt 
/manta/cwvhogue/public/getty/18219301.jpg JPEG 405x617 405x617+0+0 8-bit sRGB 33KB 0.000u 0:00.000
/manta/cwvhogue/public/getty/00441401.jpg JPEG 418x599 418x599+0+0 8-bit sRGB 44.5KB 0.000u 0:00.000
/manta/cwvhogue/public/getty/00070501.jpg JPEG 572x437 572x437+0+0 8-bit sRGB 32.9KB 0.000u 0:00.000
/manta/cwvhogue/public/getty/00028701.jpg JPEG 576x434 576x434+0+0 8-bit sRGB 27.8KB 0.000u 0:00.000
/manta/cwvhogue/public/getty/00090401.jpg JPEG 622x402 622x402+0+0 8-bit sRGB 25.4KB 0.000u 0:00.000

The MapReduce results are concatenated in the order they appear in the mjob distributed compute queue, so they are not sorted.

You can sort the results by stripping off the leading directory name with sed 's/\/manta\/cwvhogue\/public\/getty\///' and then sorting by number, like so:

$ cat manta_identify.txt | sed 's/\/manta\/cwvhogue\/public\/getty\///' | sort -n > manta_identify_sorted.txt

$ head manta_identify_sorted.txt 
00000201.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.000
00000301.jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
00000401.jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.010u 0:00.000
00000501.jpg JPEG 414x604 414x604+0+0 8-bit sRGB 18.3KB 0.000u 0:00.000
00000601.jpg JPEG 518x483 518x483+0+0 8-bit sRGB 28.5KB 0.000u 0:00.000
00000701.jpg JPEG 408x613 408x613+0+0 8-bit sRGB 27.8KB 0.000u 0:00.009

Graphing Image Size Distribution in R

In Part 1 I made a graph in R of the image size distribution of the original Getty images. I will show you how I did this using the manta_identify_sorted.txt file (or, if you skipped that part, you can use the mr_identify.txt file).

And since the data we have includes the width and height, we will also use R to plot the aspect ratio across the image set, and retrieve the widest and tallest images from within the set. For this I use awk, which interprets the columns of the input file according to the numbered variables shown below.

00000201.jpg JPEG 372x672 372x672+0+0 8-bit sRGB 25.7KB 0.000u 0:00.000
00000301.jpg JPEG 598x418 598x418+0+0 8-bit sRGB 31.5KB 0.000u 0:00.000
00000401.jpg JPEG 417x600 417x600+0+0 8-bit sRGB 26.6KB 0.010u 0:00.000
|----$1----| |$2| |--$3-| |---$4----| |-$5| |$6| |-$7-| |-$8-| |--$9--|

To convert this to a .csv comma separated value file, which you can read into R (or a spreadsheet program), I use awk to extract the columns with the filename, the image pixel dimensions and the file size. Then I use a number of sed commands to add spaces in between the number and units of the size, i.e. 25.7KB becomes 25.7 KB and 372x672+0+0 becomes 372x672 and then 372 672. Finally awk prints each of these tidied up lines with commas between them.
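
Here is a pipeline of that form, a sketch starting from manta_identify_sorted.txt:

# columns: filename ($1), WxH+0+0 ($4), size ($7) -> "size, KB, filename, width, height"
cat manta_identify_sorted.txt | \
  awk '{print $1, $4, $7}' | \
  sed 's/+0+0//' | \
  sed 's/x/ /' | \
  sed 's/KB/ KB/' | \
  awk '{print $4 ", " $5 ", " $1 ", " $2 ", " $3}' | \
  sort -n > Getty_Filesizes.csv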

If you are starting from the mr_identify.txt file you made on your notebook, use this (the same sketch, with one extra sed to strip the leading ./ from each name):
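
# same pipeline, with a leading sed to strip the ./ prefix from each name
cat mr_identify.txt | \
  sed 's|^\./||' | \
  awk '{print $1, $4, $7}' | \
  sed 's/+0+0//; s/x/ /; s/KB/ KB/' | \
  awk '{print $4 ", " $5 ", " $1 ", " $2 ", " $3}' | \
  sort -n > Getty_Filesizes.csv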

This is what your Getty_Filesizes.csv file should look like:

$ head Getty_Filesizes.csv 
11.7, KB, 03816901.jpg, 399, 626
12.5, KB, 04259501.jpg, 446, 561
12.5, KB, 04663401.jpg, 508, 492
12.9, KB, 04704801.jpg, 588, 425
12.9, KB, 13371601.jpg, 570, 439
13, KB, 00807301.jpg, 344, 726
13, KB, 03817001.jpg, 398, 628
13, KB, 05985201.jpg, 568, 440
13.6, KB, 06845101.jpg, 543, 460
13.7, KB, 03816701.jpg, 399, 626

All sizes should be in KB, so this command should return nothing - it uses grep to find any lines without the KB string:

grep -v -e 'KB' Getty_Filesizes.csv 

Now we are ready to load this file into R. Here is the R session I used to make the graphs. The lines to type start with the R prompt >.

$ R

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
...

First we will load in the Getty_Filesizes.csv file and look at it with the R ls() and head() commands:

> getty_sizes<-read.csv(header=FALSE, "Getty_Filesizes.csv")
> ls()
[1] "getty_sizes"
> head(getty_sizes)
    V1  V2            V3  V4  V5
1 11.7  KB  03816901.jpg 399 626
2 12.5  KB  04259501.jpg 446 561
3 12.5  KB  04663401.jpg 508 492
4 12.9  KB  04704801.jpg 588 425
5 12.9  KB  13371601.jpg 570 439
6 13.0  KB  00807301.jpg 344 726

Then we can plot a histogram of the file size distribution, with these two commands:

> hist(getty_sizes[,1],breaks=1000, main="Getty Open Image Size at 0.25 Megapixel", xlab="KB")
> rug(getty_sizes[,1])

[Figure: histogram of Getty Open image sizes at 0.25 megapixel, with a rug plot along the KB axis]

I was interested in finding the Getty Open Content images with the most extreme aspect ratios, really long or really tall graphics.

So with the information in the file, we can make an x-y plot of the width and height of all the images. These are all approximately the same area in number of pixels, so we get a nice smooth curve. Images with dimensions 500 x 500 will be perfectly square.

> plot(getty_sizes[,5] ~ getty_sizes[,4], main="Aspect Ratio", xlab="width (pixels)", ylab="height (pixels)")

[Figure: Aspect Ratio scatter plot of height against width for all 4,596 images]

There are outliers at either end of the graph! They can be retrieved using R's which.max() command, which returns the array index of the maximum value in a vector.

For the widest image shown at the top of the blog post:

> getty_sizes[which.max(getty_sizes[,4]),]
     V1  V2            V3   V4  V5
4216 54  KB  10941201.jpg 1473 170

And for the tallest image:

> getty_sizes[which.max(getty_sizes[,5]),]
       V1  V2            V3  V4  V5
2027 34.6  KB  00066101.jpg 265 943

[Image: 00066101.jpg - the tallest image in the collection]

That wraps up Part 2. In Part 3 I will go over image resizing and show you just how fast Manta is on the Getty Open Content originals compared to running the same conversion on my notebook. After that I will cover how to extract the XML metadata buried inside the Getty Open Content images and MapReduce it into a file of one-line image descriptions.

If you are interested in learning more about R, see my video on Simple Graphing in R.

Venus on the Waves; François Boucher, French, 1703 - 1770; 1769; Oil on canvas; Unframed: 265.7 x 76.5 cm (104 5/8 x 30 1/8 in.), Framed: 273.1 x 86.7 x 6.4 cm (107 1/2 x 34 1/8 x 2 1/2 in.); 71.PA.54
