Joyent Manta Storage Service: Image Manipulation and Publishing Part 4

So far in this blog series, I have covered the Getty Open Content image collection and some of the entry-level features of the Joyent Manta Storage Service for image manipulation, storage and web content publishing. Here I provide three advanced Manta job patterns and shell scripts that can be extended and reused to build your own image storage and processing systems.


Manta is an armada of storage servers with computrons - at the ready - to do your bidding and compute on your stored objects!

The Roman Fleet Victorious over the Carthaginians at the Battle of Cape Ecnomus; Gabriel Jacques de Saint-Aubin, French, 1724 - 1780; France, Europe; about 1763; Watercolor, gouache, pen and India and brown ink over black chalk; 21.5 x 39.6 cm (8 7/16 x 15 9/16 in.)

Manta Resolves the Dichotomy of Computing and Storage.

Manta is a web object store with built-in compute units. These are billed by the second, at a price of $0.00004/GB DRAM•sec (a worked cost example follows the feature list below). The default Manta compute unit is a SmartOS zone instance starting with 1GB of DRAM and 8GB of transient disk space. This smallest elementary Manta compute unit, which I am calling a computron, is the building block for general-purpose serial, parallel or MapReduce compute tasks run directly on the Manta object store. The high-level Manta computron features are:

UNIX Computing

  • Instant-on capability with root permissions
  • Runs arbitrary binary or run-time language code
  • Reads code from built-in packages or object store assets you provide

POSIX and Object Storage

  • Read-only file system access to specified objects in your Manta object store account
  • Read/Write access to a temporary POSIX filesystem, e.g. /var/tmp

Networking

  • Fast LAN access to write data onto the Manta object store via mput, mpipe, muntar
  • WAN access to pull/push information anywhere on the Internet, e.g. cURL
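
To make the per-second billing concrete, here is a worked cost example; the task count and duration are illustrative figures, not measurements:

    100 map tasks x 1 GB DRAM x 10 seconds each = 1,000 GB•sec
    1,000 GB•sec x $0.00004/GB•sec = $0.04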


Production-Grade Image Manipulation and Publishing

Here are three production-grade examples of how I direct Manta computrons to automate the upload, manipulation, metadata extraction, and publishing of the Getty Open Image originals using Unix shell scripts.

  1. Moving data to Manta. Get Manta to feed itself public or web-hosted data with "pull" mjob patterns.

  2. Automated Image Manipulation and Publishing. A production-ready shell script that outputs small/medium/large resized or converted images, or XMP metadata, to Manta subdirectories.

  3. MapReduce Metadata Consolidation. Automate the post-processing of output from a Manta MapReduce mjob to sort and post image-description one-liners directly into your /public Manta directory.

Moving Data to Manta


Computrons - Deploy the cURL Tractor Beam!


So the Getty Open image content set is 101.6GB of JPEG images in a total of 4,596 files. Simply transferring this information onto the Manta Storage Service takes some time. The fastest way is to use a Manta mjob script to pull the data into the object store like a tractor beam.

The Manta mjob script for pulling the Getty Originals into your own account, Manta_Getty_Originals.sh, works as follows:

I provide a list of the JPEG original filenames as filelist.txt, posted on Manta for public download. This small file is downloaded to your notebook and pushed onto your Manta account. The version pushed to your account becomes input for an mjob via the echo command, which sends the filelist.txt object as stdin into the mjob. The xargs command then breaks apart the file list, issuing a cURL command piped to an mput command for each filename in the list. These inner commands cURL each file one at a time, pulling it as a stream into the Manta compute zone, and mput immediately places the stream data onto your Manta object storage account. The pull approach works very well for retrieving web-hosted data sets.
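
A minimal sketch of this pull pattern looks like the following; the list location and destination directory are illustrative stand-ins, and the actual script adds more detail, such as limiting the pull to a subset of files:

    # Push the file list onto Manta (run from your notebook;
    # assumes one URL-safe filename per line, no spaces):
    mput -f filelist.txt /$MANTA_USER/stor/filelist.txt

    # Run an mjob with the list object as its input. Inside the compute
    # zone, xargs turns each filename into a cURL pull piped to mput:
    SITE="https://us-east.manta.joyent.com/mantademo/public/images/getty-open"
    DEST="/$MANTA_USER/public/originals"
    echo /$MANTA_USER/stor/filelist.txt | mjob create -o -m \
        "xargs -I{} sh -c 'curl -ksL ${SITE}/{} | mput ${DEST}/{}'"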

Another example of the pull approach is for muntar. If you issue muntar from your notebook, it runs on your notebook's CPU, reads the tar archive on your local disk, and pushes the extracted files onto Manta over the Internet one at a time. If that tar archive was itself a download, then you are in a slow, round-trip data movement situation. Instead you can use an mjob with cURL piped to mput to pull the tar archive into Manta, then run muntar within a second mjob, like so:

    SITE="https://us-east.manta.joyent.com/mantademo/public/images/getty-open"    DEST="/$MANTA_USER/public"    mjob create -w -r "curl -ksL ${SITE}/demojpgs.tar  | \    mput ${DEST}/demojpgs.tar" < /dev/null    echo ${DEST}/demojpgs.tar | mjob create -o -m \    'muntar -f $MANTA_INPUT_FILE /$MANTA_USER/public'

Pull strategies like this keep all those bits from travelling first to your notebook and then back to Manta.

So that, folks, is the Manta mjob, cURL, mput, muntar tractor-beam or pull pattern.
You will find it has the following advantages:

  1. Faster Internet connection within the Joyent Data Centre, compared to pulling to and pushing from your notebook.
  2. No additional instance provisioning, setup, or deletion steps required to pull web data onto the Manta Storage Service.
  3. Inbound bandwidth usage is free on Manta.
  4. No need to tie up your notebook or an instance for long data moves.



Automate Image Resize, Conversion, and Metadata Extraction


Computrons - Deploy the convert Bit Disruptor!


In Part 1, I set up the organization of the Getty Open Image set to be resized into the following subdirectory structure. I chose this structure so that the image size attribute is a property of the subdirectory name, not the filename, like so:

/mantademo/public/images/getty-open/500/*.jpg    small  - 0.25 Megapixel
/mantademo/public/images/getty-open/1000/*.jpg   medium - 1 Megapixel
/mantademo/public/images/getty-open/2000/*.jpg   large  - 4 Megapixel

I automated the population of all three of these directories with a parameterized shell script, manta_image_convert.sh, that outputs small/medium/large resized images into this directory hierarchy.

You can find manta_image_convert.sh here on GitHub:


https://github.com/cwvhogue/manta-getty/blob/master/manta_image_convert.sh


Feel free to fork and re-use it to populate your own Manta image subdirectories in the sizes and settings of your choice.

To run it, you need to have some of the Getty Originals on your account; to get them, you can run the Manta_Getty_Originals.sh script (the first script in this blog post, above). That will populate 20 originals into your username/public/originals subdirectory. If you want more, you can modify the script according to the comments.

To create three sets of directories with small, medium and large JPEG images, run:

     ./manta_image_convert.sh 1 3

To create a directory with the XMP metadata extracted from the images, run:

     ./manta_image_convert.sh 3 1

At the core of this script, lines 189-190, is an mjob that runs the ImageMagick convert program (as I showed in Part 3), wrapped in a number of shell variables that hold the arguments for resizing and the input and output directory and object information. A minimal sketch of that pattern, with illustrative variable names rather than the script's exact lines, looks like this:
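
    # Sketch of the core resize mjob; variable names and values are
    # illustrative, not the script's actual lines (see GitHub for those).
    # Assumes filenames without spaces.
    SIZE="500"                                          # target pixel width
    INDIR="/$MANTA_USER/public/originals"               # source objects
    OUTDIR="/$MANTA_USER/public/images/getty-open/500"  # resized output
    mfind -t o ${INDIR} | mjob create -w -m \
        "convert \$MANTA_INPUT_FILE -resize ${SIZE} /var/tmp/out.jpg && \
        mput -f /var/tmp/out.jpg ${OUTDIR}/\$(basename \$MANTA_INPUT_FILE)"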

This resize job is controlled by a loop that runs one, two or three iterations to make the small, medium and large resized image sets, using the size settings held in the shell variables and putting the resulting images in the corresponding subdirectories.

Before the resize loop runs, the original file names are captured into a list with an mjob that uses a reduce phase to put the output into a Manta object file called input_list.txt, on lines 132-135 of the script.

I will explain this a bit more in the next section, but here the script runs mfind in the Manta compute zone rather than on our notebook and directs its output to the POSIX file area at /var/tmp/out.txt. After that completes, the mjob runs mput, placing the POSIX file output of mfind onto Manta as a stored object. This mjob has /dev/null directed into it as input because it is a reduce-only phase without any piped input.

In general, when an mjob reduce phase has no input, /dev/null has to be explicitly directed into it like this, so that the job closes its input stream and starts the reduce phase:

    mjob create -w -r "mfind ~~/stor/foo > /var/tmp/foofiles.txt && \    mput -f /var/tmp/foofiles.txt ~~/stor/foofiles.txt" < /dev/null



MapReduce Metadata Consolidation


Computrons - POSIX data streams incoming, redirection shields up!


When I demonstrated MapReduce in Part 2, I admit the methodology looked clunky, with all the steps involved in pulling the results out by mjob identifier. Results were not sorted, and I had to use raw mjob outputs to find the output file buried in long hash-encoded Manta paths, then use mget to get the file out of private storage.

For the challenge of automating the creation of the one-line image descriptions from the XMP metadata, I wanted to set up a shell script to do this all automatically and put the resulting sorted one-line descriptions file into a public directory on Manta. Specifically, I wanted an automated MapReduce way to create this file:

/public/images/getty-open/130812_GettyOpen_descriptions.txt

So how did I get my reduce output (a) sorted and (b) placed neatly into my public directory, without hunting through the job output directory and all that mjob ID hash stuff?

Here is a shell script that creates the one-line Getty summaries using a MapReduce mjob pattern, showing how:


https://github.com/cwvhogue/manta-getty/blob/master/getty_oneliners.sh


To get Manta's mjob to do that work, take a look at the reduce phase in this script, on line 22. The key here is the change from just -r cat, which does not redirect output, to:

     -r "cat > /var/tmp/out.txt && sort -n /var/tmp/out.txt | mput ..."

This pattern again takes advantage of the writeable POSIX filesystem space within the Manta mjob, /var/tmp, to which we redirect the output of the mjob reduce phase using the > operator. The && operator executes the sort command only after the cat operation has finished writing the POSIX output file /var/tmp/out.txt. The sort output stream then gets piped as input to mput, placing the sorted output onto the Manta object store.
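
Put together, a complete job built on this pattern might look like the following sketch; the input directory, the map command, and the output object name are illustrative stand-ins, not the exact contents of getty_oneliners.sh:

    # Map each XMP metadata object to a one-line description, then sort
    # the combined map output and store it as a single Manta object:
    mfind -t o /$MANTA_USER/public/images/getty-open/xmp | \
        mjob create -w \
        -m 'grep "dc:description" $MANTA_INPUT_FILE' \
        -r 'cat > /var/tmp/out.txt && sort -n /var/tmp/out.txt | \
            mput /$MANTA_USER/public/descriptions.txt'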

Redirecting reduce output to the /var/tmp POSIX writeable space is very useful for transformation pipelines that can be created like the above example. Uploaded shell scripts can also be used as Manta assets for even more elaborate post-processing using this transient read-write space prior to a final mput onto Manta's object store.
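
As a sketch of that asset pattern (the script name, input directory, and output object here are hypothetical; your asset script would read the reduce input from stdin and do the elaborate post-processing):

    # Upload a post-processing script as a Manta object (hypothetical name):
    mput -f ./postprocess.sh /$MANTA_USER/stor/postprocess.sh

    # Reference it with -s; Manta mounts it in the compute zone under
    # /assets, where the reduce phase can invoke it on the piped-in data:
    mfind -t o /$MANTA_USER/stor/inputs | mjob create -w \
        -s /$MANTA_USER/stor/postprocess.sh \
        -r 'sh /assets/$MANTA_USER/stor/postprocess.sh > /var/tmp/out.txt && \
            mput -f /var/tmp/out.txt /$MANTA_USER/stor/processed.txt'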



Epilogue


With all that done, I did spend a moment looking for images with meme potential. I was disappointed to find only two images with cats mentioned in the image description. There are more non-feline images with the characters "cat" in their descriptions, including 8 with "Cathedral" and 20 with "Saint Catherine". And there are, oddly, seven blank-page images in the Getty Open Content set. You can find these string patterns and image names simply by using grep:

    grep "Blank Page" 130812_GettyOpen_descriptions.txt    grep "Cathedral" 130812_GettyOpen_descriptions.txt


Download 130812_GettyOpen_descriptions.txt here.


In previous installments of this blog series, Part 1 covered the Getty Open Content image set, hierarchical directories in Manta, and extracting the XMP-formatted metadata from inside the JPEG-formatted original files. References to WebP images there are incorrect, as explained in...

Part 2 was a set of worked examples for installing the Manta tools and learning basic Manta job patterns, including simple MapReduce analytics that extracted information about JPEG dimensions so that simple R commands let us quickly locate the widest and tallest images in the Getty set.

Part 3 provided worked examples of how to use ImageMagick resize commands on Manta to preserve color, and showed off the performance improvements of the Manta compute platform over a single multithreaded i7 CPU.

Follow Christopher Hogue @cwvhogue on Twitter


Unknown; Ghent (written), Belgium, Europe; 1469; Tempera colors, gold leaf, gold paint, silver paint, and ink on parchment; Leaf: 12.4 x 9.2 cm (4 7/8 x 3 5/8 in.);



Post written by Christopher Hogue, Ph.D.