Joyent Manta Storage Service: Image Manipulation and Publishing Part 4
So far in this blog series, I have covered the Getty Open Content image collection and some of the entry-level features of the Joyent Manta Storage Service for image manipulation, storage and web content publishing. Here I provide three advanced Manta job patterns and shell scripts that can be extended and reused to build your own image storage and processing systems.
Manta is an armada of storage servers with computrons - at the ready - to do your bidding and compute on your stored objects!
The Roman Fleet Victorious over the Carthaginians at the Battle of Cape Ecnomus; Gabriel Jacques de Saint-Aubin, French, 1724 - 1780; France, Europe; about 1763; Watercolor, gouache, pen and India and brown ink over black chalk; 21.5 x 39.6 cm (8 7/16 x 15 9/16 in.)
Manta Unifies the Dichotomy of Computing and Storage.
Manta is a web object store with built-in compute units, billed by the second at a price of $0.00004/GB DRAM•sec. The default Manta compute unit is a SmartOS zone instance starting with 1GB of DRAM and 8GB of transient disk space. This smallest elementary Manta compute unit, which I am calling a computron, is the building block for general-purpose computing for serial, parallel or MapReduce compute tasks directly on the Manta object store. The high-level Manta computron features are:
UNIX Computing
- Instant-on capability with root permissions
- Runs arbitrary binary or run-time language code
- Reads code from built-in packages or from object store assets you provide
POSIX and Object Storage
- Read-only file system access to specified objects in your Manta object store account
- Read/Write access to a temporary POSIX filesystem, e.g. /var/tmp
Networking
- Fast LAN access to write data onto the Manta object store via mput, mpipe, muntar
- WAN access to pull/push information anywhere on the Internet, e.g. cURL
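As a tiny illustration of these features, a one-line job can run a stock Unix command directly against a stored object (the object path here is a hypothetical example):

```shell
# Run 'wc -w' inside a Manta compute zone against a stored object; the
# object's contents arrive on the map phase's stdin and the word count
# comes back as the job's output.
echo /$MANTA_USER/stor/notes.txt | mjob create -o -m 'wc -w'
```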
Production-Grade Image Manipulation and Publishing
Here are three production-grade examples of how I direct Manta computrons to automate the upload, manipulation, extraction of metadata, and publishing of the Getty Open Image originals using Unix shell scripts.
1. Moving Data to Manta. Get Manta to feed itself public or web-hosted data with "pull" mjob patterns.
2. Automated Image Manipulation and Publishing. A production-ready shell script that outputs small/medium/large resized or converted images, or XMP metadata, to Manta subdirectories.
3. MapReduce Metadata Consolidation. Automate the post-processing of output from a Manta MapReduce mjob to sort and post image-description one-liners directly into your /public Manta directory.
Moving Data to Manta
Computrons - Deploy the cURL Tractor Beam!
So the Getty Open image content set is 101.6GB of JPEG images, a total of 4,596 files. Simply transferring this information onto the Manta Storage Service takes some time. The fastest way is to use a Manta mjob script to pull the data into the object store like a tractor beam.

The Manta mjob for pulling the Getty Originals into your own account is here:
I provide a list of the JPEG original filenames as filelist.txt, posted up on Manta for public download. This small file is downloaded to your notebook and pushed onto your Manta account. The version pushed to your account becomes input for an mjob via the echo command, which sends the filelist.txt object as stdin into the mjob. Then the xargs command is used to break apart the file list, so that xargs issues cURL commands piped to mput commands for each of the filenames in the list. These inner commands cURL each file one at a time, pulling it as a stream into the Manta compute zone, and mput the stream data immediately onto your Manta object storage account. The pull approach works very well for retrieving web-hosted data sets.
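That inner fan-out can be sketched locally. In this minimal illustration the curl-pipe-mput pipeline is replaced by a plain echo so the control flow runs anywhere; inside the real mjob you would substitute the actual commands, and the sample filenames are hypothetical:

```shell
#!/bin/sh
# The job input: one JPEG filename per line (hypothetical names). In a
# real map phase this list arrives on stdin from the filelist.txt object.
printf '00094701.jpg\n00094801.jpg\n' > /tmp/filelist.txt

# xargs breaks the list apart and issues one fetch-and-store pipeline per
# name. The echo stands in for: curl -ksL $SITE/{} | mput $DEST/{}
xargs -n1 -I{} echo 'curl -ksL $SITE/{} | mput $DEST/{}' < /tmp/filelist.txt
```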
Another example of the pull approach is for muntar. If you issue muntar from your notebook, it runs on your notebook's CPU, extracts the tar archive on your local disk, and pushes the files onto Manta over the Internet one at a time. If that tar archive was itself a download, then you are in a slow, round-trip data movement situation. Instead you can use an mjob with cURL piped to mput to pull the tar archive into Manta, then run muntar within a second mjob, like so:
```shell
SITE="https://us-east.manta.joyent.com/mantademo/public/images/getty-open"
DEST="/$MANTA_USER/public"

mjob create -w -r "curl -ksL ${SITE}/demojpgs.tar | \
    mput ${DEST}/demojpgs.tar" < /dev/null

echo ${DEST}/demojpgs.tar | mjob create -o -m \
    'muntar -f $MANTA_INPUT_FILE /$MANTA_USER/public'
```
Pull strategies like this prevent all those bits from travelling first to your notebook and then back to Manta.
So that, folks, is the Manta mjob, cURL, mput, muntar tractor-beam or pull pattern.
You will find it has the following advantages:
- Faster Internet connection within the Joyent Data Centre, compared to pulling to and pushing from your notebook.
- No additional instance provisioning, setup, deletion steps required to pull web data onto Manta Storage Service.
- Inbound bandwidth usage is free on Manta.
- No need to tie up your notebook or an instance for long data moves.
Automate Image Resize, Conversion, and Metadata Extraction
Computrons - Deploy the convert Bit Disruptor!
From Part 1, I set up the organization of the Getty Open Image set to be resized into the following subdirectory structure. I chose this structure so that the image size attribute is a property of the subdirectory name, not the filename, like so:

/mantademo/public/images/getty-open/500.jpg    (small - 0.25 Megapixel)
/mantademo/public/images/getty-open/1000.jpg   (medium - 1 Megapixel)
/mantademo/public/images/getty-open/2000.jpg   (large - 4 Megapixel)
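If you are building this hierarchy on your own account, the size-named subdirectories can be created up front with mmkdir; this is a sketch under the assumption that your account mirrors the layout above:

```shell
# Create the three size-named subdirectories before populating them.
# mmkdir -p creates intermediate directories as needed, like mkdir -p.
for d in 500.jpg 1000.jpg 2000.jpg; do
    mmkdir -p /$MANTA_USER/public/images/getty-open/$d
done
```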
I automated the population of all three of these directories with a parameterized shell script, manta_image_convert.sh, that outputs small/medium/large resized images into this directory hierarchy. You can find manta_image_convert.sh here on GitHub:

https://github.com/cwvhogue/manta-getty/blob/master/manta_image_convert.sh

Feel free to fork and re-use it to populate your own Manta image subdirectories in the sizes and settings of your choice.
To run it, you need to have some of the Getty Originals on your account; you can run the Manta_Getty_Originals.sh script (the first script in this blog post, above) to populate 20 originals into your username/public/originals subdirectory. If you want more, you can modify the script according to the comments.
To create 3 sets of directories with small, medium and large JPEG images, run:

```shell
./manta_image_convert.sh 1 3
```

To create a directory with the XMP metadata extracted from the images, run:

```shell
./manta_image_convert.sh 3 1
```
At the core of this script, lines 189-190, is an mjob that runs the ImageMagick convert program (as I showed in Part 3), wrapped in a number of shell variables that hold the arguments for resizing and the input and output directory and object information. Here are those lines:
This resize job is controlled by a loop that runs 1, 2 or 3 iterations to make small, medium and large resized image sets using the specific size settings found in the shell variables, and putting the resulting images in specific subdirectories.
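Since the script is linked rather than inlined here, the shape of that core can be sketched as follows. This is a hypothetical reconstruction, not the exact lines from manta_image_convert.sh; the variable names and the 500x500 geometry are illustrative, and $MANTA_INPUT_FILE / $MANTA_INPUT_OBJECT are the local file path and object path Manta provides to each map task:

```shell
# Illustrative sketch of the resize core: each map task converts one
# original with ImageMagick into /var/tmp, then mputs the result into the
# size-named subdirectory. SIZE and OUTDIR change on each loop iteration.
SIZE="500x500"
OUTDIR="/$MANTA_USER/public/images/getty-open/500.jpg"

mfind -t o "/$MANTA_USER/public/originals" | mjob create -o -m \
  "convert \$MANTA_INPUT_FILE -resize ${SIZE} /var/tmp/out.jpg && \
   mput -f /var/tmp/out.jpg ${OUTDIR}/\$(basename \$MANTA_INPUT_OBJECT)"
```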
Before the resize loop is run, the original file names are captured into a list with an mjob that uses a reduce phase to put the output into a Manta object file called input_list.txt. This is on lines 132-135, which are shown here:
I will explain this a bit more in the next section, but here the script runs mfind on the Manta compute zone rather than on our notebook and directs its output to the POSIX file area /var/tmp/out.txt. After that is complete, the mjob runs mput, placing the POSIX file output of mfind onto Manta as a stored object. This mjob has /dev/null directed into it as input, as it is a reduce-only phase without any piped input.
In general, when an mjob reduce phase has no input, it has to be explicitly given /dev/null like this so that it will close its input stream and start the reduce phase:
```shell
mjob create -w -r "mfind ~~/stor/foo > /var/tmp/foofiles.txt && \
    mput -f /var/tmp/foofiles.txt ~~/stor/foofiles.txt" < /dev/null
```
MapReduce Metadata Consolidation
Computrons - POSIX data streams incoming, redirection shields up!
When I demonstrated MapReduce in Part 2, I admit the methodology looked clunky, with all the steps involved in pulling the results out by mjob identifier. Results were not sorted, and I had to use raw mjob outputs to find the output file buried in some long hash-encoded Manta paths, and then mget to get the file out of private storage.
For the challenge of automating the creation of the one-line image descriptions from the XMP metadata, I wanted to set up a shell script to do this all automatically and put the resulting sorted one-line descriptions file into a public directory on Manta. Specifically, I wanted an automated MapReduce way to create this file:

/public/images/getty-open/130812_GettyOpen_descriptions.txt
So how did I get my reduce output (a) sorted and (b) placed neatly into my public directory without hunting through the job output directory and all that mjob ID hash stuff?
Here is a shell script that creates the one-line Getty summaries using a MapReduce mjob pattern that shows how:

https://github.com/cwvhogue/manta-getty/blob/master/getty_oneliners.sh
To get Manta's mjob to do that work, take a look at the reduce phase in this example on line 22. The key here is the change from just -r cat, which does not redirect output, to:

```shell
-r "cat > /var/tmp/out.txt && sort -n /var/tmp/out.txt | mput ..."
```
This pattern again takes advantage of the POSIX-writeable filesystem space within the Manta mjob, in /var/tmp, to which we have redirected the output of the mjob reduce phase using the > operator. The && operator executes the sort command only after the cat operation has finished writing the POSIX output file /var/tmp/out.txt. Then the sort output stream gets piped as input to mput, placing the sorted output onto the Manta object store.
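The redirect-then-sort sequencing can be tried locally. In this minimal sketch the final mput stage is replaced by plain stdout so the pattern runs anywhere; everything before that stage has the same shape the reduce phase uses:

```shell
#!/bin/sh
# Simulated reduce input: unsorted one-liners arriving on the phase's
# stdin. cat writes them to the POSIX temp file; && guarantees sort only
# starts after the file is completely written. In the real job the sorted
# stream is then piped to mput instead of printed.
printf '3 zebra\n1 apple\n2 mango\n' | {
    cat > /var/tmp/out.txt && sort -n /var/tmp/out.txt
}
```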
Redirecting reduce output to the /var/tmp POSIX-writeable space is very useful for transformation pipelines like the example above. Uploaded shell scripts can also be used as Manta assets for even more elaborate post-processing in this transient read-write space prior to a final mput onto Manta's object store.
Epilogue
With all that done, I did spend a moment looking for images with meme potential. I was disappointed to find only two images with cats mentioned in the image description. There are more non-feline images with the characters "cat" in them, including 8 with "Cathedral" and 20 with "Saint Catherine". And there are, oddly, seven blank-page images in the Getty Open Content set. You can find these string patterns and image names simply using grep:
```shell
grep "Blank Page" 130812_GettyOpen_descriptions.txt
grep "Cathedral" 130812_GettyOpen_descriptions.txt
```
Download 130812_GettyOpen_descriptions.txt here.
In previous installments of this Blog series, Part 1 covered the Getty Open Content Image set, hierarchical directories in Manta and extracting the XMP formatted metadata from inside the JPEG formatted original files. References to WebP images are incorrect as explained in...
Part 2 was a set of worked examples for installing Manta and learning basic Manta job patterns, including simple MapReduce analytics that extracted information about JPEG dimensions so that simple R commands let us quickly locate the widest and tallest images in the Getty set.
Part 3 provided worked examples of how to use ImageMagick resize commands on Manta to preserve color, and showed off the performance improvements of the Manta compute platform over a single multithreaded i7 CPU.
Follow Christopher Hogue @cwvhogue on Twitter
Post written by Christopher Hogue, Ph.D.