Joyent Manta Storage Service: Image Manipulation and Publishing

September 12, 2013 - by Christopher Hogue, Ph.D.

Part 1: The Getty Open Content Image Set.

On August 12, 2013, the J. Paul Getty Trust announced a new commitment to sharing the Getty’s digital resources freely with all, and launched the Getty Open Content Program. The initial set of 4,596 high quality digital art images was made available on Getty’s website for use without restriction.

At Joyent we recognized the enormous cultural value of this open content art collection, and that it represented an opportunity to show off the Joyent Manta Storage Service platform. So I have written a series of blog posts that show how Manta and its compute capability can host, validate, resize, reformat, archive, checksum and serve this unique collection of digital images.

For the first in this series, let me cover the basics about the content and structure of the Getty Open image collection. I will show the wide distribution of image size found in the set, spanning 3 to over 300 MB. I will also explain how to get the associated metadata, and provide links to download a copy of the image set in both JPEG and WebP image formats. The compact WebP format images can be viewed with Chrome.

Getty Open Content Image Distribution

After some effort in downloading and validating the image set, I can start by summarizing the Getty Open Content in terms of the total size of the original image files and the distribution of image sizes. They are fairly large JPEG files comprising a total of 101.6 GB of historically important art, sculpture, artefacts and photographs in digital form.

[Figure: histogram of the file sizes of all 4,596 images]

The distribution of all 4,596 image sizes is shown in the histogram above. The peak on the left shows that most of the images are around 20 MB, but 14 are larger than 100 MB. You can see the relative pixel size of the two smallest images in the figure, which are photographs of a ring and a small carved female figure. These have file sizes of 3 and 3.7 MB respectively. Five of the largest images in the collection are shown, ranging in size from 188.5 to 327.5 MB, which are:

00099001.jpg 327.5MB: View of the Grand Canal and the Dogana
- by Bernardo Bellotto, about 1740.
- Reduced to 4, 1, or 0.25 Megapixel JPEG.
- Reduced to 4, 1, or 0.25 Megapixel WebP.
- XMP Metadata.

00056101.jpg 277.7MB: Clorinda Rescuing Sofronia and Olindo
- by Mattia Preti, about 1660.
- Reduced to 4, 1, or 0.25 Megapixel JPEG.
- Reduced to 4, 1, or 0.25 Megapixel WebP.
- XMP Metadata.

00104401.jpg 248.3MB: Madonna, Saint Thomas Aquinas, and Saint Paul
- by Bernardo Daddi, about 1330.
- Reduced to 4, 1, or 0.25 Megapixel JPEG.
- Reduced to 4, 1, or 0.25 Megapixel WebP.
- XMP Metadata.

00063101.jpg 200.9MB: Diana and Her Nymphs on the Hunt
- from the Workshop of Peter Paul Rubens, about 1615.
- Reduced to 4, 1, or 0.25 Megapixel JPEG.
- Reduced to 4, 1, or 0.25 Megapixel WebP.
- XMP Metadata.

11033001.jpg 188.5MB: John, Fourteenth Lord Willoughby de Broke, and his Family
- by Johann Zoffany, about 1766.
- Reduced to 4, 1, or 0.25 Megapixel JPEG.
- Reduced to 4, 1, or 0.25 Megapixel WebP.
- XMP Metadata.

Getty Open Content Image Data, Museum Object Identifiers and Metadata

The Getty Open Content image set is available under the terms and conditions from the J. Paul Getty Museum, and I have reproduced these here. Oddly, the image data set is not defined by a single download or a file list, but rather by a specific URL-based query:

http://search.getty.edu/gateway/search?q=&cat=highlight&f=%22Open+Content+Images%22&rows=10&srt=a&dir=s&pg=1

This could, of course, change at any time by updates to their database system.

A way to download all the image originals has not been provided, nor are there any checksums for download integrity.

You can go page-by-page through the thumbnails a few at a time and click download. I did that for one image before working out how to download the set, and I was directed to a form requesting information about my planned use of the image. If you decide to work with the Getty Open Content image resource in any form, please download one image from the Getty Museum directly and fill out their form.

To get the filenames of all the images, I altered the above query URL to return a single page of 5000 results by raising the rows parameter from its default of &rows=10. That trick gave me the whole image set displayed as thumbnails on a single HTML page, which I saved to my notebook. This step captured 4,599 thumbnails, which are named 00000201-T.jpg, 00000301-T.jpg ... 33681201-T.jpg. I count 4,596 images rather than 4,599 because three of the images are broken at the download source, namely 00660201.jpg, 03901101.jpg and 10821901.jpg. I will cover how I found these broken images with a one-line Manta compute job in Part 2 of this series.

I kept the list of image filenames (minus the -T thumbnail part) for downloading. The image numbers are non-sequential, so the list of image names is important for operating on the set.
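The filename list can be recovered from the saved page with standard text tools. Here is a minimal sketch, assuming the page was saved under a hypothetical name such as getty-thumbs.html and that every thumbnail appears in the HTML as eight digits followed by -T.jpg (in either case). The function name is illustrative, not taken from my original script.

```shell
# list_originals FILE: print one original filename per thumbnail reference.
# Assumption: thumbnails appear as 00000201-T.jpg or 00000201-T.JPG in FILE.
list_originals() {
  # 1. pull out every thumbnail name, 2. drop the -T suffix, 3. de-duplicate
  grep -oiE '[0-9]{8}-T\.jpg' "$1" |
    sed -E 's/-[Tt]\.[Jj][Pp][Gg]$/.jpg/' |
    sort -u
}
```

Running list_originals getty-thumbs.html > filenames.txt would then give a sorted list with one image per line.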

The last two digits in each file name before the .jpg extension are 01, which may be digits reserved for version numbering. The leading zeros in the filenames pad the numeric portions to 8 characters wide.

The plain image files can be linked to metadata information about where they came from, the artist, and location. There is curated metadata associated with the artwork on the Getty web site that you can retrieve by a web query, and there is embedded metadata in XMP format that can be retrieved from inside each image. You can retrieve information about the art from the source by using the base query shown here together with a museum object id.

http://search.getty.edu/museum/records/musobject?objectid=

Construct the museum object number from the image file name by removing any leading zeroes and the last two digits 01.
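This rule fits in a small shell function. The sketch below assumes every filename ends in 01.jpg, as observed in this set; the function names object_id and record_url are illustrative.

```shell
# object_id FILE: derive the museum object number from an image filename.
# Assumption: the name is NNNNNN01.jpg, as for every file in this set.
object_id() {
  # strip directory and .jpg, then the trailing 01, then any leading zeros
  basename "$1" .jpg | sed -E 's/01$//; s/^0+//'
}

# record_url FILE: append the object number to the base query URL.
record_url() {
  printf 'http://search.getty.edu/museum/records/musobject?objectid=%s\n' \
    "$(object_id "$1")"
}
```

For instance, object_id 00000201.jpg prints 2.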

Example: MetaData Recovery from Image File

I recognized the image 06841501.jpg as a famous photograph by Lewis Wickes Hine, an early photojournalist and former schoolteacher turned child labor activist. Here you can see one of Lewis Hine's tricks for documenting child labor in America. He would first photograph children standing next to machinery, then measure the height of the machinery to establish the height of the child. Little Sadie Pfeiffer was only 48 inches tall when this photo was taken at the Lancaster Cotton Mills in Lancaster, South Carolina on Nov. 30, 1908.

[Image: 06841501.jpg]

Here is a zoom-in showing the detail of Sadie Pfeiffer's focused gaze, and her tattered apron, as reproduced from the high quality Getty Open original file image:

[Image: detail of 06841501.jpg]

This image is well described in downloadable educational material made for teachers by the Getty Museum. Other web resources have lower-resolution versions of this image. For example, it is one in the set of over 5000 Lewis Hine images at http://www.lewishinephotographs.com/ (see here). The high resolution Getty Open image version shows remarkable clarity and detail. I noticed that the version of this photograph served by the Art Institute of Chicago (see here) is not cropped, whereas this high-resolution Getty Open version is cropped to remove the tattered edges of the photograph. This is a bit disappointing. While the number of Lewis Hine photos in the Getty Open image collection is a mere 6 at the moment, I hope this will increase as more of the collection is moved into the open content set. And I would plead: leave the cropping choice to the end-user.

So for this amazing photograph, the Getty Museum's object identifier is extracted as 68415 from image file identifier 06841501.jpg.

The museum object identifier is used with the base query to retrieve the contextual metadata in HTML: 'http://search.getty.edu/museum/records/musobject?objectid=68415'


To retrieve the XMP formatted metadata carried within this image, I use the ImageMagick convert command line tool, and show the summary line about the artwork with grep:

$ convert 06841501.jpg 06841501.xmp
$ grep -A4 "<dc:description>" 06841501.xmp
  <dc:description>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>Sadie Pfeiffer, Spinner in Cotton Mill, North Carolina; Lewis W. Hine, American, 1874 - 1940; North Carolina, United States, North America; negative 1910; print about 1920s - 1930s; Gelatin silver print; Sheet: 28 x 35.7 cm (11 x 14 1/16 in.); 84.XM.967.15</rdf:li>
   </rdf:Alt>
  </dc:description> 

I have put a simple text file with the complete list of image file names and this summary line in the Downloads section at the bottom of this post.

Simplifying Content Delivery with Hierarchical Directories

Serving large numbers of high-resolution graphics for web and mobile content is a demanding task. Customer display platforms range from handheld smartphones to small and large tablets. Browsers on new very high-resolution notebook and desktop displays can take advantage of much larger images as well. Web and mobile content providers should be concerned with matching the best resolution images to the end-user display. Also, increases in digital camera resolution are making originals bigger and bigger. Rather than just a thumbnail and a single resized image, content providers need a range of resized images.

So after a bit more digging on the Getty site, I found image versions served up in various sizes from two locations. For the Lewis Hine photograph I found:

http://www.getty.edu/art/collections/images/thumb/06841501-T.JPG - Thumbnail
http://www.getty.edu/art/collections/images/l/06841501.jpg - Large
http://www.getty.edu/art/collections/images/m/06841501.jpg - Medium
http://d2hiq5kf5j4p5h.cloudfront.net/06841501.jpg - Original

This illustrates the span of strategies for image hosting for web content delivery:

  • Variation in filename for thumbnails.
  • Variation in directory paths, /l/ and /m/, for large and medium sizes of the same image filename.
  • Object hosting for large data transfers, in this case an Amazon CloudFront URL, d2hiq5kf5j4p5h.cloudfront.net, where the first part of the URL is a hash indicating the customer, bucket, and server location.

Here is a set of reduced images I made using a Manta compute job, together with the original version, all hosted at:

https://us-east.manta.joyent.com/

/mantademo/public/images/getty-open/500x500_jpg/06841501.jpg - 0.25 Megapixel
/mantademo/public/images/getty-open/1000x1000_jpg/06841501.jpg - 1 Megapixel
/mantademo/public/images/getty-open/2000x2000_jpg/06841501.jpg - 4 Megapixel
/mantademo/public/images/getty-open/originals/06841501.jpg - Original
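Because the sizes differ only in one directory name, the full URL for any image and size can be assembled mechanically. A minimal sketch using the host and paths listed above; the variable and function names are illustrative.

```shell
MANTA_URL=https://us-east.manta.joyent.com
GETTY_ROOT=/mantademo/public/images/getty-open

# manta_urls FILE: print the public URL of each stored size of one image,
# from the 0.25 Megapixel version up to the original.
manta_urls() {
  for dir in 500x500_jpg 1000x1000_jpg 2000x2000_jpg originals; do
    printf '%s%s/%s/%s\n' "$MANTA_URL" "$GETTY_ROOT" "$dir" "$1"
  done
}
```

Calling manta_urls 06841501.jpg prints four URLs, one per line, which can be handed to a downloader such as cURL.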

One of the most intuitive and powerful features of the Joyent Manta Storage Service is its hierarchical directories.

This may seem obvious, but cloud object storage services have not previously delivered this standard filesystem feature.

Here the Manta account I am using is mantademo and the public subdirectory is automatically set up for publishing content for open downloading.

The private subdirectory tree is under /mantademo/stor. Publishing content is as easy as moving it into the /mantademo/public subdirectory.

The image resize jobs I used were run directly on the compute capacity of Manta using the Manta command line utilities.

In Part 3 of this series I will go over - in detail - the shell script I built that creates this entire set of resized images using ImageMagick, as well as details on color preservation, and lots of Manta tips so that you can reproduce the ideas developed in this example in other Manta compute scenarios.

Downloading and Validation Strategy

From the thumbnail directory listing and a bit of work with sed, I made a simple script that uses cURL to download the full-sized images from CloudFront, and I used mput to push them up to my Manta account. That script is here. I downloaded the originals twice, once to my notebook and once to Manta. I then compared the two image sets to isolate any incomplete transfers and to verify that the collection is intact.

In fact one of the images on my notebook was only partially loaded by cURL, probably at the moment I switched my wifi source for faster bandwidth. The image set was not available as a single downloadable file with a standard checksum. So this double download and compare validation is an important step to ensure I have captured the image set correctly.
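One way to run such a comparison, not necessarily the one I used, is a byte-for-byte check of every file. The sketch below assumes both copies are available as local directories; the function name and the directory layout are hypothetical.

```shell
# differing_files DIR_A DIR_B: print the name of every .jpg in DIR_A that is
# missing from DIR_B or whose bytes differ from the copy in DIR_B.
differing_files() {
  for a in "$1"/*.jpg; do
    [ -e "$a" ] || continue          # DIR_A contains no .jpg files
    name=$(basename "$a")
    # cmp -s is silent and returns non-zero on any difference or missing file
    cmp -s "$a" "$2/$name" 2>/dev/null || printf '%s\n' "$name"
  done
}
```

A truncated transfer shows up in the output because the shorter file no longer matches its counterpart.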

[Image: truncated copy of 06459301.jpg]

The broken download produced a truncated image on my notebook, file 06459301.jpg, of the Timothy H. O'Sullivan photograph, Field Where General Reynolds Fell, Gettysburg, taken in July 1863.

Download the Getty Open Set from Joyent Manta Storage Service

The complete validated set of images can be downloaded from these links, which are hosted on the Joyent Manta Storage Service:

Extracted Metadata - Image Descriptions File 1.08MB

0.25 Megapixel JPEG 292MB
- SHA-256 0aca1f5226f3091bb94cc9f78f24d4e1fe6f49f983530b475d39e1eb913add8c *130812_GettyOpen_500x500_jpg.tar

0.25 Megapixel WebP 172MB
- SHA-256 d4acd7856cd6f82e81949c65c3bed603ee8589a704ad268c89f6fbd726aff85a *130812_GettyOpen_500x500_webp.tar

1 Megapixel JPEG 1.01GB
- SHA-256 9eb6eeca968eb1ab066fdf35f62f2aeda7d481dab78f39c55d6b4900f4f50c38 *130812_GettyOpen_1000x1000_jpg.tar

1 Megapixel WebP 545MB
- SHA-256 b9b0ac17bb115cc9ed000c57dc69e13e0d558f3f51e79d56b77d023251d2daa0 *130812_GettyOpen_1000x1000_webp.tar

4 Megapixel JPEG 3.94GB
- SHA-256 bdb07a29d5fc6bbdc90ebed7896a1724f5e09b3b8d24a4762da6c010c12f0e05 *130812_GettyOpen_2000x2000_jpg.tar

4 Megapixel WebP 2.01GB
- SHA-256 616165c45700e2d7f61ecc1b22d277410a25f895ec47211b562b08985b70ddd7 *130812_GettyOpen_2000x2000_webp.tar

XMP Metadata 28.3MB
- SHA-256 5bfa3a9070782932501b00db6ea00399ca7a37fd8d3c04f8ede4637722211aa2 *130812_GettyOpen_metadata_xmp.tar
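After downloading an archive, its checksum can be compared with the value listed above. A minimal sketch, assuming the GNU coreutils sha256sum tool is installed (on some systems the equivalent is shasum -a 256); the function name verify_tar is illustrative.

```shell
# verify_tar HASH FILE: succeed only if FILE's SHA-256 digest equals HASH.
# Assumption: sha256sum is available and prints "<digest>  <name>".
verify_tar() {
  actual=$(sha256sum "$2" | cut -d' ' -f1)
  [ "$actual" = "$1" ]
}
```

For example, verify_tar 5bfa3a9070782932501b00db6ea00399ca7a37fd8d3c04f8ede4637722211aa2 130812_GettyOpen_metadata_xmp.tar && echo OK prints OK only when the download is intact.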

The aggregate sizes of these equivalent tar archives, which are not further compressed, show how JPEG and WebP compare. Because the image set is large and diverse, this makes for a robust size comparison of the two formats.

At 0.25 Megapixel, the WebP images are 41% smaller than the JPEG; at 1 Megapixel they are 46% smaller; and at 4 Megapixel they are 49% smaller. You can scroll back to the top and find links to the five largest images in the set in both formats at these resolutions and compare the image quality on a WebP compatible browser.
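These percentages follow directly from the archive sizes listed above. As a quick check, here is a small sketch that takes 1 GB as 1000 MB, which is an assumption about how the sizes were rounded; the function name is illustrative.

```shell
# savings JPEG_MB WEBP_MB: percent by which the WebP archive is smaller,
# rounded to the nearest whole number.
savings() {
  awk -v j="$1" -v w="$2" 'BEGIN { printf "%.0f\n", (1 - w / j) * 100 }'
}

savings 292 172      # 0.25 Megapixel: prints 41
savings 1010 545     # 1 Megapixel:    prints 46
savings 3940 2010    # 4 Megapixel:    prints 49
```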

To wrap up, this blog post introduces the Getty Open Image set hosted on Manta and provides the downloads that get you the image set in resolutions suitable for further experimentation.

In follow-on parts of this series, I will dive deeper into the code and methodology I used to compute everything you see in this post, including validation, image conversion to WebP and resizing, color preservation, creating archives, computing checksums, and extracting metadata one-liners, all on the Joyent Manta Storage Service.
