Building Packages at Scale

tl;dr We are able to build 14,000 packages across 6 zones in 4.5 hours

At Joyent we have long had a focus on high performance, whether it's through innovations in SmartOS, carefully selecting our hardware, or providing customers with tools such as DTrace to identify bottlenecks in their application stacks.

When it comes to building packages for SmartOS it is no different. We want to build them as quickly as possible, using the fewest resources, but without sacrificing quality or consistency.

To give you an idea of how many packages we build, here are the numbers:

Branch    Arch      Success    Fail     Total
2012Q4    i386        2,245      20     2,265
2012Q4    x86_64      2,244      18     2,262
2013Q1    i386        2,303      40     2,343
2013Q1    x86_64      2,302      39     2,341
2013Q2    i386       10,479   1,277    11,756
2013Q2    x86_64     10,290   1,272    11,562
2013Q3    i386       11,286   1,317    12,603
2013Q3    x86_64     11,203   1,308    12,511
2013Q4    i386       11,572   1,277    12,849
2013Q4    x86_64     11,498   1,270    12,786
2014Q1    i386       12,450   1,171    13,621
2014Q1    x86_64     12,356   1,150    13,506
2014Q2    i386       13,132   1,252    14,384
2014Q2    x86_64     13,102   1,231    14,333

Total                                  139,122

Now of course we don't continuously attempt to build 139,122 packages. However, when something like Heartbleed happens, we backport the fix to all of these branches, and a rebuild of something as heavily depended upon as OpenSSL can cause around 100,000 packages to be rebuilt.

Each quarter we add another release branch to our builds, and as you can see from the numbers above (2013Q1 and earlier were limited builds) the total number of packages in pkgsrc grows with each release.

Recently I've been focussing on improving the bulk build performance, both to ensure that fixes such as Heartbleed are delivered as quickly as possible, and also to ensure we aren't wasteful in our resource usage as our package count grows. All of our builds happen in the Joyent public cloud, so any resources we are using are taking away from the available pool to sell to customers.

Let's first take a walk through pkgsrc bulk build history, and then look at some of the performance wins I've been working on.

pkgsrc bulk builds, 2004

The oldest bulk build I performed that I can find is this one. My memory is a little fuzzy on what hardware I was using at the time, but I believe it was a SunFire v120 (1 x UltraSPARC IIi CPU @ 650MHz) with 2GB RAM. This particular build was on Solaris 8.

As you can see from the results page, it took 13.5 days to build 1,810 (and attempt but fail to build 1,128) packages!

Back then the build would have been single threaded with only one package being built at a time. There was no support for concurrent builds, make -j wouldn't have helped much, and essentially you just needed to be very patient.

May 2004: 2,938 packages in 13.5 days

pkgsrc bulk builds, 2010

Fast forward 6 years. At this point I'm building on much faster x86-based hardware (a Q9550 Core2 Quad @ 2.83GHz and 16GB RAM) running Solaris 10; however, the builds are still single threaded and take 4 days to build 5,524 (and attempt but fail to build 1,325) packages.

All of the speed increase is coming directly from faster hardware.

May 2010: 6,849 packages in 4 days

pkgsrc @ Joyent, 2012 onwards

Shortly after joining Joyent, I started setting up our bulk build infrastructure. The first official build from this was for general illumos use. We were able to provide over 9,000 binary packages, which took around 7 days to build.

At this point we're starting to see the introduction of very large packages such as qt4, kde4, webkit, etc. These packages take a significant amount of time to build, so even though we are building on faster hardware than previously, the combination of an increased package count and increasing individual package build times means we're not seeing a reduction in total build time.

July 2012: 10,554 packages in 7 days

Performance improvements

At this point we start to look at ways of speeding up the builds themselves. As we have the ability to create build zones as required, the first step was to introduce distributed builds.

pbulk distributed builds

During the 2007 Google Summer of Code, Jörg Sonnenberger wrote pbulk for pkgsrc, a replacement for the older bulk build infrastructure that had been serving us well since 2004 but had started to show its age. One of the primary benefits of pbulk was that it supported a client/server setup to distribute builds, and so I worked on building across 6 separate zones. From my work log:

2012-09-25 (Tuesday)
 - New pbulk setup managed a full bulk build (9,414 packages) in 54 hours,
   since then I've added another 2 clients which should get us well under 2
   days.

September 2012: 10,634 packages in 2 days

Distributed chrooted builds

By far the biggest win so far was in June 2013; however, I'm somewhat ashamed that it took me so long to think of it. By this time we were already using chroots for builds, as they ensure a clean and consistent build environment, keep the host zone clean, and also allow us to perform concurrent branch builds (e.g. building i386 and x86_64 packages simultaneously on the same host but in separate chroots).

What it took me 9 months to realise, however, was that we could simply use multiple chroots for each branch build! This snippet from my log is enlightening:

2013-06-06 (Thursday)
 - Apply "duh, why didn't I think of that earlier" patch to the pbulk cluster
   which will give us massively improved concurrency and much faster builds.

2013-06-07 (Friday)
 - Initial results from the re-configured pbulk cluster show it can chew
   through 10,000 packages in about 6 hours, producing 9,000 binary pkgs.
   Not bad.  Continue tweaking to avoid build stalls with large dependent
   packages (e.g. gcc/webkit/qt4).

Not bad indeed. The comment is somewhat misleading, though, as this comment I made on IRC on June 15th alludes to:

22:27 < jperkin> jeez lang/mercury is a monster
22:28 < jperkin> I can build over 10,000 packages in 18 hours, but that
                 one package alone takes 7.5 hours before failing.

Multiple distributed chroots get us an awfully long way, but now we're stuck with big packages which ruin our total build times, and no amount of additional zones or chroots will help.

However, we are now under 24 hours for a full build for the first time. This is of massive benefit, as we can now do regular daily builds.

June 2013: 11,372 packages in 18 hours

make -j vs number of chroots

An ongoing effort has been to optimise the MAKE_JOBS setting used for each package build, balanced against the number of concurrent chroots. There are a number of factors to consider:

  • The vast majority of ./configure scripts are single threaded, so generally you should trade extra chroots for a lower MAKE_JOBS.
  • The same goes for other phases of the package build (fetch, checksum, extract, patch, install, package).
  • Packages which are highly depended upon (e.g. GCC, Perl, OpenSSL) should have a high MAKE_JOBS as, even with a large number of chroots enabled, most of them will be idle waiting for those builds to complete.
  • Packages built towards the end of a bulk build run (e.g. KDE, Firefox) tend to be large builds. As with the above, there will be fewer chroots active by that point, so a higher MAKE_JOBS can be afforded.

Large packages like webkit will happily burn as many cores as you give them and reward you with faster build times; however, giving them 24 dedicated cores isn't cost-effective. Our 6 build zones are sized at 16 cores / 16GB DRAM, and so far the sweet spot seems to be:

  • 8 chroots per build (bump to 16 if the build is performed whilst no other builds are happening).
  • Default MAKE_JOBS=2.
  • MAKE_JOBS=4 for packages which don't have many dependents but are generally large builds which benefit from additional parallelism.
  • MAKE_JOBS=6 for webkit.
  • MAKE_JOBS=8 for heavily depended-upon packages which stall the build, and/or are built right at the end.

The MAKE_JOBS value is determined based on the current PKGPATH and is dynamically generated so we can easily test new hypotheses.
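As a rough sketch of the idea (the package paths and values below are illustrative, not our production mapping), the generator is conceptually just a lookup from the package's PKGPATH to a job count:

#!/bin/sh
# Hypothetical sketch only: emit a MAKE_JOBS setting for the package at
# ${PKGPATH}.  The real generator and its package list live elsewhere.
case "${PKGPATH}" in
www/webkit*)
  echo "MAKE_JOBS=6" ;;
lang/gcc*|lang/perl5|security/openssl)
  echo "MAKE_JOBS=8" ;;
x11/qt4-libs|x11/kde-workspace4)
  echo "MAKE_JOBS=4" ;;
*)
  echo "MAKE_JOBS=2" ;;
esac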

With various tweaks in place, fixes to packages, etc., we were running steady at around 12 hours for a full build.

August 2014: 14,017 packages in 12 hours

cwrappers

There are a number of unique technologies in pkgsrc that have been incredibly useful over the years. Probably the most useful has been the wrappers in our buildlink framework, which allow compiler and linker commands to be analysed and modified before being passed to the real tool. For example:

# Remove any hardcoded GNU ld arguments unsupported by the SunOS linker.
.if ${OPSYS} == "SunOS"
BUILDLINK_TRANSFORM+=  rm:-Wl,--as-needed
.endif

# Stop using -fomit-frame-pointer and producing useless binaries!  Transform
# it to "-g" instead, just in case they forgot to add that too.
BUILDLINK_TRANSFORM+=  opt:-fomit-frame-pointer:-g

There are a number of other features of the wrapper framework; however, it doesn't come without cost. The wrappers are written in shell, and fork a large number of sed and other commands to perform replacements. On platforms with an expensive fork() implementation this can have quite a detrimental effect on performance.
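To give a feel for where those forks come from, here is a heavily simplified sketch of the wrapper idea (this is not the actual pkgsrc wrapper code, which handles far more cases):

#!/bin/sh
# Simplified illustration only: rewrite the argument list with sed, then hand
# off to the real compiler.  Every compile forks this script plus at least one
# sed, which adds up over tens of thousands of compiler invocations.
args=`echo "$@" | sed -e 's/-Wl,--as-needed//g' -e 's/-fomit-frame-pointer/-g/g'`
exec /usr/bin/gcc $args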

Jörg again was heavily involved in a fix for this, with his work on cwrappers, which replaced the shell scripts with C implementations. Despite being 99% complete, the final effort to get it over the line and integrated into pkgsrc hadn't been finished, so in September 2014 I took on the task, and the sample package results speak for themselves:

Package        Legacy wrappers    C wrappers       Speedup
wireshark        3,376 seconds    1,098 seconds    3.07x
webkit1-gtk     11,684 seconds    4,622 seconds    2.52x
qt4-libs        11,866 seconds    5,134 seconds    2.31x
xulrunner24     10,574 seconds    5,058 seconds    2.09x
ghc6             2,026 seconds    1,328 seconds    1.52x

As well as reducing the overall build time, the significant reduction in the number of forks meant the system time was a lot lower, allowing us to increase the number of build chroots. The end result was a reduction of over 50% in overall build time!

The work is still ongoing to integrate this into pkgsrc, and we hope to have it done for pkgsrc-2014Q4.

September 2014: 14,011 packages in 5 hours 20 minutes

Miscellaneous fork improvements

Prior to working on cwrappers I was looking at other ways to reduce the number of forks, using DTrace to monitor each internal pkgsrc phase. For example, the bmake wrapper phase generates a shadow tree of symlinks, and in packages with a large number of dependencies this was taking a long time.

Running DTrace to count totals of execnames showed:

$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
  [...]
  grep                                                             94
  sort                                                            164
  nbsed                                                           241
  mkdir                                                           399
  bash                                                            912
  cat                                                            3893
  ln                                                             7631
  rm                                                             7766
  dirname                                                        7769

Looking through the code showed a number of ways to reduce the large number of forks happening here.

cat -> echo

cat was being used to generate a sed script, with sections such as:

cat <<EOF
s|^$1\(/[^$_sep]*\.la[$_sep]\)|$2\1|g
s|^$1\(/[^$_sep]*\.la\)$|$2\1|g
EOF

There's no need to fork here, we can just use the builtin echo command instead:

echo "s|^$1\(/[^$_sep]*\.la[$_sep]\)|$2\1|g"echo "s|^$1\(/[^$_sep]*\.la\)$|$2\1|g"

Use shell substitution where possible

The dirname commands were operating on full paths to files, and in this case we can simply use POSIX shell substitution instead, i.e.:

dir=`dirname $file`

becomes:

dir="${file%/*}"

Again, this saves a fork each time. This substitution isn't always possible, for example if you have trailing slashes, but in our case we were sure that $file was correctly formed.
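For example, the two differ as soon as a trailing slash sneaks in (a contrived case):

$ file=/opt/pkg/lib/
$ dirname "$file"
/opt/pkg
$ echo "${file%/*}"
/opt/pkg/lib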

Test before exec

The rm commands were being unconditionally executed in a loop:

for file; do
  rm -f $file
  ..create file..
done

This is an expensive operation when you are running it on thousands of files each time, so simply test for the file first, using a cheap (and builtin) stat(2) call instead of forking rm to perform an unlink(2) for the majority of cases.

for file; do
  if [ -e $file ]; then
    rm -f $file
  fi
  ..create file..
done

At this point we had removed most of the forks, with DTrace confirming:

$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
  [...]
  [ full output snipped for brevity ]
  grep                                                             94
  cat                                                             106
  sort                                                            164
  nbsed                                                           241
  mkdir                                                           399
  bash                                                            912
  ln                                                             7631

The result was a big improvement, going from this:

$ ptime bmake wrapper

real     2:26.094442113
user       32.463077360
sys      1:48.647178135

to this:

$ ptime bmake wrapper

real       49.648642097
user       14.952946135
sys        33.989975053

Again, note that not only are we reducing the overall runtime, but the system time is significantly lower, improving overall throughput and reducing contention on the build zones.

Batch up commands

The most recent changes I've been working on have been to further reduce forks, both by caching results and by batching up commands where possible. Taking the previous example again, the ln commands are a result of a loop similar to:

while read src dst; do
  src=..modify src..
  dst=..modify dst..
  ln -s $src $dst
done

Initially I didn't see any way to optimise this, but upon reading the ln manpage I observed the second form of the command, which allows you to symlink multiple files into a directory at once, for example:

$ ln -s /one /two /three dir
$ ls -l dir
lrwxr-xr-x 1 jperkin staff 4 Oct  3 15:40 one -> /one
lrwxr-xr-x 1 jperkin staff 6 Oct  3 15:40 three -> /three
lrwxr-xr-x 1 jperkin staff 4 Oct  3 15:40 two -> /two

As it happens, this is ideally suited to our task, as $src and $dst will for the most part have the same basename.

Writing some awk allows us to batch up the commands and do something like this:

while read src dst; do
  src=..modify src..
  dst=..modify dst..
  echo "$src:$dst"
done | awk -F: '
{
  src = srcfile = $1;
  dest = destfile = destdir = $2;
  sub(/.*\//, "", srcfile);
  sub(/.*\//, "", destfile);
  sub(/\/[^\/]*$/, "", destdir);
  #
  # If the files have the same name, add them to the per-directory list
  # and use the "ln file1 file2 file3 dir/" style, otherwise perform a
  # standard "ln file1 dir/link1" operation.
  #
  if (srcfile == destfile) {
    if (destdir in links)
      links[destdir] = links[destdir] " " src
    else
      links[destdir] = src;
  } else {
    renames[dest] = src;
  }
  #
  # Keep a list of directories we have seen, so that we can batch them up
  # into a single "mkdir -p" command.
  #
  if (!(destdir in seendirs)) {
    seendirs[destdir] = 1;
    if (dirs)
      dirs = dirs " " destdir;
    else
      dirs = destdir;
  }
}
END {
  #
  # Print output suitable for piping to sh.
  #
  if (dirs)
    print "mkdir -p " dirs;
  for (dir in links)
    print "ln -fs " links[dir] " " dir;
  for (dest in renames)
    print "ln -fs " renames[dest] " " dest;
}
' | sh

There's an additional optimisation here too - we keep track of all the directories we need to create, and then batch them up into a single mkdir -p command.

Whilst this adds a considerable amount of code to what was originally a simple loop, the results are certainly worth it. The time for bmake wrapper in kde-workspace4, which has a large number of dependencies (and therefore symlinks required), reduces from 2m11s to just 19 seconds.

Batching wrapper creation: 7x speedup

Cache results

One of the biggest recent wins was in a piece of code which checks each ELF binary's DT_NEEDED and DT_RPATH entries to ensure they are correct and that we have recorded the correct dependencies. Written in awk, it had a couple of locations where it forked a shell to run commands:

cmd = "pkg_info -Fe " fileif (cmd | getline pkg) {...if (!system("test -f " libfile))) {

These were in functions that were called repeatedly for each file we were checking, and in a large package there may be lots of binaries and libraries which need checking. By caching the results like this:

if (file in pkgcache)
  pkg = pkgcache[file]
else {
  cmd = "pkg_info -Fe " file
  if (cmd | getline pkg) {
    pkgcache[file] = pkg
...
if (!(libfile in libcache))
  libcache[libfile] = system("test -f " libfile)
if (!libcache[libfile]) {

This simple change made a massive difference! The kde-workspace4 package includes a large number of files to be checked, and the results went from this:

$ ptime bmake _check-shlibs
=> Checking for missing run-time search paths in kde-workspace4-4.11.5nb5

real     7:55.251878017
user     2:08.013799404
sys      5:14.145580838

$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
dtrace: description 'syscall::exece:return ' matched 1 probe
  [...]
  greadelf                                                        298
  pkg_info                                                       5809
  ksh93                                                         95612

to this:

$ ptime bmake _check-shlibs
=> Checking for missing run-time search paths in kde-workspace4-4.11.5nb5

real       18.503489661
user        6.115494568
sys        11.551809938

$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
dtrace: description 'syscall::exece:return ' matched 1 probe
  [...]
  pkg_info                                                        114
  greadelf                                                        298
  ksh93                                                          3028

Cache awk system() results: 25x speedup

Avoid unnecessary tests

The biggest win so far, though, was also the simplest. One of the pkgsrc tests checks all files in a newly-created package for any #! paths which point to non-existent interpreters. However, do we really need to test all files? Some packages have thousands of files, and in my opinion there's no need to check files which are not executable.

We went from this:

if [ -f $file ]; then
  ..test $file..

real     1:36.154904091
user       17.554778405
sys      1:10.566866515

to this:

if [ -x $file ]; then
  ..test $file..

real        2.658741177
user        1.339411743
sys         1.236949825

Again DTrace helped in identifying the hot path (30,000+ sed calls in this case) and narrowing down where to concentrate efforts.

Only test shebang in executable files: ~50x speedup

Miscellaneous system improvements

Finally, there have been some other general improvements I've implemented over the past few months.

bash -> dash

dash is renowned as being a leaner, faster shell than bash, and I've certainly observed this when switching to it as the default $SHELL in builds. The normal concern is that there may be non-POSIX shell constructs in use, e.g. brace expansion, but I've observed relatively few of these, with the results being (prior to some of the other performance changes going in):

Shell    Successful packages    Average total build time
bash     13,050                 5hr 25m
dash     13,020                 5hr 10m

It's likely that with a small bit of work fixing non-portable constructs we can bring the package count for dash up to the same level. Note that the slightly reduced package count does not explain the reduced build time, as those failed packages have enough time to complete before the other larger builds we're waiting on finish anyway.
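The non-portable constructs involved are usually trivial to fix; brace expansion is a typical example (contrived here rather than taken from a specific package):

# bash expands the braces, dash passes them through literally
cp config.h{.orig,}           # works in bash only
cp config.h.orig config.h     # portable POSIX equivalent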

Fix libtool to use printf builtin

libtool has a build-time test to see which command it should call for advanced printing:

# Test print first, because it will be a builtin if present.
if test "X`( print -r -- -n ) 2>/dev/null`" = X-n && \
   test "X`print -r -- $ECHO 2>/dev/null`" = "X$ECHO"; then
  ECHO='print -r --'
elif test "X`printf %s $ECHO 2>/dev/null`" = "X$ECHO"; then
  ECHO='printf %s\n'
else
  # Use this function as a fallback that always works.
  func_fallback_echo ()
  {
    eval 'cat <<_LTECHO_EOF
$[]1
_LTECHO_EOF'
  }
  ECHO='func_fallback_echo'
fi

Unfortunately on SunOS there is an actual /usr/bin/print command, thanks to ksh93 polluting the namespace. libtool finds it and so prefers it over printf, which is a problem as there is no print in the POSIX spec, so neither dash nor bash implement it as a builtin.

Again, this is unnecessary forking that we want to fix (libtool is called a lot during a full bulk build!). Thankfully pkgsrc makes this easy - we can just create a broken print command which will be found before /usr/bin/print:

.PHONY: create-print-wrapper
post-wrapper: create-print-wrapper
create-print-wrapper:
	${PRINTF} '#!/bin/sh\nfalse\n' > ${WRAPPER_DIR}/bin/print
	${CHMOD} +x ${WRAPPER_DIR}/bin/print

saving us millions of needless execs.

Parallelise where possible

There are a couple of areas where the pkgsrc bulk build was single threaded:

Initial package tools bootstrap

It was possible to speed up the bootstrap phase by adding custom make -j support, reducing the time by a few minutes.

Package checksum generation

Checksum generation was initially performed at the end of the build, running across all of the generated packages, so an obvious fix was to perform individual package checksum generation in each build chroot after the package build had finished, and then simply gather up the results at the end.
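A rough sketch of the idea (the paths, the ${CHROOT} variable, and the exact digest invocation here are illustrative rather than our real implementation):

# run after each package build completes inside a chroot
digest sha512 /shared/packages/All/foo-1.0.tgz \
    >> /shared/packages/All/SHA512.${CHROOT}

# at the end of the run the master collates the per-chroot fragments
cat /shared/packages/All/SHA512.* > /shared/packages/All/SHA512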

pkg_summary.gz generation

Similarly for pkg_summary.gz we can generate individual per-package pkg_info -X output and then collate it at the end.
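The same collation trick works here too (again a sketch with illustrative paths):

# per package, immediately after the binary package has been created
pkg_info -X /shared/packages/All/foo-1.0.tgz \
    >> /shared/summaries/pkg_summary.${CHROOT}

# at the end of the run, collate and compress
cat /shared/summaries/pkg_summary.* | gzip -9 > /shared/packages/All/pkg_summary.gz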

Optimising these single-threaded sections of the build resulted in around 20 minutes being taken off the total runtime.

Summary

The most recent build with all these improvements integrated together is here, showing a full from-scratch bulk build taking under 5 hours to build over 14,000 packages. We've come a long way since 2004:

Date       Package Builds    Total Build Time (hours)
2004/05     2,938            322
2010/05     6,849            100.5
2012/07    10,554            166.5
2012/10    10,634            48
2013/06    11,372            18
2014/08    14,017            12
2014/10    14,162            4.5

We've achieved this through a number of efforts:

  • Distributed builds to scale across multiple hosts.
  • Chrooted builds to scale on individual hosts.
  • Tweaking make -j according to per-package effectiveness.
  • Replacing scripts with C implementations in critical paths.
  • Reducing forks by caching, batching commands, and using shell builtins where possible.
  • Using faster shells.
  • Parallelising single-threaded sections where possible.

What's next? There are plenty of areas for further improvements:

  • Improved scheduling to avoid builds with a high MAKE_JOBS sharing the same build zone.
  • make(1) variable caching between sub-makes.
  • Replace /bin/sh on illumos (ksh93) with dash (even if there is no appetite for this upstream, thanks to chroots we can just mount it as /bin/sh inside each chroot!)
  • Dependency graph analysis to focus on packages with the most dependencies.
  • Avoid the "long tail" by getting the final few large packages building asearly as possible.
  • Building in memory file systems if build size permits.
  • Avoid building multiple copies of libnbcompat during bootstrap.

Many thanks to Jörg for writing pbulk and cwrappers, Google for sponsoring the pbulk GSoC, the pkgsrc developers for all their hard work in adding and updating packages, and of course Joyent for employing me to work on this stuff.



Post written by Jonathan Perkin