Building Packages at Scale
tl;dr We are able to build 14,000 packages across 6 zones in 4.5 hours
At Joyent we have long had a focus on high performance, whether it's through innovations in SmartOS, carefully selecting our hardware, or providing customers with tools such as DTrace to identify bottlenecks in their application stacks.
When it comes to building packages for SmartOS it is no different. We want to build them as quickly as possible, using the fewest resources, but without sacrificing quality or consistency.
To give you an idea of how many packages we build, here are the numbers:
Branch | Arch | Success | Fail | Total |
---|---|---|---|---|
2012Q4 | i386 | 2,245 | 20 | 2,265 |
2012Q4 | x86_64 | 2,244 | 18 | 2,262 |
2013Q1 | i386 | 2,303 | 40 | 2,343 |
2013Q1 | x86_64 | 2,302 | 39 | 2,341 |
2013Q2 | i386 | 10,479 | 1,277 | 11,756 |
2013Q2 | x86_64 | 10,290 | 1,272 | 11,562 |
2013Q3 | i386 | 11,286 | 1,317 | 12,603 |
2013Q3 | x86_64 | 11,203 | 1,308 | 12,511 |
2013Q4 | i386 | 11,572 | 1,277 | 12,849 |
2013Q4 | x86_64 | 11,498 | 1,270 | 12,786 |
2014Q1 | i386 | 12,450 | 1,171 | 13,621 |
2014Q1 | x86_64 | 12,356 | 1,150 | 13,506 |
2014Q2 | i386 | 13,132 | 1,252 | 14,384 |
2014Q2 | x86_64 | 13,102 | 1,231 | 14,333 |
Total | | | | 139,122 |
Now of course we don't continuously attempt to build 139,122 packages. However, when something like Heartbleed happens, we backport the fix to all of these branches, and a rebuild of something as heavily depended upon as OpenSSL can cause around 100,000 packages to be rebuilt.
Each quarter we add another release branch to our builds, and as you can see from the numbers above (2013Q1 and earlier were limited builds) the total number of packages in pkgsrc grows with each release.
Recently I've been focussing on improving the bulk build performance, both to ensure that fixes such as Heartbleed are delivered as quickly as possible, and also to ensure we aren't wasteful in our resource usage as our package count grows. All of our builds happen in the Joyent public cloud, so any resources we are using are taking away from the available pool to sell to customers.
Let's first take a walk through pkgsrc bulk build history, and then look at some of the performance wins I've been working on.
pkgsrc bulk builds, 2004
The oldest bulk build I performed that I can find is this one. My memory is a little fuzzy on what hardware I was using at the time, but I believe it was a SunFire V120 (1 x UltraSPARC IIi CPU @ 650MHz) with 2GB RAM. This particular build was on Solaris 8.
As you can see from the results page, it took 13.5 days to build 1,810 packages (and attempt but fail to build another 1,128)!
Back then the build would have been single threaded, with only one package being built at a time. There was no support for concurrent builds, `make -j` wouldn't have helped much, and essentially you just needed to be very patient.
May 2004: 2,938 packages in 13.5 days
pkgsrc bulk builds, 2010
Fast forward 6 years. At this point I'm building on much faster x86-based hardware (a Q9550 Core2 Quad @ 2.83GHz and 16G RAM) running Solaris 10; however, the builds are still single threaded and take 4 days to build 5,524 packages (and attempt but fail to build another 1,325).
All of the speed increase is coming directly from faster hardware.
May 2010: 6,849 packages in 4 days
pkgsrc @ Joyent, 2012 onwards
Shortly after joining Joyent, I started setting up our bulk build infrastructure. The first official build from this was for general illumos use. We were able to provide over 9,000 binary packages, which took around 7 days to build.
At this point we're starting to see the introduction of very large packages such as qt4, kde4, webkit, etc. These packages take a significant amount of time to build, so even though we are building on faster hardware than previously, the combination of an increased package count and increasing individual package build times means we're not seeing a reduction in total build time.
July 2012: 10,554 packages in 7 days
Performance improvements
At this point we start to look at ways of speeding up the builds themselves. As we have the ability to create build zones as required, the first step was to introduce distributed builds.
pbulk distributed builds
For pkgsrc in the 2007 Google Summer of Code, Jörg Sonnenberger wrote pbulk, a replacement for the older bulk build infrastructure that had been serving us well since 2004 but had started to show its age. One of the primary benefits of pbulk was that it supported a client/server setup to distribute builds, and so I worked on building across 6 separate zones. From my work log:
2012-09-25 (Tuesday) - New pbulk setup managed a full bulk build (9,414 packages) in 54 hours, since then I've added another 2 clients which should get us well under 2 days.
September 2012: 10,634 packages in 2 days
Distributed chrooted builds
By far the biggest win so far came in June 2013; however, I'm somewhat ashamed that it took me so long to think of it. By this time we were already using chroots for builds, as this ensures a clean and consistent build environment, keeps the host zone clean, and also allows us to perform concurrent branch builds (e.g. building i386 and x86_64 packages simultaneously on the same host but in separate chroots).
What it took me 9 months to realise, however, was that we could simply use multiple chroots for each branch build! This snippet from my log is enlightening:
2013-06-06 (Thursday) - Apply "duh, why didn't I think of that earlier" patch to the pbulk cluster which will give us massively improved concurrency and much faster builds.

2013-06-07 (Friday) - Initial results from the re-configured pbulk cluster show it can chew through 10,000 packages in about 6 hours, producing 9,000 binary pkgs. Not bad. Continue tweaking to avoid build stalls with large dependent packages (e.g. gcc/webkit/qt4).
Not bad indeed. The comment is somewhat misleading, though, as this comment I made on IRC on June 15th alludes to:
22:27 < jperkin> jeez lang/mercury is a monster
22:28 < jperkin> I can build over 10,000 packages in 18 hours, but that one package alone takes 7.5 hours before failing.
Multiple distributed chroots get us an awfully long way, but now we're stuck with big packages which ruin our total build times, and no amount of additional zones or chroots will help.
However, we are now under 24 hours for a full build for the first time. This is of massive benefit, as we can now do regular daily builds.
June 2013: 11,372 packages in 18 hours
make -j vs number of chroots
An ongoing effort has been to optimise the `MAKE_JOBS` setting used for each package build, balanced against the number of concurrent chroots. There are a number of factors to consider:
- The vast majority of `./configure` scripts are single threaded, so generally you should trade extra chroots for less `MAKE_JOBS`.
- The same goes for other phases of the package build (fetch, checksum, extract, patch, install, package).
- Packages which are highly depended upon (e.g. GCC, Perl, OpenSSL) should have a high `MAKE_JOBS`, as even with a large number of chroots enabled, most of them will be idle waiting for those builds to complete.
- Larger packages (e.g. KDE, Firefox) are built towards the end of a bulk build run. Similar to above, as they are later in the build there will be fewer chroots active, so a higher `MAKE_JOBS` can be afforded.
Large packages like webkit will happily burn as many cores as you give them and reward you with faster build times; however, giving them 24 dedicated cores isn't cost-effective. Our 6 build zones are sized at 16 cores / 16GB DRAM, and so far the sweet spot seems to be:
- 8 chroots per build (bumped to 16 if the build is performed whilst no other builds are happening).
- Default `MAKE_JOBS=2`.
- `MAKE_JOBS=4` for packages which don't have many dependents but are generally large builds which benefit from additional parallelism.
- `MAKE_JOBS=6` for webkit.
- `MAKE_JOBS=8` for heavily depended-upon packages which stall the build, and/or are built right at the end.
The `MAKE_JOBS` value is determined based on the current `PKGPATH`, and is dynamically generated so we can easily test new hypotheses.
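As a sketch of how such a dynamic setting might look, here is a hypothetical `mk.conf` fragment keyed on `PKGPATH`. The package classifications below are illustrative only, not our real configuration:

```make
# Hypothetical mk.conf fragment: choose MAKE_JOBS per package.
# The package lists are made up for illustration.
.if ${PKGPATH} == "www/webkit-gtk"
MAKE_JOBS=	6
.elif ${PKGPATH} == "lang/gcc47" || ${PKGPATH} == "lang/perl5" || \
      ${PKGPATH} == "security/openssl"
MAKE_JOBS=	8
.elif ${PKGPATH} == "x11/qt4-libs" || ${PKGPATH} == "lang/ghc"
MAKE_JOBS=	4
.else
MAKE_JOBS=	2
.endif
```

Keeping this logic in one generated file means a new scheduling hypothesis can be tested simply by regenerating the fragment and re-running a build.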
With various tweaks in place, fixes to packages, etc., we were running steady at around 12 hours for a full build.
August 2014: 14,017 packages in 12 hours
cwrappers
There are a number of unique technologies in pkgsrc that have been incredibly useful over the years. Probably the most useful has been the wrappers in our buildlink framework, which allow compiler and linker commands to be analysed and modified before being passed to the real tool. For example:
```make
# Remove any hardcoded GNU ld arguments unsupported by the SunOS linker.
.if ${OPSYS} == "SunOS"
BUILDLINK_TRANSFORM+=	rm:-Wl,--as-needed
.endif

# Stop using -fomit-frame-pointer and producing useless binaries! Transform
# it to "-g" instead, just in case they forgot to add that too.
BUILDLINK_TRANSFORM+=	opt:-fomit-frame-pointer:-g
```
There are a number of other features of the wrapper framework; however, it doesn't come without cost. The wrappers are written in shell, and fork a large number of `sed` and other commands to perform replacements. On platforms with an expensive `fork()` implementation this can have quite a detrimental effect on performance.
Jörg again was heavily involved in a fix for this, with his work on cwrappers, which replaced the shell scripts with C implementations. Despite being 99% complete, the final effort to get it over the line and integrated into pkgsrc hadn't been finished, so in September 2014 I took on the task, and the sample package results speak for themselves:
Package | Legacy wrappers | C wrappers | Speedup |
---|---|---|---|
wireshark | 3,376 seconds | 1,098 seconds | 3.07x |
webkit1-gtk | 11,684 seconds | 4,622 seconds | 2.52x |
qt4-libs | 11,866 seconds | 5,134 seconds | 2.31x |
xulrunner24 | 10,574 seconds | 5,058 seconds | 2.09x |
ghc6 | 2,026 seconds | 1,328 seconds | 1.52x |
As well as reducing the overall build time, the significant reduction in the number of forks meant the system time was a lot lower, allowing us to increase the number of build chroots. The end result was a reduction of over 50% in overall build time!
The work is still ongoing to integrate this into pkgsrc, and we hope to have it done for pkgsrc-2014Q4.
September 2014: 14,011 packages in 5 hours 20 minutes
Miscellaneous fork improvements
Prior to working on cwrappers I was looking at other ways to reduce the number of forks, using DTrace to monitor each internal pkgsrc phase. For example, the bmake `wrapper` phase generates a shadow tree of symlinks, and in packages with a large number of dependencies this was taking a long time.
Running DTrace to count totals of execnames showed:
```
$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
[...]
  grep         94
  sort        164
  nbsed       241
  mkdir       399
  bash        912
  cat        3893
  ln         7631
  rm         7766
  dirname    7769
```
Looking through the code showed a number of ways to reduce the large number of forks happening here.
cat -> echo

`cat` was being used to generate a `sed` script, with sections such as:
```sh
cat <<EOF
s|^$1\(/[^$_sep]*\.la[$_sep]\)|$2\1|g
s|^$1\(/[^$_sep]*\.la\)$|$2\1|g
EOF
```
There's no need to fork here; we can just use the builtin `echo` command instead:
```sh
echo "s|^$1\(/[^$_sep]*\.la[$_sep]\)|$2\1|g"
echo "s|^$1\(/[^$_sep]*\.la\)$|$2\1|g"
```
Use shell substitution where possible
The `dirname` commands were operating on full paths to files, and in this case we can simply use POSIX shell substitution instead, i.e.:
```sh
dir=`dirname $file`
```
becomes:
```sh
dir="${file%/*}"
```
Again, this saves a fork each time. This substitution isn't always possible, for example if you have trailing slashes, but in our case we were sure that `$file` was correctly formed.
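To make the difference concrete, here is a small standalone comparison (the path is just an example); both forms produce the same directory for a well-formed path with no trailing slash:

```shell
#!/bin/sh
file=/opt/local/lib/libfoo.la

# Forks a dirname process for every call.
dir1=`dirname $file`

# POSIX parameter expansion: strip the shortest trailing /component.
# No fork required.
dir2="${file%/*}"

echo "$dir1"
echo "$dir2"
```

Both print `/opt/local/lib`. Note the expansion differs from dirname(1) for edge cases such as `file=/foo` (dirname prints `/`, the expansion yields an empty string), which is why it only replaces dirname when paths are known to be well formed.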
Test before exec
The `rm` commands were being unconditionally executed in a loop:
```sh
for file; do
	rm -f $file
	..create file..
done
```
This is an expensive operation when you are running it on thousands of files each time, so simply test for the file first, using a cheap (and builtin) `stat(2)` call instead of forking an expensive `unlink(2)` in the majority of cases.
```sh
for file; do
	if [ -e $file ]; then
		rm -f $file
	fi
	..create file..
done
```
At this point we had removed most of the forks, with DTrace confirming:
```
$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
[ full output snipped for brevity ]
  grep         94
  cat         106
  sort        164
  nbsed       241
  mkdir       399
  bash        912
  ln         7631
```
The result was a big improvement, going from this:
```
$ ptime bmake wrapper

real     2:26.094442113
user       32.463077360
sys      1:48.647178135
```
to this:
```
$ ptime bmake wrapper

real       49.648642097
user       14.952946135
sys        33.989975053
```
Again note: not only are we reducing the overall runtime, but the system time is significantly less, improving overall throughput and reducing contention on the build zones.
Batch up commands
The most recent changes I've been working on further reduce forks, both by caching results and by batching up commands where possible. Taking the previous example again, the `ln` commands are the result of a loop similar to:
```sh
while read src dst; do
	src=..modify src..
	dst=..modify dst..
	ln -s $src $dst
done
```
Initially I didn't see any way to optimise this, but upon reading the `ln` manpage I observed the second form of the command, which allows you to symlink multiple files into a directory at once, for example:
```
$ ln -s /one /two /three dir
$ ls -l dir
lrwxr-xr-x  1 jperkin  staff  4 Oct  3 15:40 one -> /one
lrwxr-xr-x  1 jperkin  staff  6 Oct  3 15:40 three -> /three
lrwxr-xr-x  1 jperkin  staff  4 Oct  3 15:40 two -> /two
```
As it happens, this is ideally suited to our task, as `$src` and `$dst` will for the most part have the same basename.
Writing some `awk` allows us to batch up the commands and do something like this:
```sh
while read src dst; do
	src=..modify src..
	dst=..modify dst..
	echo "$src:$dst"
done | awk -F: '
{
	src = srcfile = $1;
	dest = destfile = destdir = $2;
	sub(/.*\//, "", srcfile);
	sub(/.*\//, "", destfile);
	sub(/\/[^\/]*$/, "", destdir);
	#
	# If the files have the same name, add them to the per-directory list
	# and use the "ln file1 file2 file3 dir/" style, otherwise perform a
	# standard "ln file1 dir/link1" operation.
	#
	if (srcfile == destfile) {
		if (destdir in links)
			links[destdir] = links[destdir] " " src
		else
			links[destdir] = src;
	} else {
		renames[dest] = src;
	}
	#
	# Keep a list of directories we have seen, so that we can batch them
	# up into a single "mkdir -p" command.
	#
	if (!(destdir in seendirs)) {
		seendirs[destdir] = 1;
		if (dirs)
			dirs = dirs " " destdir;
		else
			dirs = destdir;
	}
}
END {
	#
	# Print output suitable for piping to sh.
	#
	if (dirs)
		print "mkdir -p " dirs;
	for (dir in links)
		print "ln -fs " links[dir] " " dir;
	for (dest in renames)
		print "ln -fs " renames[dest] " " dest;
}
' | sh
```
There's an additional optimisation here too: we keep track of all the directories we need to create, and then batch them up into a single `mkdir -p` command.
Whilst this adds a considerable amount of code to what was originally a simple loop, the results are certainly worth it. The time for `bmake wrapper` in kde-workspace4, which has a large number of dependencies (and therefore symlinks required), reduces from 2m11s to just 19 seconds.
Batching wrapper creation: 7x speedup
Cache results
One of the biggest recent wins was in a piece of code which checks each ELF binary's `DT_NEEDED` and `DT_RPATH` to ensure they are correct and that we have recorded the correct dependencies. Written in awk, there were a couple of locations where it forked a shell to run commands:
```sh
cmd = "pkg_info -Fe " file
if (cmd | getline pkg) {
...
if (!system("test -f " libfile)) {
```
These were in functions that were called repeatedly for each file we were checking, and in a large package there may be lots of binaries and libraries which need checking. By caching the results like this:
```sh
if (file in pkgcache)
	pkg = pkgcache[file]
else
	cmd = "pkg_info -Fe " file
	if (cmd | getline pkg) {
		pkgcache[file] = pkg
...
if (!(libfile in libcache))
	libcache[libfile] = system("test -f " libfile)
if (!libcache[libfile]) {
```
This simple change made a massive difference! The kde-workspace4 package includes a large number of files to be checked, and the results went from this:
```
$ ptime bmake _check-shlibs
=> Checking for missing run-time search paths in kde-workspace4-4.11.5nb5

real     7:55.251878017
user     2:08.013799404
sys      5:14.145580838

$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
dtrace: description 'syscall::exece:return ' matched 1 probe
[...]
  greadelf      298
  pkg_info     5809
  ksh93       95612
```
to this:
```
$ ptime bmake _check-shlibs
=> Checking for missing run-time search paths in kde-workspace4-4.11.5nb5

real      18.503489661
user       6.115494568
sys       11.551809938

$ dtrace -n 'syscall::exece:return { @num[execname] = count(); }'
dtrace: description 'syscall::exece:return ' matched 1 probe
[...]
  pkg_info      114
  greadelf      298
  ksh93        3028
```
Cache awk system() results: 25x speedup
Avoid unnecessary tests
The biggest win so far, though, was also the simplest. One of the pkgsrc tests checks all files in a newly-created package for any `#!` paths which point to non-existent interpreters. However, do we really need to test all files? Some packages have thousands of files, and in my opinion there's no need to check files which are not executable.
We went from this:
```sh
if [ -f $file ]; then
	..test $file..
```
```
real  1:36.154904091
user    17.554778405
sys   1:10.566866515
```
to this:
```sh
if [ -x $file ]; then
	..test $file..
```
```
real  2.658741177
user  1.339411743
sys   1.236949825
```
Again, DTrace helped in identifying the hot path (30,000+ `sed` calls in this case) and narrowing down where to concentrate efforts.
Only test shebang in executable files: ~50x speedup
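The check itself boils down to something like the following sketch. The helper name and the `PKG_FILES` variable are hypothetical stand-ins for illustration; the real pkgsrc code differs:

```shell
#!/bin/sh
# Sketch: flag executable files whose #! interpreter does not exist.
# check_interpreter and PKG_FILES are hypothetical names.
check_interpreter() {
	f=$1
	# Only executables can usefully carry a shebang; skipping everything
	# else avoids thousands of needless reads and sed invocations.
	[ -x "$f" ] || return 0
	# Extract the interpreter path from a "#!/path [arg]" first line
	# (space-separated only, for simplicity of the sketch).
	interp=$(head -n 1 "$f" | sed -n 's/^#! *\([^ ]*\).*/\1/p')
	if [ -n "$interp" ] && [ ! -x "$interp" ]; then
		echo "$f: missing interpreter $interp"
	fi
}

for f in $PKG_FILES; do
	check_interpreter "$f"
done
```

Moving the `-x` test to the front is the entire optimisation: non-executable files return immediately without any fork.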
Miscellaneous system improvements
Finally, there have been some other general improvements I've implemented overthe past few months.
bash -> dash
`dash` is renowned as being a leaner, faster shell than `bash`, and I've certainly observed this when switching to it as the default `$SHELL` in builds. The normal concern is that there may be non-POSIX shell constructs in use, e.g. brace expansion, but I've observed relatively few of these, with the results being (prior to some of the other performance changes going in):
Shell | Successful packages | Average total build time |
---|---|---|
bash | 13,050 | 5hr 25m |
dash | 13,020 | 5hr 10m |
It's likely that with a small bit of work fixing non-portable constructs we can bring the package count for `dash` up to the same level. Note that the slightly reduced package count does not explain the reduced build time, as those failed packages have enough time to complete successfully before the other, larger builds we're waiting on complete anyway.
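For illustration, brace expansion is the classic case (a made-up example, not from a real package): bash expands the braces, while a POSIX shell such as dash passes them through literally, so the portable form spells the list out or uses a loop:

```shell
#!/bin/sh
# bash-only:  cp config.h.{in,orig}
#   bash expands this to:  cp config.h.in config.h.orig
#   dash passes the literal string "config.h.{in,orig}" - no expansion.
#
# POSIX-portable equivalent: write the names out, or loop over suffixes.
for ext in in orig; do
	printf '%s ' "config.h.$ext"
done
printf '\n'
```

Fixes like this are mechanical, which is why closing the 30-package gap looks like a small amount of work.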
Fix libtool to use printf builtin
libtool has a build-time test to see which command it should call for advanced printing:
```sh
# Test print first, because it will be a builtin if present.
if test "X`( print -r -- -n ) 2>/dev/null`" = X-n && \
   test "X`print -r -- $ECHO 2>/dev/null`" = "X$ECHO"; then
  ECHO='print -r --'
elif test "X`printf %s $ECHO 2>/dev/null`" = "X$ECHO"; then
  ECHO='printf %s\n'
else
  # Use this function as a fallback that always works.
  func_fallback_echo ()
  {
    eval 'cat <<_LTECHO_EOF
$[]1
_LTECHO_EOF'
  }
  ECHO='func_fallback_echo'
fi
```
Unfortunately on SunOS there is an actual `/usr/bin/print` command, thanks to ksh93 polluting the namespace. libtool finds it and so prefers it over printf, which is a problem, as there is no `print` in the POSIX spec, so neither dash nor bash implements it as a builtin.
Again, this is unnecessary forking that we want to fix (libtool is called a lot during a full bulk build!). Thankfully pkgsrc makes this easy: we can just create a broken `print` command which will be found before `/usr/bin/print`:
```make
.PHONY: create-print-wrapper
post-wrapper: create-print-wrapper
create-print-wrapper:
	${PRINTF} '#!/bin/sh\nfalse\n' > ${WRAPPER_DIR}/bin/print
	${CHMOD} +x ${WRAPPER_DIR}/bin/print
```
saving us millions of needless execs.
Parallelise where possible
There are a couple of areas where the pkgsrc bulk build was single threaded:
Initial package tools bootstrap
It was possible to speed up the bootstrap phase by adding custom `make -j` support, reducing the time by a few minutes.
Package checksum generation
Checksum generation was initially performed at the end of the build, running across all of the generated packages, so an obvious fix was to perform individual package checksum generation in each build chroot after the package build had finished, and then simply gather up the results at the end.
pkg_summary.gz generation
Similarly, for `pkg_summary.gz` we can generate individual per-package `pkg_info -X` output and then collate it at the end.
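A minimal sketch of the collation step (the fragment directory layout and names are hypothetical stand-ins, not our real paths): since `pkg_summary` entries are just blank-line-separated blocks of `variable=value` lines, collation is plain concatenation:

```shell
#!/bin/sh
# Collate per-package summary fragments into a single pkg_summary.gz.
# $fragdir and the *.summary naming are hypothetical.
fragdir=${fragdir:-/tmp/pkg-summary-fragments}

collate_summaries() {
	dir=$1
	for f in "$dir"/*.summary; do
		[ -f "$f" ] || continue
		cat "$f"
		echo	# pkg_summary entries are separated by a blank line
	done
}

collate_summaries "$fragdir" | gzip -9 > pkg_summary.gz
```

Because each fragment is produced inside the chroot immediately after its package build, the only serial work left at the end is this cheap concatenate-and-compress step.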
Optimising these single-threaded sections of the build resulted in around 20 minutes being taken off the total runtime.
Summary
The most recent build with all these improvements integrated together is here, showing a full from-scratch bulk build taking under 5 hours to build over 14,000 packages. We've come a long way since 2004:
Date | Package Builds | Total Build Time (hours) |
---|---|---|
2004/05 | 2,938 | 322 |
2010/05 | 6,849 | 100.5 |
2012/07 | 10,554 | 166.5 |
2012/10 | 10,634 | 48 |
2013/06 | 11,372 | 18 |
2014/08 | 14,017 | 12 |
2014/10 | 14,162 | 4.5 |
We've achieved this through a number of efforts:
- Distributed builds to scale across multiple hosts.
- Chrooted builds to scale on individual hosts.
- Tweaking `make -j` according to per-package effectiveness.
- Replacing scripts with C implementations in critical paths.
- Reducing forks by caching, batching commands, and using shell builtins where possible.
- Using faster shells.
- Parallelising single-threaded sections where possible.
What's next? There are plenty of areas for further improvements:
- Improved scheduling to avoid builds with high `MAKE_JOBS` from sharing the same build zone.
- `make(1)` variable caching between sub-makes.
- Replace `/bin/sh` on illumos (ksh93) with dash (even if there is no appetite for this upstream, thanks to chroots we can just mount it as `/bin/sh` inside each chroot!)
- Dependency graph analysis to focus on packages with the most dependencies.
- Avoid the "long tail" by getting the final few large packages building as early as possible.
- Building in memory file systems if build size permits.
- Avoid building multiple copies of libnbcompat during bootstrap.
Many thanks to Jörg for writing pbulk and cwrappers, Google for sponsoring the pbulk GSoC, the pkgsrc developers for all their hard work in adding and updating packages, and of course Joyent for employing me to work on this stuff.
Post written by Jonathan Perkin