How scalable is SCons?

The marquee feature in ElectricAccelerator 5.0 is Electrify, a new front-end to the Accelerator cluster that allows us to distribute work from a wide variety of processes in addition to the make-based processes that we have always managed. One example is SCons, an alternative build system implemented in Python that has a small (compared to make) but apparently slowly growing market share. It’s sometimes touted as an ideal replacement for make, with a long list of reasons why it is considered superior. But not everybody likes it. Some have reported significant performance problems. Even the SCons maintainers agree: SCons “Can get slow on big projects”.

Of course that caught my eye, since making big projects build fast is what I do. What exactly does it mean that SCons “can get slow” on “big” projects? How slow is slow? How big is big? So to satisfy my own curiosity, and so that I might better advise customers seeking to use SCons with Electrify, I set out to answer those questions. All I needed was some free hardware and some time. Lots and lots and lots of time.

Automating around scarcity by using virtual resources

[posted on behalf of Usman Muzaffar, who is on a long flight with no WiFi]

Here’s a sobering truth that shows up often in software automation: people are way better at sharing stuff than computers are. For example: say you have a scarce resource, like a box with special hardware or a service with serial access. You’re tasked with automating a software build/test/release workflow, and part of it needs to talk to this One Big And Fancy Thing. Do you try to teach your build script good playground behavior, so it automatically knows when to wait politely (and when, as deadlines approach, it should bully its way to the top of the slide), or do you declare this problem out-of-scope, and just provide the hook to let the team manage access manually?

The default on that checkbox is: *don’t automate*, for two reasons. First: letting people handle it means no extra work. More importantly, because we’ve been doing it our whole life, we’re actually pretty good at adapting to environments where we have to share things, whether that’s roads or restrooms or rack space. A small number of people on the same team with similar goals will usually self-organize around a few ground rules with a minimum of fuss. One clear and crisply delivered directive at a weekly team meeting (“OK guys the new 32-way sol box is for the full server test suite, so give that priority and check with each other before you use it for other stuff”) is often all it takes.

Second, technically getting the semantics of shared simultaneous access right is a notorious pain in the neck. As in any software automation system, there’s no credit for a partial answer: it’s a net loss if your script still needs a babysitter for the corner cases. So that means your solution needs to take selection and queuing and load into account, and have mechanisms for priority and pre-emption and be smart about busted network connections. More fundamentally, at its core it usually boils down to something awfully close to multithreaded programming, with the usual challenges in that space around semaphores, locks, deadlocks, races. Great stuff in a CS course or maybe your server’s ConnectionPool class — rathole alert in your build and test system!

So, largely with good reason, the automation train comes to a screeching halt right here. It’s just not worth the effort to build a system that’s going to manage the synchronization for parallel access to scarce resources. In other words: when shared resources wind up in the software production system, people show up next to them, and that sucks all the fun (and potential efficiency gains) out of automation. What to do?

One thing worth investigating is a tool that can handle this for you. Solving this was a key goal for our ElectricCommander product. Commander lets you describe your job as a series of command line steps, and each step can be specified to run on a resource. A resource is simply a system that we’ll remotely execute commands on, and it comes with a sack full of infrastructure goodies you’d expect like pooling, exclusive reservation, broadcast, security, access control, load balancing, and fault tolerance. As a user of the system, you specify what you want to run, and where you want to run it, press the ‘Go’ button and Commander does the rest, queuing steps when resources are oversubscribed and efficiently scheduling around your other constraints. Nice!

Then one day a customer asked us how they could automatically control access to a piece of hardware that simulated network traffic critical to the product’s system test. This wasn’t a gadget we could install software on; indeed, we couldn’t directly connect to it at all, so Commander can’t treat it as a resource. But it soon became evident that we could solve this just as elegantly with a simple tweak to the approach. Fundamentally, we needed the ability to specify that a step 1) needed access, 2) must block when it wasn’t available, and 3) once acquired, hang on to it until it was done. If something could just take care of this synchronization and queuing, the test could connect to the traffic simulator directly and simply execute as if invoked manually.

In other words: the problem called for a *subset* of Commander resources; ignore half the stuff in the goodie sack (remote login, execution, fault tolerance, etc.) and you’re left with a general purpose resource access and acquisition facility. We set up dummy resources (good old 127.0.0.1, always up and ready for this sort of game!), injected them into the workflow and configured the job to hang on to them as long as it was talking to the traffic simulator. It worked beautifully: each test run was guaranteed to get just the access it needed, and for the first time, the customer had safe, parallel end-to-end automation for the full test cycle.
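The three requirements (request access, block while it is unavailable, hold on until done) boil down to a counting semaphore. Here is a minimal Python sketch of that idea. It is not how Commander implements resources; the `VirtualResource` class, the `run_steps` helper and the `traffic-simulator` name are all hypothetical, made up for illustration.

```python
import threading
from contextlib import contextmanager


class VirtualResource:
    """A named token pool: nothing runs on it, steps just acquire and release it."""

    def __init__(self, name, slots=1):
        self.name = name
        self._slots = threading.Semaphore(slots)

    @contextmanager
    def reserved(self):
        self._slots.acquire()      # block until a slot is free
        try:
            yield self
        finally:
            self._slots.release()  # relinquish even if the step fails


def run_steps(resource, step_count):
    """Run step_count concurrent steps; return the peak number holding the resource."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def step():
        with resource.reserved():
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            # ... a real step would talk to the scarce device here ...
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=step) for _ in range(step_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    simulator = VirtualResource("traffic-simulator", slots=1)
    print("peak concurrent holders:", run_steps(simulator, 8))
```

With a single slot, eight competing steps never overlap while they hold the resource. Everything that makes a production system hard (priority, pre-emption, dropped connections) is deliberately left out.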

More importantly, this design pattern, since dubbed Virtual Resources, opened a whole new realm of possibilities. Once you start looking for them, there are *lots* of shared things in a software system that aren’t compute hosts, and they’re all threatening or overcomplicating automation in some way or another. We’ve used Virtual Resources to manage access to database tables, SCM labels, virtual machines, filesystem repositories, and flaky external systems that don’t like more than one client talking to them, and our customers keep showing us new ways. It’s a great example of how the core of a clean design (a resource is something a job can request and relinquish) was readily adapted to a wider set of problems around Software Production Automation.

How to quickly navigate an unfamiliar makefile

The other day, I was working with an unfamiliar build and I needed to get familiar with it in a hurry. In this case, I was dealing with a makefile generated by the Perl utility h2xs, but the trick I’ll show you here works any time you need to find your way around a new build system, whether it’s something you just downloaded or an internal project you just transferred to.

What I wanted to do was add a few object files to the link command. Here’s the build log, with the link command highlighted:

gcc -c  -I. -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g   -DVERSION=\"0.01\" -DXS_VERSION=\"0.01\" -fPIC "-I/usr/lib/perl/5.10/CORE"   mylib.c
rm -f blib/arch/auto/mylib/mylib.so
gcc  -shared -O2 -g -L/usr/local/lib mylib.o   -o blib/arch/auto/mylib/mylib.so   \
                \

chmod 755 blib/arch/auto/mylib/mylib.so

Should be easy, right? I just needed to find that command in the makefile and make my changes. Wrong. Read on to see how annotation helped solve this problem.

The Cloud Two-Step: How do you know what Dev/Test processes to run in the Cloud?

I just came across a piece that Bernard Golden wrote in his CIO blog entitled Dev/Test in the Cloud: Rules for Getting it Right. He makes a lot of good points, including one that matches what we see the most successful enterprise development shops doing with the cloud: “Treat the cloud as an extension, not a separation.”

Unfortunately, he does not point out which dev/test tools should be put up in the cloud but simply states “dev/test tasks,” as if it is obvious which ones to migrate. Let’s see if we can leverage his work to figure out which dev/test tasks are cloud-ready in two steps:

Seven lessons from seven years at Electric Cloud

We wrapped up the 2009 Electric Cloud Customer Summit a couple weeks ago. Like last year, I left refreshed and reinvigorated after hearing so many customers’ stories. Comments like, “Developer builds are now measured in seconds [with Accelerator]. Nobody does local builds anymore,” and, “ElectricAccelerator will give you better performance than you deserve,” really make me feel like all the hard work is worthwhile. But one of my favorite takeaways from the summit was this picture:

From left: Eric Melski, Usman Muzaffar, Sven Delmas, Scott Stanton

That’s the original engineering team at Electric Cloud, still together after more than seven years. It’s unusual to catch us all together in one spot like this though, since we have very different roles at the company now: I’m an Architect; Usman is VP of Product Management; Sven is Director of Engineering; and Scott is the Chief Architect.

When I saw this picture I wondered if I could find a similar shot from sometime in Electric Cloud’s history. Lucky for me we’ve always had a few shutterbugs at the company, so I had a few to choose from.

What is SparkBuild?

At the 2009 Electric Cloud Customer Summit we introduced SparkBuild, a free gmake- and NMAKE-compatible build tool. SparkBuild is now in public beta, and several people have asked us for some more explanation: what is SparkBuild and why should I care? I thought I’d take a crack at answering those questions, hopefully without sounding too “marketingy”. Here goes.

SparkBuild is actually a package containing two components: SparkBuild emake and SparkBuild Insight. As the names imply, these components are derived from the corresponding pieces of ElectricAccelerator and ElectricInsight. That means that they offer some of the same benefits that our commercial products do, but for free, of course. Why did we do this? I’ll be completely honest with you: we’re hoping that people will use SparkBuild, share SparkBuild and talk about SparkBuild, ultimately raising awareness of Electric Cloud and our commercial products. Beyond that though, I’m personally excited about SparkBuild because I want to see these technologies that I’ve worked on for so long get used by as many people as possible.

I think you’ll be interested in SparkBuild because it offers some of the same benefits that our commercial tools do (annotation and build analysis), and even some that aren’t yet part of Accelerator (subbuilds). Read on to learn more about what these features provide and how you can use them.

Subbuilds: build avoidance done right

I’ve heard it said that the best programmer is a lazy programmer. I’ve always taken that to mean that the best programmers avoid unnecessary work, by working smarter and not harder; and that they focus on building only those features that are really required now, not allowing speculative work to distract them.

I wouldn’t presume to call myself a great programmer, but I definitely hate doing unnecessary work. That’s why the concept of build avoidance is so intriguing. If you’ve spent any time on the build speed problem, you’ve probably come across this term. Unfortunately it’s been conflated with the single technique implemented by tools like ccache and ClearCase winkins. I say “unfortunate” for two reasons: first, those tools don’t really work all that well, at least not for individual developers; and second, the technique they employ is not really build avoidance at all, but rather object reuse. Because the term has been co-opted and associated with such lackluster results, many people have become dismissive of build avoidance.

Subbuilds are a more literal, and more effective, approach to build avoidance: reduce build time by building only the stuff required for your active component. Don’t waste time building the stuff that’s not related to what you’re working on now. It seems so obvious I’m almost embarrassed to be explaining it. But the payoff is anything but embarrassing. On my project, after making changes to one of the prerequisite libraries for the application I’m working on, a regular incremental build takes about 10 minutes; a subbuild incremental takes just 77 seconds:

Standard incremental: 609s
Subbuild incremental: 77s

Not bad! Read on for more about how subbuilds work and how you can get SparkBuild, a free gmake- and NMAKE-compatible build tool, so you can try subbuilds yourself.

Private clouds: more than just buzzword bingo

A friend pointed me to a blog in which Ronald Schmelzer, an analyst at ZapThink, asserts that the term “private cloud” is nothing more than empty marketing hype. Ironically, he proposes that we instead use the term “service-oriented cloud computing.” Maybe I’m being obtuse, but “service-oriented” anything is about the most buzzladen term I’ve heard in the last five years. Seriously, have you read the SOA article on Wikipedia? It’s over 5,000 words long, chock-a-block full of the “principles of service-orientation” like “autonomy” and “composability”. What a joke!

Let me see how many words I need to define private clouds. It’s a centralized infrastructure supplied by a single organization’s IT department that provides virtualized compute resources on demand to users within that organization. Let’s see, that’s… 21 words. Not bad, but I bet if you’re like me, you’re probably looking at that and thinking that it still doesn’t make much sense, so let me give you a concrete example.

Using Markov Chains to Generate Test Input

One challenge that we’ve faced at Electric Cloud is how to verify that our makefile parser correctly emulates GNU Make. We started by generating test cases based on a close reading of the gmake manual. Then we turned to real-world examples: makefiles from dozens of open source projects and from our customers. After several years of this we’ve accumulated nearly two thousand individual tests of our gmake emulation, and yet we still sometimes find incompatibilities. We’re always looking for new ways to test our parser.

One idea is to generate random text and use that as a “makefile”. Unfortunately, truly random text is almost useless in this regard, because it doesn’t look anything like a real makefile. Instead, we can use Markov chains to generate random text that is very much like a real makefile. When we first introduced this technique, we uncovered 13 previously unknown incompatibilities — at the time that represented 10% of the total defects reported against the parser! Read on to learn more about Markov chains and how we applied them in practice.
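To give a flavor of the idea, here is a minimal Python sketch of a Markov-chain text generator trained on a tiny makefile. It is an illustration under my own assumptions, not the generator we used: the character-level model, the order of 2, the `SAMPLE` makefile and the function names are all invented for this example.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` characters to the characters that followed it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain


def generate(chain, length, seed=None):
    """Walk the chain from a random state, emitting up to `length` characters."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = state
    while len(out) < length:
        followers = chain.get(out[-len(state):])
        if not followers:          # dead end: this state only appeared at the end
            break
        out += rng.choice(followers)
    return out


SAMPLE = """\
CC = gcc
CFLAGS = -O2 -g
OBJS = main.o util.o

all: prog

prog: $(OBJS)
\t$(CC) $(CFLAGS) -o $@ $(OBJS)

%.o: %.c
\t$(CC) $(CFLAGS) -c $< -o $@
"""

if __name__ == "__main__":
    print(generate(build_chain(SAMPLE), 200, seed=1))
```

Every three-character window in the output also occurs somewhere in the training text, which is why the result looks like a makefile even though it is random. A larger corpus and a higher order should give more varied and more plausible output.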

Getting data to the cloud

One of the problems facing cloud computing is the difficulty in getting data from your local servers to the cloud. My home Internet connection offers me maybe 768 Kbps upstream, on a good day, if I’m standing in the right place and nobody else in my neighborhood is home. Even at the office, we have a fractional T1 connection, so we get something like 1.5 Mbps upstream. One (just one!) of the VM images I use for testing is 3.3 GB. Pushing that up to the cloud would take about five hours under ideal conditions!
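For anybody who wants to check the arithmetic, here is a small Python sketch. The function name is mine, and it assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a link that is fully used, with no protocol overhead.

```python
def transfer_hours(size_gb, link_mbps):
    """Hours to send size_gb gigabytes over a link_mbps link, ignoring overhead."""
    bits = size_gb * 1e9 * 8            # decimal gigabytes to bits
    seconds = bits / (link_mbps * 1e6)  # megabits per second to bits per second
    return seconds / 3600


if __name__ == "__main__":
    print(f"office link (1.5 Mbps):  {transfer_hours(3.3, 1.5):.1f} h")
    print(f"home link (0.768 Mbps):  {transfer_hours(3.3, 0.768):.1f} h")
```

Under those assumptions the office link needs about 4.9 hours for the 3.3 GB image, and the home connection about 9.5 hours.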

I don’t know what the solution to this problem is, yet, but it’s definitely something a lot of people are working on. I thought I’d point out a couple of interesting ideas in this area. First is the Fast and Secure Protocol, a TCP replacement developed by Aspera and now integrated with Amazon Web Services. The basic idea is to improve transmission rates by eliminating some of the inefficiencies in TCP. In theory this will allow you to more reliably achieve those “ideal condition” transfer rates, and if their benchmarks are to be believed, they’ve done just that. However, all this does is help me ensure that transferring my VM image really does take “only” 5 hours — so I guess that’s good, but this doesn’t seem like a revolution.

From my perspective, a more interesting idea is LBFS, the low-bandwidth filesystem. This is a network filesystem, like NFS, but expressly designed for use over “skinny” network connections. It was developed several years ago at MIT, but I hadn’t heard of it until today, so I imagine many of you probably haven’t either. The most interesting idea in LBFS is that you can reduce the amount of data you transfer by exploiting commonalities between different files or different versions of the same file. Basically, you compute a hash for every block of every file that is transferred, and then you only send blocks that haven’t already been sent. On the client side, it takes the list of hashes and uses them to reassemble the file. This can give you a dramatic reduction in bandwidth requirements. For example, consider PDB files, the debugging information generated by the Visual C++ compiler: every time you compile another object referencing the same PDB, new symbols are added to it and some indexes are updated, but most of the data remains unchanged.
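Here is a minimal Python sketch of that hash-and-skip idea. It is a simplification, not LBFS itself: it uses fixed-size blocks, whereas LBFS picks block boundaries from the file content so that an insertion doesn't shift every later block. SHA-256, the block size and the function names are my own choices, and the sender peeks directly at the receiver's store rather than asking over the wire which hashes it already has.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; fixed-size blocks keep the sketch short


def split_blocks(data, size=BLOCK_SIZE):
    """Cut data into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def send(data, receiver_store):
    """Return (recipe, payload): the ordered hash list plus only the unseen blocks."""
    recipe, payload = [], {}
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        recipe.append(digest)
        if digest not in receiver_store and digest not in payload:
            payload[digest] = block
    return recipe, payload


def receive(recipe, payload, receiver_store):
    """Store the new blocks, then rebuild the file from the hash list."""
    receiver_store.update(payload)
    return b"".join(receiver_store[d] for d in recipe)


if __name__ == "__main__":
    store = {}
    v1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))  # four distinct blocks
    v2 = v1 + bytes([9]) * BLOCK_SIZE                         # same file plus one new block
    for name, data in (("v1", v1), ("v2", v2)):
        recipe, payload = send(data, store)
        assert receive(recipe, payload, store) == data
        print(f"{name}: {len(recipe)} blocks in file, {len(payload)} sent")
```

The first version sends all four blocks; the second sends only the one block the receiver has not seen. That is the effect you would hope for with something like a PDB file that changes only slightly between compiles.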

Like I said, I don’t know what the solution to this problem is, but there are already some exciting ideas out there, and I’m sure we’ll see even more as cloud computing continues to evolve.