crash and burn 2012

On March 2, 2012 Peter Svensson hosted Crash and Burn at KTH Forum in Kista, Stockholm. The theme of the conference was integration, testing, deployment and virtualization. It was a great conference and I hope it happens again next year, as it added quite a few software projects to look at until then. Links to the speakers and their presentations follow:

Sam Newman   Designing for rapid release

Don’t design huge monolithic systems, especially if you want fast feedback and fast deployments.

Yan Pujante glu: open source deployment automation platform

You don’t have to build your own deployment system, especially if you are deploying to Java. The glu project provides tons of features for deploying most if not all types of web-based systems (it is currently used by linkedin.com).

Mårten Gustavsson Ops side of Dev

Developers and operations have to work together if you are going to have any chance of a sane production environment. There are a lot of small things, like logging, that benefit from dev and ops agreeing on what to log. Metrics are another key component of good cooperation (check out http://metrics.codahale.com/, or really anything on https://github.com/codahale/).

John Stäck DNS in the Spotify Infrastructure (pdf 2.7 mb)

Lots of good information on how Spotify uses DNS as a distributed data store.

Carl Byström Load testing with locust

Load testing tools should be programmable (i.e. not XML; Python fits well here) and they should reflect what the end user is actually going to do.
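
For reference, a minimal locustfile might look like the sketch below. It uses the current HttpUser-style Locust API (which has changed since this talk was given), and the endpoints and task weights are made up for illustration.

from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # simulated users pause 1-3 seconds between requests
    wait_time = between(1, 3)

    @task(3)
    def start_page(self):
        # hypothetical endpoint -- use the paths your real users actually hit
        self.client.get("/")

    @task(1)
    def video_listing(self):
        self.client.get("/videos")

Run it with locust -f locustfile.py --host http://your-staging-host and scale the number of simulated users to whatever reflects real traffic.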

Leonard Axelsson & Ville Svärd Graphite – the village pump of your team

Collecting metrics on a live system and seeing what your application and its users are doing is invaluable for finding performance issues.
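
As a side note, getting a data point into Graphite takes very little code: you write a "metric-path value timestamp" line to Carbon's plaintext port (2003 by default). A rough sketch, with a made-up hostname and metric name:

import socket
import time

# hypothetical Carbon host; 2003 is the default plaintext listener port
sock = socket.create_connection(("graphite.example.com", 2003))
# Graphite's plaintext protocol: "<metric path> <value> <unix timestamp>\n"
line = "app.requests.count 42 %d\n" % int(time.time())
sock.sendall(line.encode())
sock.close()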

Brian Riddle Continuous Integration the good, bad and ugly

I need to talk a little slower and maybe add a demo. In preparation for this talk I gave a lunch seminar at Valtech’s headquarters; more info and video on their blog. That presentation is here.

Zach Holman Scaling Github

Every time someone from GitHub gives a talk you find interesting tidbits, and the one that struck me the most? GitHub has an employee retention rate of 100% and they are *still* growing. Imagine working for a company like that.

Frozen Rails: Geoffrey Grosenbach (Peepcode screencasts)

Notes

The Frozen Rails conference anno 2011 began with a keynote speech by Geoffrey Grosenbach of Ruby on Rails podcast fame. Nowadays he works on educational screencasts on Peepcode.com.

He starts by referring to Frederick P. Brooks’ now classic book on software development, The Mythical Man-Month, which describes the art of creating software as “Creating something out of nothing”.

As software developers, Grosenbach thinks we should learn from artists. One thing to learn is to delete the first attempt instead of trying to refactor it (quite the opposite of what Joel Spolsky preaches in Things You Should Never Do, Part I). Geoffrey himself makes a new blog layout from scratch for every blog entry and thereby learns new techniques every time. But before even writing the code you should spend some time away from the computer and work on the problem: sketch the database schema, sketch the UI, or sketch the program flow. On paper.

Other important things to do, apart from writing code, are to take naps (get enough sleep), laugh and get physical exercise. The time you take to reflect and think pays off in software quality. One example Grosenbach mentions is the engineer who found the practical solution to the Hubble space telescope’s blurry images while standing in the shower.

A practice from art that unfortunately hasn’t proved successful in software development yet is critique. By giving or receiving critique you are likely to become a better programmer, but critique of the code is often confused with thinking less of the person who wrote it. It is especially hard to give critique of code when it isn’t done in person. One example of thorough code critique comes from the Melbourne Railscamp in November 2009, where GitHub commits were used to comment on the code and suggest improvements.

Septimana horribilis

Back in 1992 Queen Elizabeth II talked about an annus horribilis. Now, since the Romans didn’t have the notion of seven-day weeks it is perhaps a bit far-fetched to talk about a septimana horribilis, but this week has been a week of surprises, none of them very pleasant.

Last Sunday a new season of the reality show Big Brother premiered on TV11, which spawned a large interest in the web cast of the events unfolding in a house on the outskirts of Stockholm. The ratings for the TV broadcast have subsided somewhat since, but the interest on the web has been and is astonishing. Since we modularized the code we wrote for our main video service, TV4 Play, it was possible to release a video site on a tight deadline. We had, however, almost no time for load testing, so the actual premiere was the first real load test. On that first day we saw that the number of calls to the controller fetching updates from Twitter (via Apigee) was in the millions, and there were other parts of the code that could have been optimized. Luckily, one of the benefits of running a service in the cloud is that we can compensate by adding more processing capacity with the flick of a button; by just increasing the number of Heroku dynos we could buy time. Despite this we decided to release a patch a few minutes after midnight that avoids unnecessary checks on the session and increases the interval at which Twitter updates are fetched.

By Tuesday night we are pretty sure that we have found the most critical parts that can be optimized. There are of course things that you cannot control, and on Tuesday night one of the large backbones in Europe runs into difficulties. For a couple of hours a few of our sites are so slow that they take forever to load.

On Wednesday afternoon our content management system, built upon Polopoly, fills up its logs with stack overflow errors. We are in luck and can find the operation that got us there in an hour or so, but it still takes four hours before we have fixed the corrupt database. Our public service competitor SVT also has difficulties with their Polopoly installation this afternoon.

On Thursday the network communication consultants Cygate set up firewall rules that isolate our video management system from the rest of the world. Despite Cygate’s slogan “Alltid där” (“Always there”), the servers aren’t there and none of the videos on our sites can be watched.

On Friday a test on the platform used for video subscriptions goes wrong and for a while no one can log in or watch videos.

Somewhere between Friday and Saturday Vizrt, the company that encodes videos for mobile devices, has a hardware failure, and the problem isn’t fixed until Saturday night, which means that no new content comes out to our iPhone apps.

Since then we have experienced further errors this Sunday and Monday, but we certainly hope we get more time to actually do something productive this week.

optimizing jruby rails 3.0 performance

the latest release of http://www.tv4play.se/ not only added membership but also upgraded the version of rails for tv4play and our api from 2.3.x to rails 3.0.x. most of the application is hosted on heroku, but the api we use to access our content is hosted in our existing java infrastructure. this setup has worked really well: with the help of jruby and warbler we have been able to package our api as a ruby on rails application and deploy it in our java infrastructure. since the start we have not needed to modify any of the default warbler configuration for our api.

that all changed after we upgraded the api to rails 3.0 and put it in production on january 12, 2011. we started seeing that the servlet container hosting our api was crashing with OutOfMemory errors. we were running out of permgen, so we increased the heap and permgen (-Xmx2048m -Xms2048m -XX:MaxPermSize=512m). that relieved our problems with running out of permgen, but the api was still shaky, as the performance was nowhere near what we saw before the upgrade to rails 3.0. since we were running on heroku we could hide this problem pretty well by increasing the number of dynos tv4play was using.

the day after we launched (january 13) we bit the bullet and tried to see if we could replicate the performance problems in our staging environment. we began by parsing the rails log files to get a sample of the urls that were being called, so we could make a test file for pylot, which we have used in the past to generate load in our staging environment. to start off we did a fresh deploy of the api that was in production and warmed it up with a simple test run of 10 agents (-a 10) for 60 seconds (-d 60).

python run.py -a 10 -d 60 -x api4-testcases.xml
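
as an aside, the log-scraping step mentioned above could look roughly like the sketch below. this is a hypothetical reconstruction, not the script we actually used, and the pylot test-case element names (testcases/case/url) are from memory, so check them against the sample file that ships with pylot.

import re
from collections import Counter
from xml.sax.saxutils import escape

BASE_URL = "http://api-staging.example.com"  # made-up staging host
LOG_FILE = "log/production.log"

# rails 3 request lines look like: Started GET "/videos/123.json" for ...
pattern = re.compile(r'Started GET "([^"]+)"')

paths = Counter()
with open(LOG_FILE) as log:
    for line in log:
        match = pattern.search(line)
        if match:
            paths[match.group(1)] += 1

# keep the 50 most requested paths as the load-test sample
with open("api4-testcases.xml", "w") as out:
    out.write("<testcases>\n")
    for path, _count in paths.most_common(50):
        out.write("  <case>\n    <url>%s</url>\n  </case>\n" % escape(BASE_URL + path))
    out.write("</testcases>\n")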

that warm-up run worked, but the throughput was not that great. so we tried running it with a longer duration (-d 300), but the throughput really did not increase much; we were getting around 10-15 requests per second. ok, a longer duration was not the problem, so we tried with 20 agents for 300 seconds. that's when we saw throughput plummet and the error rate go up.

once we had some data to discuss we came up with some possible areas that needed to be looked into and/or fixed. one of them was to go over the configuration of rails and warbler. another was to start using memcached, which we had installed in both our staging and production infrastructure. we started with memcached as we had experience with it from deploying on heroku and it is pretty straightforward. so we deployed the api configured to use memcached and ran the same set of tests again (10 agents for 60 seconds, 10 agents for 300 seconds, and 20 agents for 300 seconds). we got a little better performance but nothing like what we expected. memcached did give us a consistency we did not have before, as we had been seeing some corrupted replies from the api from time to time.

so the next area to look into was the rails configuration. the readme.txt for warbler said that it would detect if threadsafe was enabled and disable runtime pooling. so in config/environments/stage.rb we enabled config.threadsafe!

 # Enable threaded mode
  config.threadsafe!

deploying and running the same set of tests again, we saw the same results: around 10-15 requests per second. ok, next up was to look at our config/warbler.rb and see if there was anything in there that might help. most of the warbler configuration is about getting rack/rails set up, but there were two parameters that had always been commented out. it was friday (the 14th), shortly after 15:00, and most of the team was worn out after a couple of heroic nights of trying to fix this and was just about ready to give up on jruby and move this entire application over to heroku. so, one last hail mary: just uncomment those lines and see what happens.

  # Control the pool of Rails runtimes. Leaving unspecified means
  # the pool will grow as needed to service requests. It is recommended
  # that you fix these values when running a production server!
  config.webxml.jruby.min.runtimes = 2
  config.webxml.jruby.max.runtimes = 4

one deploy later we ran the first batch of tests to warm it up before the real test. that's when i needed a second opinion: the throughput had just increased by roughly a factor of 10.

python run.py -a 10 -d 60 -x api4-testcases.xml

-------------------------------------------------
Test parameters:
  number of agents: 10
  test duration in seconds: 60
  rampup in seconds: 0
  interval in milliseconds: 0
  test case xml: api4-testcases.xml
  log messages: False

Started agent 10

All agents running...


[################100%##################] 60s/60s

Requests: 5460
Errors: 0
Avg Response Time: 0.098
Avg Throughput: 90.49
Current Throughput: 118
Bytes Received: 42546300

but how did it look after 300 seconds? before this change, running 10 agents for 300 seconds *worked*, but at the end of the run it was time to restart the servlet container. so we fired up pylot with 50 agents for 300 seconds. that's right, we skipped right over the max we had tried earlier; i wanted to see if it could take it.

python run.py -a 50 -d 300 -x api4-testcases.xml

-------------------------------------------------
Test parameters:
  number of agents: 50
  test duration in seconds: 300
  rampup in seconds: 0
  interval in milliseconds: 0
  test case xml: api4-testcases.xml
  log messages: False

Started agent 50

All agents running...


[################100%##################] 299s/300s

Requests: 29379
Errors: 0
Avg Response Time: 0.4471
Avg Throughput: 97.93
Current Throughput: 180
Bytes Received: 226881811

at this point the entire development team was watching a terminal and, before the test was even finished, saying that the ban on friday deploys be damned, this *had* to go out to production.

up until that point i don't think everybody trusted jruby as a viable deployment environment. the performance we were getting now, as well as the ease of deployment in our java infrastructure, pretty much shelved that conversation. over the weekend we saw a dramatic increase in the stability of tv4play as a result of enabling those 2 parameters. we still had one issue affecting the performance of tv4play, but that has now been resolved and tv4play is more performant than it has ever been.

i spent a good deal of monday trying different values for config.webxml.jruby.min.runtimes and config.webxml.jruby.max.runtimes, but 2 and 4 seem to work quite well, and i recommend anybody deploying with jruby and rails/sinatra/rack to just uncomment those lines and not worry about it.

Visualizing your code!

Source code visualization tools can produce amazing, shiny and pretty useless videos. CodeSwarm is a well-known tool in many OSS projects; another one is Gource. Gource is a tool for visualizing the commit history in your version control system. It produces an easy-to-understand repository tree showing active areas and users. To produce a video, simply run this command.

gource -s 0.03 --auto-skip-seconds 0.1 --file-idle-time 500 --max-files 500 --multi-sampling -1280x720 --stop-at-end   --output-ppm-stream - | ffmpeg -y -b 3000K -r 24 -f image2pipe -vcodec ppm -i - -vcodec mpeg4 gource.mp4

Here is the result for the first 6 months of tv4play.

Ok, that was nice!

But how could we use visualization tools for something more meaningful?

  • It is possible to identify new development, large refactorings, collaboration and development speed.
  • Use visualization tools to evaluate an OSS project, e.g. is the community solid?
  • Explain how software collaboration works and how developers come and go.

Watch this Michael Ogawa video to learn more about software visualization.
http://vimeo.com/3914346