the latest release of http://www.tv4play.se/ not only added membership; we also upgraded rails for tv4play and our api from 2.3.x to 3.0.x. most of the application is hosted on heroku, but the api we use to access our content is hosted in our existing java infrastructure. this setup has worked really well: with the help of jruby and warbler we have been able to package our api, a ruby on rails application, and deploy it in our java infrastructure. since the start we have not needed to modify any of the default warbler configuration for our api.
that all changed after we upgraded the api to rails 3.0 and put it in production on january 12, 2011. we started seeing the servlet container that hosts our api crash with OutOfMemory errors. we were running out of permgen, so we increased the heap and permgen (-Xmx2048m -Xms2048m -XX:MaxPermSize=512m). that relieved our problems with running out of permgen, but the api was still shaky and its performance was nowhere near what we saw before the upgrade to rails 3.0. since tv4play itself runs on heroku, we could hide this problem pretty well by increasing the number of dynos tv4play was using.
the day after we launched (january 13) we bit the bullet and tried to see if we could replicate the performance problems in our staging environment. we began by parsing the rails log files to get a sample of the urls that were being called so we could make a test file for pylot, which we have used in the past to generate load in our staging environment. to start off we did a fresh deploy of the api version that was in production and warmed it up with a simple test run of 10 agents (-a 10) for 60 seconds (-d 60).
python run.py -a 10 -d 60 -x api4-testcases.xml
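a minimal sketch of the log-parsing step, not the script we actually used. it assumes the rails 3 log format, where every request starts with a `Started GET "/path" for ...` line, that only GET requests are sampled, and that pylot reads a `<testcases>` file with one `<case>`/`<url>` pair per request. the helper names, the limit of 100 and the base url passed on the command line are made up for illustration.

```ruby
# Characters that must be escaped inside an XML text node.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }

# Rails 3 logs the start of each request like:
#   Started GET "/some/path?x=1" for 127.0.0.1 at 2011-01-13 10:00:00 +0100
STARTED_GET = /^Started GET "([^"]+)"/

def xml_escape(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

# Return up to `limit` distinct GET paths, in the order they first appear.
def sample_paths(log_lines, limit = 100)
  log_lines.map { |line| line[STARTED_GET, 1] }.compact.uniq.first(limit)
end

# Build the contents of a pylot test case file: one <case> per path.
def pylot_testcases(base_url, paths)
  cases = paths.map do |path|
    "  <case>\n    <url>#{xml_escape(base_url + path)}</url>\n  </case>"
  end
  "<testcases>\n#{cases.join("\n")}\n</testcases>\n"
end

# Hypothetical usage: ruby make_testcases.rb log/production.log http://staging-host:8080
if __FILE__ == $0 && ARGV.size == 2
  log_file, base_url = ARGV
  paths = sample_paths(File.foreach(log_file))
  File.write("api4-testcases.xml", pylot_testcases(base_url, paths))
end
```

deduplicating the paths keeps the file small, at the cost of losing how often each url was really called; a weighted sample would be closer to production traffic.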
the warm-up run worked but the throughput was not that great. so we tried running it with a longer duration (-d 300), but the throughput really did not increase much: we were getting around 10-15 requests per second. ok, a longer duration was not the problem, so we tried with 20 agents for 300 seconds. that’s when we saw throughput plummet and the error rate go up.
once we had some data to discuss we came up with some possible areas that needed to be looked into and/or fixed. one of them was to go over the configuration of rails and warbler. another was to start using memcached, which we had installed in both our staging and production infrastructure.
we started with memcached as we had experience with it from deploying on heroku and it is pretty straightforward. so we deployed the api configured to use memcached and ran the same set of tests again (10 agents for 60 seconds, 10 agents for 300 seconds, and 20 agents for 300 seconds). we got a little better performance but nothing like what we expected. memcached did give us a consistency we did not have before, as we had been experiencing some corrupted replies from the api from time to time.
so the next area to look into was the rails configuration. the readme.txt for warbler said that it would detect if threadsafe was enabled and disable runtime pooling. so in config/environments/stage.rb we enabled config.threadsafe!
# Enable threaded mode
config.threadsafe!
deploying and running the same set of tests again, we saw the same results: around 10-15 requests per second. ok, next up was to look at our config/warbler.rb and see if there was anything in there that might help. most of the warbler configuration is about getting rack/rails set up, but there were two parameters that had always been commented out. it was friday (the 14th) shortly after 15:00, and most of the team was worn out after a couple of heroic nights of trying to fix this and was just about to give up on jruby and move this entire application over to heroku. so, one last hail mary: just uncomment those lines and see what happens.
# Control the pool of Rails runtimes. Leaving unspecified means
# the pool will grow as needed to service requests. It is recommended
# that you fix these values when running a production server!
config.webxml.jruby.min.runtimes = 2
config.webxml.jruby.max.runtimes = 4
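for context, a sketch of where those two lines sit in config/warbler.rb, assuming the stock layout of the generated file. warbler writes them into web.xml as the jruby.min.runtimes and jruby.max.runtimes context parameters, which jruby-rack uses to size its pool of rails runtimes. everything else in the file is left out here.

```ruby
# config/warbler.rb (excerpt)
Warbler::Config.new do |config|
  # keep at least two rails runtimes warm...
  config.webxml.jruby.min.runtimes = 2
  # ...and never let the pool grow beyond four
  config.webxml.jruby.max.runtimes = 4
end
```

each runtime loads its own copy of the application, so a capped pool also caps how much heap and permgen the application can claim.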
one deploy later we ran the first batch of tests to warm it up before the real test. that’s when i needed a second opinion: the throughput had just jumped from 10-15 requests per second to around 90.
python run.py -a 10 -d 60 -x api4-testcases.xml
-------------------------------------------------
Test parameters:
  number of agents:          10
  test duration in seconds:  60
  rampup in seconds:         0
  interval in milliseconds:  0
  test case xml:             api4-testcases.xml
  log messages:              False

Started agent 10
All agents running...

[################100%##################]  60s/60s

Requests:  5460
Errors:  0
Avg Response Time:  0.098
Avg Throughput:  90.49
Current Throughput:  118
Bytes Received:  42546300
but how did it look after 300 seconds? before this change, running 10 agents for 300 seconds *worked*, but at the end of the run it was time to restart the servlet container. so we fired up pylot with 50 agents for 300 seconds. that’s right, we skipped right over the max we had tried earlier; i wanted to see if it could take it.
python run.py -a 50 -d 300 -x api4-testcases.xml
-------------------------------------------------
Test parameters:
  number of agents:          50
  test duration in seconds:  300
  rampup in seconds:         0
  interval in milliseconds:  0
  test case xml:             api4-testcases.xml
  log messages:              False

Started agent 50
All agents running...

[################100%##################]  299s/300s

Requests:  29379
Errors:  0
Avg Response Time:  0.4471
Avg Throughput:  97.93
Current Throughput:  180
Bytes Received:  226881811
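as a quick sanity check on those numbers, here is a small sketch that recomputes throughput as requests divided by test duration and compares it with the 10-15 requests per second we saw before the change. the helper names are made up for illustration, and pylot’s own average for the 60 second run differs slightly from this simple division, presumably because of how it measures elapsed time.

```ruby
# requests per second = completed requests / test duration in seconds
def throughput(requests, seconds)
  requests.to_f / seconds
end

# how many times faster a run is than the old 10-15 req/s baseline,
# returned as [against the best old result, against the worst old result]
def speedup_range(rps, baseline = 10.0..15.0)
  [(rps / baseline.end).round(1), (rps / baseline.begin).round(1)]
end

runs = {
  "10 agents, 60s"  => throughput(5460, 60),
  "50 agents, 300s" => throughput(29_379, 300)
}

runs.each do |label, rps|
  low, high = speedup_range(rps)
  puts format("%s: %.2f req/s, %sx to %sx the old throughput", label, rps, low, high)
end
```

by this rough measure both runs land somewhere between six and ten times the old throughput, with zero errors reported in either run.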
at this point the entire development team was watching a terminal, and before the test had even finished they were saying that the ban on friday deploys be damned, this *had* to go out to production.
up until that point i don’t think everybody trusted jruby as a viable deployment environment. the performance we were getting now, as well as the ease of deployment in our java infrastructure, pretty much shelved that conversation. over the weekend we were able to see a dramatic increase in the stability of tv4play as a result of enabling those 2 parameters. we still had one issue that was affecting the performance of tv4play, but that has now been resolved and play is more performant now than it has been before.
i spent a good deal of monday experimenting with different values for config.webxml.jruby.min.runtimes and config.webxml.jruby.max.runtimes, but 2 and 4 seem to work quite well, and i recommend that anybody deploying with jruby and rails/sinatra/rack just uncomment those lines and not worry about it.