A few years ago I wrote a utility to do resampling analysis on results from codec listening tests. The main reason for writing it was a lack of easily usable (or affordable) tools that do significance analysis on data that doesn't come from a normal distribution, and/or correct the results when many comparisons are being performed simultaneously. The need for the latter is easily forgotten, and naive methods of doing it tend to lose a lot of statistical power.
Other people improved the utility to include more powerful statistical methods, but somewhere along the way the assumption of normality was reintroduced, making it somewhat useless for its original purpose. In addition, it still had some hard-coded limitations from the original version.
I ended up rewriting the tool entirely in Python, removing all hard-coded limitations, making it more configurable, and replacing some of the tests with more robust ones that make fewer assumptions.
More documentation is in the comments of the script. It is licensed under the Affero GPL version 3 or later.
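To give a rough idea of what resampling with a multiple-comparison correction can look like, here is a minimal sketch. It is not taken from the script: the function name `maxt_permutation_pvalues`, its parameters, and the choice of a blocked permutation test with a max-statistic correction are all assumptions made for illustration. It expects one row of ratings per listener, with one score per codec in each row, and assumes that under the null hypothesis a listener's scores are exchangeable between codecs.

```python
import itertools
import random


def maxt_permutation_pvalues(ratings, n_resamples=2000, seed=0):
    """Illustrative sketch (hypothetical helper, not the script's code).

    ratings: list of rows, one per listener; each row holds one score
    per codec, in the same codec order for every listener.
    Returns {(i, j): adjusted p-value} for every pair of codec indices.
    """
    rng = random.Random(seed)
    n_codecs = len(ratings[0])
    pairs = list(itertools.combinations(range(n_codecs), 2))

    def abs_mean_diffs(rows):
        # Absolute difference of mean scores for every codec pair.
        means = [sum(col) / len(rows) for col in zip(*rows)]
        return {(i, j): abs(means[i] - means[j]) for i, j in pairs}

    observed = abs_mean_diffs(ratings)
    exceed = dict.fromkeys(pairs, 0)

    for _ in range(n_resamples):
        # Shuffle codec labels independently within each listener's row.
        shuffled = [rng.sample(row, len(row)) for row in ratings]
        # Comparing against the largest statistic over all pairs is what
        # corrects for performing many comparisons at once.
        max_stat = max(abs_mean_diffs(shuffled).values())
        for pair in pairs:
            if max_stat >= observed[pair]:
                exceed[pair] += 1

    # Add one to numerator and denominator so a p-value is never zero.
    return {pair: (exceed[pair] + 1) / (n_resamples + 1) for pair in pairs}
```

No normal distribution is assumed anywhere in this sketch, and because every pair is compared against the same resampled maximum, the correction tends to cost less power than simply multiplying each p-value by the number of comparisons.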