Friday, June 6, 2014

Comments on “My Thoughts on Unit Testing”

I wrote up some thoughts on unit testing for my team, and I also posted them here. Another engineer wrote a well-thought-out response, and it made sense to me to post that response as well. I am duplicating it here with his verbal permission.

Comments on “My Thoughts on Unit Testing”
By Dave Hinman

I agree with most of Will’s document. There are just a few comments I want to make, and a few additional points I want to raise.
In the “Thesis” section, Will commented:
… 3) There is a relationship between how difficult the code is to test and the quality of the unit test you generate for this code. When the code is harder to test the resulting unit tests are of lower quality.
While it can be true that such tests would be of lower quality, we should strive to make them as high quality as possible. And I think it’s important to note that test quality can be a complicated notion. One aspect of it is how completely the test characterizes the code under test; I think that is what Will is talking about here. But another aspect of quality is the value the test brings. An incomplete test of complicated code can be of greater value than a complete test of simpler code, if it helps prevent an erroneous change.
… Even if you do not believe the third idea you still should accept that to maximize the effectiveness of your efforts you should focus on high quality tests and/or easily constructed tests.
I strongly agree about the importance of high quality tests. But the question of maximizing effectiveness is an interesting one. One perfectly valid way to look at it is that we have a certain amount of effort to expend on testing, and we should get the most from that effort; I think that aligns with Will’s position here. But suppose you were talking about coding instead of testing. While it’s true that some parts of the code will be harder to write than others, we don’t consider the coding done until we’ve done it all.
Now consider testing again. True, if we define the goal of testing as hitting some particular coverage number, it might make sense to get to that number the easiest way possible. But if we think there are certain parts of the code we want to test, we wouldn’t consider the testing done until we had tested those parts of the code. The question of maximizing effectiveness doesn’t really come into it.

Safe Refactoring

Will argues that unit tests don’t help with refactorings big enough to involve moving responsibilities between methods. I think this is where we have the biggest disagreement. As he points out, refactoring is both very important and potentially risky, and we want to do what we can to reduce that risk so as to encourage refactoring. As he also points out, refactoring can make tests break. He views this as an impediment to refactoring, but to me it can help make sure nothing is overlooked while refactoring.
It seems to me that the ideal is that when we refactor the code, either the refactoring is valid or some unit tests break. The unit tests are characterizing the behavior and responsibilities of the units of code. When we move those responsibilities around, tests will break. Then we go through the failed tests to understand why they failed. If we’ve moved a responsibility R from method A to method B, we would expect the test(s) for R in A to fail; we will be reminded to enhance the unit tests for B to include testing for R as we modify the A tests to make them pass. But suppose in our refactoring we were moving so many responsibilities that we overlooked R and didn’t put it anywhere. Looking at the test failures would hopefully remind us of that fact. If we didn’t have unit tests checking for responsibilities, it would be (in my view) much easier to forget about them in our refactoring. So while Will sees the failing tests as a cost of the refactoring, I see a corresponding benefit that to me outweighs the cost.
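To make that concrete, here is a minimal sketch of a test that characterizes one responsibility. The UserService and User classes are hypothetical, and I’m assuming JUnit; the point is the shape of the test, not the particular API:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class UserServiceTest {

    // Characterizes responsibility R: "registration stores the email
    // address lower-cased." Today R happens to live inside
    // UserService.register() itself.
    @Test
    public void registerStoresEmailLowerCased() {
        UserService service = new UserService();

        User user = service.register("Alice@Example.COM");

        // If a refactoring moves the lower-casing into, say, a new
        // EmailNormalizer that register() calls, this test keeps
        // passing. But if we move so much around that R gets dropped
        // entirely, the failure reminds us it has to live somewhere.
        assertEquals("alice@example.com", user.getEmail());
    }
}
```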

What makes a good test?

Will identifies the concept of testing for identicalness, and points out correctly that tests that are really just for identicalness are not helpful. But there is a fine line involved, particularly in “glue” code as he defines it.
Consider a method loadAndCleanData() that loads up some data, sorts it, and removes duplicates. Say for the sake of argument that it does its work by calling loadData(), then sortData(), then removeDuplicates(). I would say that those three things are the responsibilities of this method, and a proper unit test of the method would establish that it fulfills these three responsibilities. One reasonable way to do that testing is inductively: assert that it calls these three methods, and rely on the unit testing of each of those methods to establish that they load, sort, and de-dup respectively. You could view that as a test of identicalness, because if you remove one of those calls from loadAndCleanData(), it will fail. But I would say that if you remove one of those calls, the method no longer fulfills its responsibilities, so the unit test ought to fail. I’ll have more to say about this example later on.
So I agree that a test that is just for identicalness is not useful, but a test for responsibilities might look like an identicalness test and still be valid. It’s important that the test for responsibilities not over-characterize the behavior, so that it preserves flexibility. For example, if the order of sorting and de-duping doesn’t matter, the test should not insist on a particular order.
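Here is a sketch of what that inductive test could look like. I’m assuming JUnit and Mockito, and a hypothetical DataPipeline class whose three helper methods are void and overridable:

```java
import static org.mockito.Mockito.*;

import org.junit.Test;

public class DataPipelineTest {

    @Test
    public void loadAndCleanDataFulfillsItsThreeResponsibilities() {
        // Spy on the real object so we can intercept its own calls,
        // and stub the helpers so the test stays isolated.
        DataPipeline pipeline = spy(new DataPipeline());
        doNothing().when(pipeline).loadData();
        doNothing().when(pipeline).sortData();
        doNothing().when(pipeline).removeDuplicates();

        pipeline.loadAndCleanData();

        // The correctness of loading, sorting, and de-duping is
        // established inductively by each helper's own unit tests;
        // here we only establish that the responsibilities are met.
        verify(pipeline).loadData();
        verify(pipeline).sortData();
        verify(pipeline).removeDuplicates();
        // Plain verify() deliberately does not constrain the relative
        // order of the calls, so if the order doesn't matter, the
        // test doesn't over-characterize it.
    }
}
```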

When does a test fail, part 2

Will writes:
 For major refactoring, almost all unit tests will fail and need to be updated. This is because the unit of test will change since methods will change and responsibility will be redistributed between the methods.
We agree that some of the tests will fail (though my experience is that it’s not as many as Will suggests), but in my view the effort of analyzing the test failures helps ensure that we don’t overlook anything when we do the refactoring. So the failing tests will slow the refactoring down, but as I see it they add corresponding value, helping to identify responsibilities that we may have overlooked.

Mirrors

Will is concerned about inappropriate use of mocking:
…Another similar pattern I have seen is the mocking of all calls. Testing frameworks have been built to allow you to intercept and mock external calls. This is useful because you want to isolate code, but this tool can be used to essentially copy all the calls and make a test saying, this method works if it makes all the calls that it happens to make. This tends to be a test of identicalness. Say for example, you discover a new method that works better for your purposes. The test then fails because you didn’t make the exact same calls. This kind of test is like a mirror held up to all the calls of the method.

I agree that testing just identicalness is not helpful and is restrictive, but I think testing the responsibilities of a method can look like an identicalness test and still be valid. As I said earlier, tests characterizing responsibilities should try not to over-characterize the code.
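To illustrate the difference, here is a sketch of the same behavior tested both ways. The ReportService, HttpClient, and Report types are hypothetical, and I’m again assuming JUnit and Mockito:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import org.junit.Test;

public class ReportServiceTest {

    // Mirror style: pins every call buildReport() happens to make.
    // Switching to a better endpoint or batching the requests breaks
    // this test even when the resulting report is still correct.
    @Test
    public void mirrorStyleTestOverCharacterizes() {
        HttpClient client = mock(HttpClient.class);
        when(client.get(anyString())).thenReturn("[]");
        ReportService service = new ReportService(client);

        service.buildReport();

        verify(client).get("/users");
        verify(client).get("/orders");
        verifyNoMoreInteractions(client);
    }

    // Responsibility style: checks the outcome the method is
    // responsible for, leaving it free to fetch the data however it
    // likes when someone refactors it.
    @Test
    public void responsibilityStyleTestChecksTheOutcome() {
        HttpClient client = mock(HttpClient.class);
        when(client.get(anyString())).thenReturn("[]");
        ReportService service = new ReportService(client);

        Report report = service.buildReport();

        assertEquals(0, report.entryCount());
    }
}
```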

What kind of code is good to test?

I think Will has identified a valid distinction between logic and glue code, and I agree strongly with his ideas about separating logic and glue code, and keeping glue code as simple as possible. But I disagree that glue code is hard to unit test, and I disagree that testing it has little value. In general, glue code has associated responsibilities, and we want to protect against those responsibilities being innocently removed without warning.
Consider loadAndCleanData() again. Suppose removeDuplicates() only works on sorted data. Ideally, its name would be removeDuplicatesFromSortedData(), but suppose that is not the case. If we find a performance problem in loadAndCleanData(), and a new person comes to the code, one could easily imagine them thinking “We do client-side sorting anyway; let me just remove the call to sortData() and the whole thing will run much faster.” This kind of error seems very possible; how do we prevent it?
One way would be for removeDuplicates() to check its inputs to confirm they are sorted. But that would impose an unpleasant burden at runtime, and we wouldn’t find out about the problem until later. The easiest way to prevent this incorrect refactoring would be for the unit tests of loadAndCleanData() to include a test ensuring that the data was sorted before removeDuplicates() was called. It would be particularly helpful if the test message explicitly stated that removing duplicates requires sorted data. When the refactorer ran the unit tests, they would see that removing the sorting is not okay. If a subsequent change modified removeDuplicates() so that it no longer required sorted data, and the sorting was then removed, the unit test would fail; but the developer doing the refactoring would recognize that the sorting was no longer needed, and would quickly modify the test to remove that responsibility.
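Mockito’s InOrder makes this kind of guard easy to write. Continuing with the hypothetical DataPipeline from the earlier sketch:

```java
import static org.mockito.Mockito.*;

import org.junit.Test;
import org.mockito.InOrder;

public class DataPipelineOrderingTest {

    // removeDuplicates() only works correctly on sorted data, so
    // sortData() must run first. If removeDuplicates() is ever
    // changed to handle unsorted input, this test (and the sort
    // call, if it becomes unnecessary) can be removed.
    @Test
    public void dataIsSortedBeforeDuplicatesAreRemoved() {
        DataPipeline pipeline = spy(new DataPipeline());
        doNothing().when(pipeline).loadData();
        doNothing().when(pipeline).sortData();
        doNothing().when(pipeline).removeDuplicates();

        pipeline.loadAndCleanData();

        // InOrder pins exactly the one ordering that matters, and
        // nothing else, so the test doesn't over-characterize.
        InOrder inOrder = inOrder(pipeline);
        inOrder.verify(pipeline).sortData();
        inOrder.verify(pipeline).removeDuplicates();
    }
}
```

The test name and comment carry the message: a refactorer who breaks this test learns immediately why the sorting is there.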
While I think this logic/glue distinction is a real one, one concern I have is about the ability of tooling to recognize the difference.

Code Metrics

Will correctly points out that high coverage doesn’t mean either that the code is well tested or that it is of good quality. It’s always possible to write bad tests. Nor does low coverage mean that the code is of bad quality. But it does mean that the code has not been well unit tested. I believe that if code has not been tested, you don’t know whether it works. Integration testing and end-to-end testing are an important part of the testing picture, but it’s hard to ensure that such testing has reached all the parts of the code. The easiest way to ensure that a part of the code is tested is to unit test it. That way we find errors earlier.

A few other points

These are things that weren’t in Will’s document but that I wanted to mention.

Unit tests as documentation

I believe it’s very useful, both for somebody new to the code and for somebody coming back to it after a long while, to have some sort of characterization of what the code is supposed to do, including how it behaves in corner cases. I have seen a number of bugs that were essentially disagreements about where a responsibility lay: in the caller or the callee. If the callee’s behavior is well documented, it’s easy to resolve these disagreements: anything the callee says it does, it’s supposed to do; the rest is the caller’s problem.
One common way to do this documentation is JavaDoc (or its equivalent). But a common complaint is that such documentation can become out of date. I think the best way to do that documentation is in the form of unit tests. Because the tests run with every build, they are guaranteed not to get out of date.
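For example, here is a sketch of documentation-by-test, assuming JUnit and a hypothetical StringSplitter utility. Each test name states one piece of the contract, including the caller-vs-callee boundary:

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.Arrays;

import org.junit.Test;

public class StringSplitterContractTest {

    @Test
    public void splittingTheEmptyStringYieldsAnEmptyList() {
        assertTrue(StringSplitter.split("").isEmpty());
    }

    @Test
    public void whitespaceAroundTokensIsTrimmed() {
        assertEquals(Arrays.asList("a", "b"),
                     StringSplitter.split(" a , b "));
    }

    // The callee documents that it does not accept null; passing one
    // is the caller's bug, which settles the caller-vs-callee
    // question before it becomes a disagreement.
    @Test(expected = NullPointerException.class)
    public void nullInputIsTheCallersProblem() {
        StringSplitter.split(null);
    }
}
```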

The power of 100%

I think the most important thing is that we all agree on what code we want to test. I think using a coverage metric is very important, to help encourage us to do the testing we know we need to do. I think we should agree on some level that we can instrument, be it 2% or 100%.
One benefit of 100%, though, is that it can eliminate a temptation that I myself have fallen prey to. Suppose we think 42% is the test coverage we need. I make some changes, and the code coverage drops to 40%. The best thing for me to do is to better test the changes I’ve just made. But if I need to get home to watch “Fargo”, I might be tempted to go test shopping: looking for some other part of the code that would be easier to test. As Will has pointed out, it’s best if I test my own code as I’m developing it: I’m in the best position to refactor it, and I will likely learn something that will make my subsequent coding and testing better. Going for 100% prevents me from doing this shopping.

Conclusion

I actually agree with all of Will’s conclusions except one: Don’t Test Everything – Isolate Glue. I think that testing the glue code as well (to whatever level we all agree on) will help drive us to keep the glue code simple and provide refactoring protection, and I don’t think it’s hard to test (especially if we keep it simple).
I would add one more conclusion:

Choose a coverage level and enforce it – builds fail if the coverage level is not met

We want to make sure that the testing happens as we are developing, because that is when the code is freshest in the developer’s mind, and they can apply lessons learned in subsequent coding and testing. When I set up the build infrastructure, I deliberately prevented lint errors in the source code from failing the build, because we are a democracy and to do otherwise seemed heavy-handed. But I think that was the wrong decision. It resulted in lint errors getting cleaned up at the end, and kept each of us from learning from our own code patterns. I think we should pick some coverage level, and force the build to fail if the coverage level is not attained.


