Wednesday, June 4, 2014

My Thoughts on Unit Testing

I wrote this document for my team and put a decent amount of thought into it, so I am going to paste it into my blog.

Thesis
If you have a nail you need to hammer in, you need a hammer. If you have a screw you want to screw in, you need a screwdriver. A hammer does a poor job of screwing in screws. A screwdriver does a poor job of hammering in nails. Unit tests are a tool. I believe that unit tests, like every single tool in existence, are not useful in all situations.

I want to propose a few ideas. 1) Tests range in usefulness. One unit test may be better than another. A unit test could even have such low quality as to be useless. 2) Certain code lends itself to unit testing more than other code. Some code is easily tested and some code is very difficult to test. 3) There is a relationship between how difficult the code is to test and the quality of the unit test you generate for this code. When the code is harder to test the resulting unit tests are of lower quality.

If you accept these ideas then it follows that to maximize the value of your unit tests you should focus on writing good tests on code that is easily unit tested. Even if you do not believe the third idea, you should still accept that to maximize the effectiveness of your efforts you should focus on high-quality tests and/or easily constructed tests.

So we need to understand the value of unit testing and how to create effective tests. We should also understand what code lends itself to testing. We want to get the most value out of the tests we write and not waste effort on activities that will not produce value in the long run.

Tiny Bit of History
Okay, first a tiny bit of history. Unit testing started coming to the fore in the late 90's as part of the set of Agile methodologies. Unit testing was a focus of Agile because of the emphasis on code being able to tolerate rapid change. Agile and unit testing are now considered standard parts of engineering. Related to unit testing is Test Driven Development, the practice of writing unit tests alongside and slightly before you write code.

Benefits
Effects on Coding
One of the key benefits of unit testing is forcing engineers to write code that is unit testable. If you are going to write unit tests you want to make sure the code you are writing won't cause a problem when you write the tests. So let me take that and ask, what makes my code easy or hard to unit test?

Pure Functions
Let's start with the notion of a 'pure function'. What programmers call a 'pure function' is what mathematicians call a 'function'. Programmers took the word function and abused the hell out of it till it meant something different. (We didn't do it on purpose. It just sort of happened.) So a pure function is a method that will always return the same thing for a given input. Pure functions are much easier to reason about because they have consistent behavior. They will also be more robust since they are only affected by incoming variables and are not prey to uncertain external state. Functional programming tends to stress having functions be pure. And it turns out pure functions are very easy to unit test. You have a set of inputs and you know the exact outputs they will generate.
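
As a rough illustration (the function names here are made up for the example), compare a pure function with one that depends on hidden state:

    // Pure: the result depends only on the arguments, so the same
    // inputs always produce the same output.
    function applyDiscount(price, discountRate) {
        return price * (1 - discountRate);
    }

    // Impure: the result also depends on external state that can change
    // between calls, so the same input may produce different outputs.
    var currentDiscountRate = 0.1;
    function applyCurrentDiscount(price) {
        return price * (1 - currentDiscountRate);
    }

Testing the first is just a matter of calling it and checking the return value; testing the second means controlling currentDiscountRate as well.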

Note that the idea of 'pure functions' is somewhat at odds with the notion of object-oriented programming, because object-oriented programming focuses on passing a hidden this parameter to every function. This does not mean that object-oriented languages and pure functions cannot co-exist peacefully.

We know that pure functions are easy to reason about, more stable, and very easy to test. So if we want to write easily unit testable code we should strive to contain as much of our code as possible in pure functions. Any time you write code you should think about whether you can isolate a piece of this code into a pure function. Because we are using a functional (note that JavaScript is functional, but not 'pure functional') programming language we have flexibility with functions that will help us do this. And if we write code as pure functions we have code that is clear and easy to test. So unit testing helps to create an incentive to write pure functions, which makes our code better.

Cyclomatic Complexity / KISS Principle / Single Responsibility Principle
When you have a method that does an enormous amount of stuff it is very hard to test. Imagine a piece of code with a lot of if statements that does very different things depending on various conditions. This code becomes hard to test. The tests have to set up precise conditions to trigger different sections of code. The tests become more and more intricate and harder to understand. The simple step to solve this problem is to break up the code into smaller pieces. This makes it easier to write tests for the code in question.

It is hard to deal with complexity. Unit tests make that pain more immediate, so they provide an incentive to break down complex methods in order to make life easier. This leads to improved code that is simpler.
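
Here is a rough sketch of the idea (the functions are hypothetical, invented for the example): a branchy method broken into smaller pieces that can each be tested directly.

    // Harder to test: one function mixes several decisions.
    function formatUserLabel(user) {
        var name;
        if (user.preferredName) {
            name = user.preferredName;
        } else if (user.firstName && user.lastName) {
            name = user.firstName + ' ' + user.lastName;
        } else {
            name = 'Unknown';
        }
        return user.isAdmin ? name + ' (admin)' : name;
    }

    // Easier to test: each small function can be exercised on its own.
    function displayName(user) {
        if (user.preferredName) { return user.preferredName; }
        if (user.firstName && user.lastName) { return user.firstName + ' ' + user.lastName; }
        return 'Unknown';
    }
    function withAdminSuffix(name, isAdmin) {
        return isAdmin ? name + ' (admin)' : name;
    }
    function formatUserLabelSimplified(user) {
        return withAdminSuffix(displayName(user), user.isAdmin);
    }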

External Dependencies / Coupling  (Not the British sit-com)
External dependencies make your code harder to test. To unit test a method with external dependencies you need to generate mocks or maybe some interception scheme. These schemes can often be complex and hard to work with, making the test more complex than the code it is testing. So to make unit testing easier you try to minimize or isolate these dependencies.

Note that external dependencies are what is sometimes referred to as 'coupling'. Coupling is a measurement of how dependent an object is on other objects. Unit tests give us the incentive to minimize and isolate external dependencies and reduce coupling.
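
As a small sketch of what this looks like in practice (the names here are invented for the example), compare a function that reaches for its own dependency with one that has the dependency passed in:

    // Coupled: the dependency is reached for inside the function, so a test
    // has to intercept or mock whatever fetchUserFromServer talks to.
    function getUserName(id) {
        return fetchUserFromServer(id).name;
    }

    // Less coupled: the dependency is passed in, so a test can hand in a
    // trivial stub instead of standing up the real data source.
    function getUserNameFrom(fetchUser, id) {
        return fetchUser(id).name;
    }

    // In a test, a plain function stands in for the dependency:
    var fakeFetch = function () { return { name: 'Ada' }; };
    getUserNameFrom(fakeFetch, 42);   // 'Ada'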

Maximizing this Principle
Unit tests incentivize the writing of pure functions, the reduction of cyclomatic complexity, and the reduction of coupling. This means that unit tests incentivize us to write good code. But this brings up the issue of when we write those tests. If unit tests are to help drive us to create better code, we need to be able to change our code while we write tests. If we generate the code and then view it as complete before we write tests, we lose most of the benefit of unit testing pushing us to code better. Test driven development (TDD) says that you should write the tests as you code and that coding and testing should be a tightly interwoven process.

If we do not change code in response to creating unit tests, we will get into situations where our unit tests have to employ special strategies that may be complex or not robust. Using these strategies to work around code that resists testing circumvents the positive effect unit testing has on code quality. It is a far better strategy to write the unit tests closer to the code so the code can be adapted to the needs of the tests.

Contracts
Unit tests also provide a way of defining a contract. You make a statement about what a piece of code does and you have a guarantee that this is true. This is very useful in some scenarios, but not necessarily all scenarios. Application Programming Interfaces (APIs) are sets of methods to interact with a component. They essentially act as a contract for a piece of code. You want them to have very specific behavior and you want to be sure that they always keep this behavior. Unit tests are a good way to be certain that an API continues to behave as promised.
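
For example, a contract test might look something like this sketch (I am assuming a Jasmine-style describe/it/expect syntax here, and slugify is a made-up function):

    describe('slugify', function () {
        it('lower-cases and replaces spaces with dashes', function () {
            expect(slugify('  My Blog Post ')).toEqual('my-blog-post');
        });
        it('returns an empty string for empty input', function () {
            expect(slugify('')).toEqual('');
        });
    });

As long as these tests pass, callers of slugify can rely on that documented behavior.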

APIs are usually thought of as ways to interact with third party components, but it is possible to have a large software product with boundaries between teams defined by APIs. This makes APIs even more useful because they can act as a means of communication. One team can produce tests to highlight the behavior they require, while the other team can then be sure to meet those tests. In the past I have crafted failing unit tests for another team's component to show them how their component was flawed. This allowed them to fix the issue and then directly run my test to confirm the problem was fixed. They could then add this test to their suite to make sure this contract was fulfilled going forward.

Safe Refactoring
Code can often be difficult to understand. Reading code to understand it is a different skill than writing code to accomplish a task. Code is also often modified to produce new behavior, and this modification can make the code harder to understand. Refactoring refers to rewriting code to make it clearer. Refactoring has a lot of value because easy-to-understand code is more maintainable. However, changing code introduces the risk that its behavior changes, which could introduce a defect. Some organizations take a "don't touch it" approach to code.

Unit tests allow us to refactor code without concern. Note that this really only applies when the full output of the method under test is tested. With unit tests providing a safety net you can refactor easily. One caveat is that while these unit tests help with refactoring within a method, they do not help when the refactoring cuts across method boundaries and changes which methods exist and how they interoperate. So they help with small refactorings, but can actually make big refactorings harder.

Tests to help with refactoring can also be a little different from other tests. Tests for refactoring often want to ensure output is consistent. I will talk a little more about this later on. Unit tests for refactoring can often make sense to write after the code is complete. You can create tests to ensure you do not introduce any changes. Again, the challenge becomes making sure your unit tests capture the behavior. If your code is hard to unit test it becomes less likely that your tests will be able to capture behavior and more likely that refactoring will introduce problems.

Test Driven Development
Test Driven Development is the process of writing unit tests slightly before writing production code. In this process no production code is written unless it is to make a test pass. One of the goals is to have extensive unit testing. Many people consider this the gold standard for programming. If you have tried Test Driven Development you have probably shared the experience that the quality is much higher and that TDD helps you find errors in cases you normally wouldn't have considered much. Despite the fact that I have such a high opinion of TDD, I actually rarely use it, because many problems do not lend themselves to TDD. For example, say I want to write a piece of code that produces a specific bit of markup. My test would be to write the markup first. I am not convinced this is a useful test. Say that I have code that gathers data from two sources. I immediately have to write some sort of mock in my test saying something like 'did I call method X', then write the call to this method. Again, in the right circumstances TDD seems very useful, but in other circumstances it seems to be a hindrance. When you have a problem you can ask yourself a few questions: Can I write tests before I write the code? What is easier, writing the tests or the code? Do the tests require the code to essentially be written first because there are dependencies?
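
For a problem that does lend itself to TDD, the rhythm is roughly this (a sketch using a made-up median function and Jasmine-style assertions):

    // Step 1: write failing tests for behavior that does not exist yet.
    describe('median', function () {
        it('returns the middle value of an odd-length list', function () {
            expect(median([3, 1, 2])).toEqual(2);
        });
        it('averages the two middle values of an even-length list', function () {
            expect(median([4, 1, 2, 3])).toEqual(2.5);
        });
    });

    // Step 2: write just enough code to make the tests pass.
    function median(values) {
        var sorted = values.slice().sort(function (a, b) { return a - b; });
        var mid = Math.floor(sorted.length / 2);
        return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }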

What makes a good test?
If we want to maximize the value of the effort we put into testing we need to figure out how to write a valuable test. So we need to understand what a valuable test is. I will try to offer some rough advice on how to identify a useful test and I will introduce a categorization as to whether a test checks correctness, consistency, or identicalness.

When does a test fail?
When creating unit tests I believe a very important question to ask is, "when will this test fail?" There are a couple of possible answers to this: 1) a unit test will fail when the code under test no longer behaves correctly (fails correctness), 2) a unit test will fail when the code under test produces a different result than it used to (fails consistency), 3) a unit test will fail when the code under test changes (fails identicalness), 4) a unit test will fail based on changes in external dependencies. Another answer, the TDD answer, is "right before the code satisfying the test is written."

Let me clarify the difference between 1 and 2, correctness and consistency. A method like "addition" produces an answer that is either correct or not. For a method that produces a chunk of HTML it is harder to classify the output as correct or incorrect. For example, perhaps you want to add a new class to the output HTML. This output is still "correct", but it is no longer the chunk that used to be produced. So this output is "correct" while not being "consistent". The line between "correct" and "consistent" can be blurry. The line between "consistency" and "identicalness" can also be fuzzy.
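
A small sketch of the difference (add and renderBadge are made-up functions; Jasmine-style expect assumed):

    // Correctness: the expected value is right by the definition of the problem.
    expect(add(2, 3)).toEqual(5);

    // Consistency: the expected value is just whatever the code produced when
    // the test was written. Adding a class to the markup later would fail this
    // test even though the new output is still perfectly correct.
    expect(renderBadge('Ada')).toEqual('<span class="badge">Ada</span>');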

I believe that tests that fail in condition 1, correctness, are very useful. Tests that fail in condition 2, consistency, are somewhat useful. They are useful for refactoring, but they can fail on perfectly valid code and will force unit test updates.

Tests that fail in condition 3, identicalness, are very suspect in my mind. I am not sure what those tests are really testing. They detect code change. I think we can largely assume that unchanged code will continue to perform in the same way as it did before. A test that just fails when the code is changed detects something you already know is happening, so it doesn't seem very useful. You can make the argument that making it hard and costly to change code is valuable because it prevents you from introducing potentially breaking changes. I don't think making it hard to change code has any value. The value of unit tests should be that they make it easy to change code. If you want to make it harder to change code, place your keyboard on the ground and type with your toes. It will force you to think about every line you type.

Tests that fail in condition 4 are bad and need to be refactored. Tests that fail when nothing in the code under test changes are false positives. False positives create added maintenance for tests and unnecessary overhead. Unit tests need to be extremely isolated. We use dependency injection to allow unit tests to be completely isolated from other parts of the system. Unfortunately, we have tests that are sensitive to configuration changes and fail when we change unrelated things. I think that when we have a unit test failure we should either change the code under test or change the unit test. Changing other parts of the code does not fix the situation, and a failing test should prompt some sort of change to address the failure, whether that is refactoring the code or refactoring the test.

When does a test fail? Part two!
I talked about what causes tests to break, but when do tests really fail? They fail when you go back and modify the code. The situations where you go back and modify code can be roughly broken down into a few categories. 1) A defect is detected that was not detected by unit tests. 2) New features are being added or feature behavior is being changed. 3) Minor refactoring is being done. By minor I generally mean refactoring within method boundaries. 4) Major refactoring. By major I mean refactoring that structurally changes the code and shifts responsibility between methods or objects, either old ones or newly created ones.

When fixing defects or adding features, tests focusing on correctness will give you good information and detect whether a defect fix introduces a new problem. Tests focused on consistency will probably fail because behavior is changing. This information will probably not be useful since a behavior change is intended. The change in output may be a mistake made while coding, but it is more likely that the test is failing because the output intentionally changed. Tests for identicalness will fail because you have altered code. This is not useful feedback from the test.

For minor refactoring both correctness and consistency tests will give good information. Tests for identicalness will give bad information because your intent is for the code to change.

For major refactoring, almost all unit tests will fail and need to be updated. This is because the unit of test will change since methods will change and responsibility will be redistributed between the methods.

From this we get the sense that tests that focus on 'correctness' are very useful, while those focused on 'consistency' may or may not be useful. Tests that check 'identicalness' are probably not very useful.

Different Types of Tests
Here are a couple test patterns I have seen and wanted to mention.

Mirrors
Testing is often broken into black box and white box. In black box testing you write the tests without knowledge of the implementation (or pretending you don't have that knowledge). In white box testing you write tests based on what is in the code. I think black box tests are the best tests, but white box tests are often easier to write.

One kind of test is what I call a mirror test. The test is essentially a mirror of the code. I once worked on code developed by some contractors. These contractors had sold the product as fully unit tested <sarcasm> because surely fully unit tested code is going to be high quality </sarcasm>. Anyway, I modified some SQL code in a stored procedure and it triggered a unit test failure. This was obviously a "consistency" based test since the output was correct. I looked at the unit test. The unit test was a direct copy of the SQL in the stored procedure. The output of the unit test running the SQL was simply compared to the output of the stored procedure running the cut and paste SQL. So the test was built to fail if the functionality of the stored procedure changed. To fix the test you simply would paste your new SQL code into the test. This test seemed to not be very useful. You could make the argument that this test is useful if you wanted to refactor the stored procedure, but I think this was still a poor way to create such a test.

Another similar pattern I have seen is the mocking of all calls. Testing frameworks have been built to allow you to intercept and mock external calls. This is useful because you want to isolate code, but this tool can be used to essentially copy all the calls and make a test saying: this method works if it makes all the calls that it happens to make. This tends to be a test of identicalness. Say, for example, you discover a new method that works better for your purposes. The test then fails because you didn't make the exact same calls. This kind of test is like a mirror held up to all the calls of the method.
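
A sketch of what such a test tends to look like (save is a made-up function; Jasmine-style spies assumed):

    it('calls validate and then post', function () {
        var api = {
            validate: jasmine.createSpy('validate'),
            post: jasmine.createSpy('post')
        };
        save(api, { id: 1 });
        expect(api.validate).toHaveBeenCalled();
        expect(api.post).toHaveBeenCalled();
    });

If save() is later rewritten to use a better bulkPost() method, this test fails even though the new behavior may be just as correct.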

Checks for Consistent Output
One of the problems with generating tests is that a lot of code does not have 'correct' output, and it is often fairly data intensive. For example, you may have a piece of code that generates a large section of markup, or does a small manipulation on a large data set producing another large data set. Because of this, unit tests tend to be focused on 'consistency'. One of the patterns for generating such tests is to write the code and then use the output of the code as the test. This kind of test can be useful for method-based refactorings, but for most defect fixes or feature changes it isn't useful.

While it is generally better to make the unit testing process as close to the coding process as possible, with certain forms of refactoring it may make sense to generate tests like this right before refactoring to ensure that no behavioral changes are introduced in the refactoring process.
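
A sketch of this pattern (renderNameList is a made-up function; Jasmine-style assertions assumed):

    // The expected value is a snapshot of what the code produced before the
    // refactoring, saved as a literal in the test.
    var expectedMarkup = '<ul><li>Ada</li><li>Grace</li></ul>';

    it('still produces the pre-refactoring markup', function () {
        expect(renderNameList(['Ada', 'Grace'])).toEqual(expectedMarkup);
    });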

Trivial
At a previous job I saw tests on simple properties. These properties almost always had the form of simple getters and setters on a backing field. Since there were a lot of properties there was a lot of simple code like this. (This was in C# before there were auto-properties.) These tests had a form like "Foo = 1; Assert.AreEqual(Foo, 1)". This was done in pursuit of code coverage, but these tests were so trivial as to be meaningless. I don't think generating large amounts of essentially meaningless code is good. Making your codebase larger has a maintenance cost because maintainers have to sort through the code. Tests should be meaningful and not mindless boilerplate.

Another set of trivial tests I have seen is the testing of code that produced markup. The tests checked whether the code produced a non-empty string. These tests were most likely written to generate coverage numbers, but they were largely useless. Of course, the nice thing was that they weren't likely to fail.

Complex Tests
This really isn't a category of test, but rather an attribute of tests. We should strive to write simple unit tests. When complexity shifts into the tests, the likelihood increases that a failure is caused by a problem in the test rather than in the code under test.


What kind of code is good to test?
Above I talked about how writing tests created incentives to switch to certain patterns. The core pattern needed by testing is dependency injection. I did not talk about it much, but that is a core part of testability and modern programming. Use dependency injection. The other important pattern is pure functions. If code can be encapsulated in pure functions it is easy to test. Code with low coupling and complexity is also nice to test.

Since I talked a lot about that above, here I want to talk about a distinction I will refer to as 'logic' vs. 'glue'. When I refer to logic I mean code written to manipulate data. By glue I mean code focused on fetching data from or sending data to another component. I claim that 'logic' code is more complex and requires more testing, while 'glue' code is less complex and much more difficult to test. Now, many methods are not all glue or all logic, but all methods are open to refactoring, and if you can separate the 'logic' from the 'glue' it becomes much easier to write tests. This is actually one of the general goals of dependency injection and the Law of Demeter (minimize communication). These principles state that you should pass in the minimum a method needs, and everything should be passed in instead of generated inside the method. What this is actually doing is minimizing dependencies by removing the glue. Once all the glue is removed you only have logic. But I don't think DI is the only way to do this, and if we want to write tests we should try to isolate logic from glue by moving logic into separate methods. I also feel that this approach means you should make your 'glue' as simple as possible. If the glue part is very simple you have code that is simple to understand and hard to unit test, and tests for this code then have very little value. So I argue that instead of testing 'glue' we should isolate it, try to make it as simple as possible, and not mix it with 'logic'.
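
A small sketch of pulling the logic out of the glue (the names and the db object are invented for the example):

    // Mixed: fetching (glue) and calculation (logic) live in one function,
    // so testing the calculation requires mocking the data source.
    function loadOverdueTotal(db, customerId) {
        var invoices = db.getInvoices(customerId);   // glue
        var total = 0;                               // logic from here down
        for (var i = 0; i < invoices.length; i++) {
            if (invoices[i].overdue) { total += invoices[i].amount; }
        }
        return total;
    }

    // Separated: the logic is a pure function that is trivial to test, and
    // the remaining glue is too simple to be worth testing.
    function overdueTotal(invoices) {
        return invoices
            .filter(function (inv) { return inv.overdue; })
            .reduce(function (sum, inv) { return sum + inv.amount; }, 0);
    }
    function loadOverdueTotalSimplified(db, customerId) {
        return overdueTotal(db.getInvoices(customerId));
    }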

Code Metrics
It is very common for teams to measure how well they are unit testing things by the notion of "code coverage". People like metrics. I like metrics, especially ones that measure code complexity. The key thing with metrics is understanding what they actually measure. So what does code coverage actually measure? You might answer, 'How well your system is tested.' That is incorrect. Coverage measures the number of lines run during a test. You can take a system with 100% coverage, remove all the asserts, and you will still have 100% coverage while testing nothing. You can also have insufficient testing of a complex method where a couple of tests manage to cover all the lines, so it appears fully tested even though this is exactly where you should be doing more testing. You can also spend effort on trivial tests, as mentioned above, just to increase coverage.
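
To make the assert point concrete, here is a sketch of a "test" that produces full coverage of a function while testing nothing (overdueTotal is a made-up function; Jasmine-style syntax assumed):

    it('covers overdueTotal without actually testing it', function () {
        // Every line of overdueTotal runs, so coverage tools count it as
        // covered, but there is no assert, so this test can never fail.
        overdueTotal([{ overdue: true, amount: 10 }, { overdue: false, amount: 5 }]);
    });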

So 'coverage' as a goal in and of itself is bad. Coverage, as a tool, can help identify what code is run during unit tests and it can help you find places where it might make sense to add tests, but I think that when you find such a section you should be asking, 'does it make sense to test this?' or 'how can I refactor this code to isolate logic to make testing easier?'

If you use coverage as a metric for driving unit tests it will potentially direct you the wrong way. If you believe the assumption that some code lends itself more towards unit testing than other code, then you have to realize that aiming for code coverage can cause problems. Instead of writing extra tests for complex logic that might already be covered, you might write meaningless tests for code just because you haven't covered it.

Now, if you are a contractor and you sell code boasting it has 100% code coverage, it makes sense to write a bunch of poorly conceived tests just to obtain a high coverage percentage, because that is part of the product. Since our coverage metric isn't part of our product, we should not be bound by this.

Conclusion
I am not against unit tests. I am against blindly applying a technique because conventional wisdom has labeled it good. I believe that unit tests vary in quality. I believe that some code makes a lot of sense to test while some code makes little sense to test.

I have several recommendations in terms of testing.

Reduce Dependencies
We should reduce dependencies in our current unit tests. Some of our tests require Angular modules to be built. These tests are subject to failures due to reasons outside of the test or the code under test. In my opinion a test failure should almost always be met with changing the code or changing the test. In the case that something external causes a test failure, the test should be modified to be less fragile.

Unify Coding and Testing
Unit testing is most effective when combined with coding. The closer the unit testing process is to the coding, the more valuable it will be. We should not let ourselves get into the position of unit testing code we do not want to change. This happens when we generate code and release it to QA. We then do not want to change code that QA is testing, so our unit testing can no longer lead to refactoring.

Don't Test Everything - Isolate Glue
We should be testing complex logic and not glue. To make this easy we should try to isolate logic code and write tests for it. The glue code is hard to test and should be kept simple enough that tests for it wouldn't have much value.

Correctness > Consistency > Identicalness
The lines between these can get blurry, but in general our tests should test correctness as much as possible. Consistency tests are fine, but they are less valuable. Identicalness tests probably should be avoided.

Test for usefulness not for the sake of testing
I have seen many unit tests that have not been very useful but were written because testing was seen as inherently good. We should strive to write useful unit tests. When a useful test isn't possible, we should not write a bad test just for the sake of testing, because bad tests increase maintenance costs while offering no value.

