Over the years, when faced with a large body of code, I have approached it as I approach any technical reading task: as an exercise in "concept extraction" - a De Bono simplicity technique. The essence of the technique is to reduce the sheer size of the problem to something manageable - a variation on the "how do you eat an elephant" theme. In this post I will deal with reading very large amounts of code. Reading a small amount of code (under 10,000 lines) line by line has its own challenges, and is best done in an IDE such as Eclipse, using its browsing, searching, and cross-referencing facilities, together with some reverse engineering of classes and method executions. But before we get there, we have to deal with the problem of size.
First, some anecdotal background to set the stage for why a code reading methodology is needed. The largest code reading task I had was about 2 million lines of "legacy" Java code (about 5 years' worth of development by about 200 developers) that I had to wrap my mind around when I joined a project at a large financial company. The documentation (although I was given three binders of it) was a couple of years behind the code, and I was told by the chief architect (who hired me) that if I read it, I would get the wrong idea about what the software does. I was part of the software architecture group (one chief architect, one system architect, and 15 subsystem architects). When I was hired, I was promised 2 hours a day with the system architect for a month, and any time I needed from any architect in the group. I was also promised time with each of the 15 team leads and 15 project managers (each subsystem had a subsystem architect, a team lead, and a subsystem project manager, as well as a build master). After a month on the job, the total time I was able to get from all of these people combined was 3.5 hours. From the system architect, whose boss hired me, I got exactly half an hour - and that remained the average (half an hour per month) for my entire 19-month stint on that project. So the short version of this story is: you can't rely on "knowledge transfer" from others to get to know the system.
My job, with an official title of "Subsystem Architect", was really "Performance Analyst". I was to be the software architects' representative to the performance testing and engineering teams (2 distinct teams: one does the testing, the other does performance modeling and capacity planning). The performance test leads, performance testers, and performance engineers looked to me for "application knowledge". Since I was hired from the outside and had never worked on this project, I needed to ramp up very rapidly, and reading the source code was the only way to do it.
Documentation was old. People were not available, and were not allowed to be available. I was not a priority to any of them. The urgency of the task was made clear when I was told that I was part of a 13-person "performance triage team" that had a "check-in" meeting every day at 10:30 am. Each and every day for three months! I was the "look to" guy for "application knowledge". I was greener than broccoli and colder than an ice cube! You do the math on pressure per square inch.
Faced with 2 million lines of code, documentation you were told to avoid because it was misleading, and UML that is 2 years behind (they had a full Rose model), how do you learn the system?
The obvious approach, reading the code line by line using a text editor or an IDE, will not be much help. It rapidly becomes overwhelming and leads to frustration and discouragement, and eventually to abandoning the effort. Reading line by line comes at a later stage, after the scope of the task has been narrowed.
So here was my general approach, step by painful step:
Step 0. Preliminary: Use the application. Get familiar with what the application does from its user's point of view. Understand the operating concepts. If you can get time with key people, that would be great. In my case I got half an hour with the system architect, and the URL for the app. The architect did a one-page sketch on an 8 1/2 by 11 sheet, saying that the system is basically very simple: we have to buy loans and contracts from banks - here is how it flows. The GUI had 110 use cases - but the basic ones number fewer than a dozen.
Step 1. Inventory the code. What do we have?
- Number of files
- Total lines of code
- Number of classes
- Average size of class (methods per class)
- Number of methods
- Average size of method (lines per method).
I used some freely available tools (Understand for Java was one, for its trial period). I ended up just writing simple scripts to do the counting.
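The counting scripts do not have to be sophisticated. Here is a minimal sketch in Python (not the scripts I actually used); the regular expressions are rough approximations of Java syntax, and the function names `read_java_tree` and `inventory` are made up for illustration. It misses constructors and methods without modifiers, and "lines per method" is simply total lines divided by method count, so treat the numbers as ballpark figures.

```python
import re
from pathlib import Path

# Rough patterns: good enough for counting, not a real Java parser.
CLASS_RE = re.compile(
    r"^\s*(?:(?:public|protected|private|abstract|final|static)\s+)*"
    r"(?:class|interface)\s+(\w+)",
    re.MULTILINE,
)
# Requires at least one modifier, so package-private methods and
# constructors are not counted (an accepted inaccuracy of this sketch).
METHOD_RE = re.compile(
    r"^\s*(?:(?:public|protected|private|static|final|abstract|synchronized)\s+)+"
    r"[\w<>\[\], .]+?\s+(\w+)\s*\([^;{)]*\)\s*(?:throws\s+[\w., ]+)?\s*\{",
    re.MULTILINE,
)


def read_java_tree(root):
    """Map each .java file under root to its text."""
    return {
        str(path): path.read_text(encoding="utf-8", errors="replace")
        for path in Path(root).rglob("*.java")
    }


def inventory(sources):
    """sources maps file name -> file text; returns raw counts and averages."""
    files = len(sources)
    lines = sum(len(text.splitlines()) for text in sources.values())
    classes = sum(len(CLASS_RE.findall(text)) for text in sources.values())
    methods = sum(len(METHOD_RE.findall(text)) for text in sources.values())
    return {
        "files": files,
        "lines": lines,
        "classes": classes,
        "methods": methods,
        "methods_per_class": methods / classes if classes else 0.0,
        # Crude proxy: total lines divided by number of methods found.
        "lines_per_method": lines / methods if methods else 0.0,
    }
```

Typical use would be `inventory(read_java_tree("path/to/src"))`, printing the resulting dictionary once per subsystem.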
Step 2. Get an overview of the source code tree. What is there other than Java files? Config files, metadata files, XML, HTML, ...
Step 3. Understand the build process. What are the build units (build chunks/subsystems/deployment packaging/...)?
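One quick way to answer "what is there other than Java files?" is to tally the tree by file extension. The sketch below is in Python; the function name `count_by_extension` is mine, and it assumes the extension is a good enough proxy for file type.

```python
from collections import Counter
from pathlib import Path, PurePath


def count_by_extension(paths):
    """Tally file paths by lower-cased extension, most common first."""
    counts = Counter()
    for path in paths:
        ext = PurePath(path).suffix.lower() or "(none)"
        counts[ext] += 1
    return counts.most_common()


def list_files(root):
    """All regular files under root, as strings."""
    return [str(p) for p in Path(root).rglob("*") if p.is_file()]
```

Calling `count_by_extension(list_files("path/to/src"))` gives a list of (extension, count) pairs that usually fits on one screen.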
Step 4. Inventory the databases
- Number of databases
- Number of distinct schemas
- Tables per schema
- Average size of tables (columns per table)
I usually produce a "distilled schema" in this format: <tablename>(column1, column2, ...), one line per table (only about six or seven columns - the pk plus a few more). Essentially, a table to me is <tablename>(primary key, info). You can put most databases on one or two pages. It is useful to show foreign keys as part of the info. A very large database (several hundred tables) would reduce to a few pages.
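As an illustration of the distilled format, here is a small Python sketch that turns one table's metadata into a single line. How you get the metadata (DDL files, the database catalog) is left out; the function name `distill_table`, the `->` notation for foreign keys, and the example table are all assumptions of mine.

```python
def distill_table(name, columns, primary_key, foreign_keys=None, max_columns=7):
    """Return one line: name(pk, fk -> target, other columns, ...)."""
    foreign_keys = foreign_keys or {}
    # Primary key first, then foreign keys, then whatever else fits.
    ordered = list(primary_key)
    ordered += [c for c in columns if c in foreign_keys and c not in ordered]
    ordered += [c for c in columns if c not in ordered]
    shown = ordered[:max_columns]
    parts = []
    for col in shown:
        if col in foreign_keys:
            parts.append(f"{col} -> {foreign_keys[col]}")
        else:
            parts.append(col)
    if len(ordered) > len(shown):
        parts.append("...")  # signal that columns were left out
    return f"{name}({', '.join(parts)})"


# Hypothetical table, for illustration only.
print(distill_table(
    "loan",
    ["loan_id", "bank_id", "amount", "status"],
    primary_key=["loan_id"],
    foreign_keys={"bank_id": "bank"},
))
```

Run over every table and sorted by name, the output is the one-or-two-page summary described above.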
Step 5. Inventory the main concepts. The database schema is the best source for that. If there is a domain model in UML that would be very nice. Distill the concepts down as much as possible. Most systems narrow down to about a dozen major concepts, while there may be hundreds of tables in the schema, and hundreds of classes in the domain model.
These basic preliminaries give you the big picture. The very big picture. I usually write scripts (mainly in AWK and Korn Shell - even on a PC with MKS toolkit - can't live without it). The scripts filter and summarize the code down to a digestible size.
A most useful script is one that extracts class names and sorts them by suffix. Most systems have class name patterns similar to the following:
XxxxService
XxxxBusinessController
XxxxFlowController
XxxxHandler
XxxxHelper
XxxxBusinessObject
XxxxLocator
XxxxAdapter
XxxxFactory
XxxxAction
XxxxForm
XxxxFacade
XxxxBean
XxxxDAO
XxxxDTO
...
XxxxFoo
XxxxBar
XxxxThing
XxxxEntity
XxxxWhatever
....
I like to use AWK embedded in Korn Shell, but you can write a Java program embedded in an Ant script, or use your favorite scripting language - Perl, Jython, whatever. If scripting gets too hard, I put the code in a database and analyze it with SQL. On one project I had code for 1000 screens written in a proprietary HP 4GL language that I did not know. The language was pretty declarative and quite elegant. I wrote a simple AWK parser and loaded the code into a relational database whose schema reflected the structure of the 4GL language. Then I wrote a GUI to browse through the code and extract requirements for the new system. The purpose of the project was to port the system from the proprietary language to Forte - an OO 4GL.
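To give a flavor of the "put the code in a database" idea, here is a much simpler sketch than the 4GL project: it loads source text line by line into an in-memory SQLite database and asks a question in SQL. The table name `source_line` and the function names are invented for this example, and a schema that mirrors the language's structure would be far more useful than this flat one.

```python
import sqlite3


def load_lines(conn, sources):
    """Load {file name: text} into a flat table, one row per source line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS source_line "
        "(file TEXT, line_no INTEGER, text TEXT)"
    )
    rows = [
        (name, line_no, line)
        for name, text in sources.items()
        for line_no, line in enumerate(text.splitlines(), start=1)
    ]
    conn.executemany("INSERT INTO source_line VALUES (?, ?, ?)", rows)
    conn.commit()


def files_mentioning(conn, needle):
    """Files ranked by how many lines contain needle (case-sensitive)."""
    cursor = conn.execute(
        "SELECT file, COUNT(*) AS hits FROM source_line "
        "WHERE instr(text, ?) > 0 "
        "GROUP BY file ORDER BY hits DESC, file",
        (needle,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load_lines(conn, {"A.java": "LoanDAO dao;\nint x;", "B.java": "int y;"})
print(files_mentioning(conn, "DAO"))
```

Once the code is in tables, questions like "which files touch this concept most?" become one-line queries instead of new scripts.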
Once you've sorted the class names by suffix, you can determine the number of distinct types of classes the system has. This list of suffixes is a great help in classifying the major concepts (and patterns) used to implement the system. Since there is a tremendous amount of repetition in the implementation (a major implementation pattern is usually repeated hundreds of times), this summarization alone will reveal most of the system's secrets. In the financial system, there was a basic pattern of BizController, BizHelpers, BOs, DAOs, and DTOs, repeated 110 times, once for each use case. The architecture standards did not allow deviation from the basic pattern. I could have been saved tens of hours if one of the architects had clued me in. But, as explained earlier, no one had time to meet with me. Actually, after I arrived at this pattern that most use cases used, and presented it to the architects, they were in denial! "No way, our system is vastly more complex than that!" Another side lesson you learn, when you start valuing simplicity, is that the "complexity priesthood" will not like what you say! They have a vested interest in complexity. They are the guardians of complexity. If things were suddenly to become simple, they could be out of a job!
The output of this script, a list of the distinct suffixes plus the supporting detail list of all the classes, becomes "the index" to the system, and a guide to reading it.
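Here is a sketch of such a suffix script in Python rather than AWK. It assumes the suffix is the last CamelCase word of the class name (so "LoanDAO" gives "DAO"), which means XxxxBusinessController and XxxxFlowController both land under "Controller"; a real script would probably check a list of known multi-word suffixes first. The names `suffix_of` and `suffix_index` are my own.

```python
import re
from collections import defaultdict

# Rough match for class/interface declarations; not a real Java parser.
CLASS_DECL = re.compile(r"\b(?:class|interface)\s+([A-Z]\w*)")
# Last CamelCase word: either "Service"-style or an acronym like "DAO".
LAST_WORD = re.compile(r"(?:[A-Z][a-z0-9]+|[A-Z]+)$")


def suffix_of(class_name):
    """Return the trailing CamelCase word, or the whole name if none."""
    match = LAST_WORD.search(class_name)
    return match.group(0) if match else class_name


def suffix_index(sources):
    """sources maps file name -> text; returns [(suffix, [class names])]."""
    index = defaultdict(list)
    for text in sources.values():
        for name in CLASS_DECL.findall(text):
            index[suffix_of(name)].append(name)
    # Most common suffix first; ties broken alphabetically.
    return sorted(
        ((suffix, sorted(names)) for suffix, names in index.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )
```

Printing one line per suffix with its count gives the short list of class types; printing the names under each suffix gives the detailed index.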
Step 6. You can use reverse engineering tools to study the classes. I find the simplest UML the most helpful (just class names, no attributes or methods). For the main use cases, a sequence diagram of the major methods can be helpful - if you have a good reverse engineering tool that can reduce the size of the diagram (which can be overwhelmingly large - some tools will choke on it).
Step 7. Once you have put your arms around the big picture and you understand the "pattern of patterns", so to speak, you can start reading selectively. This is the time to fire up the IDE and start reading. I usually do that one use case at a time - guided by the system's operational concept, or its UI.
Step 8. My main reading goal still remains "concept extraction". What is this trying to do? Why is it doing it this way?
Step 9. One thing that can help is "exploratory testing". Read using the IDE and JUnit tests. Write test cases to explore what the code is doing. This may get complex if the code has external dependencies (accessing external services, writing to databases). You may have to use mock objects to fake any external dependency.
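The shape of an exploratory test is the same in any language. To stay consistent with a scripting language, the sketch below uses Python's unittest and unittest.mock rather than JUnit. `LoanPricer` and its rate service are hypothetical stand-ins for a legacy class and its external dependency, not code from the system described here.

```python
import unittest
from unittest.mock import Mock


class LoanPricer:
    """Hypothetical stand-in for a legacy class under study."""

    def __init__(self, rate_service):
        self.rate_service = rate_service

    def price(self, principal):
        rate = self.rate_service.current_rate()  # external call in real life
        return round(principal * (1 + rate), 2)


class ExploreLoanPricer(unittest.TestCase):
    """Each test records one thing learned about the code's behavior."""

    def test_price_adds_one_period_of_interest(self):
        rates = Mock()  # fake the external dependency
        rates.current_rate.return_value = 0.05
        pricer = LoanPricer(rates)
        self.assertEqual(pricer.price(1000), 1050.0)
        # Also confirms how the dependency is actually used.
        rates.current_rate.assert_called_once_with()
```

Run with `python -m unittest` against the file that holds it. Each guess about the code becomes an assertion: if it passes, you have learned something; if it fails, you have learned something more interesting.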
Step 10. Another useful approach is "mock refactoring". While reading the code, if it is hard to read, I go ahead and refactor it for readability. The most important refactorings are "extract method", "move method", and "introduce parameter object". Since the body of code is read-only for me (I won't check it back into source code control), I am free to slice it, dice it, and refactor it wherever it is poorly written.
I think a lot of these steps can be programmed as Eclipse plug-ins. If you know of any, please let me know.