My last post described a Linux Loader for ARM Virtual Platforms. Taking a closer look at the code you will see that it's not completely reusable for any ARM design. One of the hard-coded things is the board id. The version I posted has a board id of 0x113, which happens to be for the ARM Integrator CP board. For another system, this field would have to be changed. For example, the Android Goldfish platform, which is not an actual board, but a hypothetical system modeled by the Android emulator, has a board id of 1441.
The following scenario would never happen to me (actually it did), but somebody else (not me) might enter the board id as 0x1441 instead of 1441. This small mistake leads to some useful learning about how the kernel startup process works.
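To make the size of that slip concrete, here is a minimal Python sketch. The helper name parse_board_id is hypothetical and is not part of the loader; it only mimics the common convention that a 0x prefix means hexadecimal.

```python
def parse_board_id(text: str) -> int:
    """Parse a board id string; a leading 0x means hexadecimal.

    Hypothetical helper for illustration only, not code from the loader.
    """
    return int(text, 0)


intended = parse_board_id("1441")    # the Goldfish id quoted above, in decimal
mistyped = parse_board_id("0x1441")  # the same digits with a stray 0x prefix

# Show both values in decimal and in hex to make the mismatch obvious.
print(f"intended: {intended} ({intended:#x})")
print(f"mistyped: {mistyped} ({mistyped:#x})")
```

The two values share their digits but are completely different numbers, so a loader carrying the mistyped one hands the kernel an id that has nothing to do with the intended platform.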
According to the Booting ARM Linux document I referenced last time, section 10 states that R1 must contain the ARM Linux machine type at the start of the boot. You can see in the loader model that the board id is placed in the bootloader code, which in turn loads it into R1.
If this value is not a recognized machine type, what happens?
It turns out the kernel doesn't boot. This is somewhat expected, but the unexpected part is that it's a little bit hard to find out why.
There is a good description of the boot process in Appendix A of the book Understanding the Linux Kernel, Third Edition, but it's for x86 machines.
The book defines the phases of the boot as:
Prehistoric Age: the BIOS
Ancient Age: the Boot Loader
Middle Ages: the setup() Function
Renaissance: the startup_32() Functions
Modern Age: the start_kernel() Function
Trying to use a source code debugger to see why the system doesn't boot is tricky because the machine type trouble happens somewhere in the Renaissance period, but the debugger doesn't really start providing source level debugging until the Modern Age, at start_kernel(). In this case the software never reaches start_kernel(), so there is no source level debugging available. All of this happens before the MMU is enabled and virtual addressing is in use, so the symbol table from vmlinux is not much use: it contains virtual addresses of the form 0xCXXXXXXX, while the boot process hangs before that point at a low physical address.
What can be done?
One way to find out why the system is hanging is to resort to assembly language debugging and attempt to find out where in the kernel source things are not working.
After running the system and then stopping it in a debugger, it is clear the software is stuck in a loop between addresses 0x8158 and 0x815C, looping forever.
Now the challenge is to find out why. Remember, at this point we don't yet know that the machine type was entered incorrectly, so we have no clue why the software is stuck.
The next step is to use the addr2line utility to find out where this address resides in the source. The kernel code virtual addresses are the same as the physical addresses except that the virtual addresses start with 0xC, so we can just add 0xC0000000 to 0x8158 and run addr2line:
$ arm-none-linux-gnueabi-addr2line -e vmlinux 0xc0008158
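The address arithmetic behind that command can be sketched in a few lines of Python. The constants are assumptions that match this particular setup (kernel virtual base at 0xC0000000, RAM starting at physical address 0); other platforms may use different values.

```python
PAGE_OFFSET = 0xC0000000  # assumed kernel virtual base for this setup
PHYS_OFFSET = 0x00000000  # assumed physical start of RAM for this setup


def phys_to_virt(phys: int) -> int:
    """Map a pre-MMU (physical) address to the virtual address in vmlinux."""
    return phys - PHYS_OFFSET + PAGE_OFFSET


def virt_to_phys(virt: int) -> int:
    """Inverse mapping, handy for setting breakpoints before the MMU is on."""
    return virt - PAGE_OFFSET + PHYS_OFFSET


# The two addresses of the loop seen in the debugger.
for pc in (0x8158, 0x815C):
    print(f"{pc:#x} -> {phys_to_virt(pc):#010x}")
```

The translated value is what gets passed to addr2line, since the symbol and line information in vmlinux is recorded against virtual addresses.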
Going to the source file and line that addr2line reports, we can see the exact loop we see in the debugger. After inspecting this assembly file and noticing the embedded error messages, which appear to report a problem with the machine ID, it seems the problem is with the board id supplied by the loader. There is even a function called __lookup_machine_type just below. Too bad the screen doesn't turn red as the comment indicates for the RiscPC (anybody still have one?). Maybe next time we can find where these message strings are hiding in memory and how to find them.
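As a rough mental model of what was observed, here is a simplified Python sketch. It is not the kernel's assembly: the table contents and the function early_boot are illustrative assumptions, and only the two board ids come from this post.

```python
from typing import Optional

# Board ids mentioned in this post. A real kernel only knows the machines
# it was configured to support, so this table is an illustrative assumption.
SUPPORTED_MACHINES = {
    0x113: "ARM Integrator CP",
    1441: "Android Goldfish",
}


def lookup_machine_type(r1: int) -> Optional[str]:
    """Return the machine name for the id passed in R1, or None if unknown."""
    return SUPPORTED_MACHINES.get(r1)


def early_boot(r1: int) -> str:
    """Model the early boot decision; early_boot is a hypothetical name."""
    machine = lookup_machine_type(r1)
    if machine is None:
        # In the debugger the kernel was seen spinning in a tight loop here;
        # the model just reports the condition instead of looping forever.
        return f"error: unrecognized machine id {r1:#x}, hanging"
    return f"booting on {machine}"


print(early_boot(1441))    # the id that was intended
print(early_boot(0x1441))  # the id that was actually entered
```

With the mistyped id the lookup finds no match, which is consistent with the hang happening long before start_kernel() is reached.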
As I often tell my kids, the best way to learn is by making mistakes and trying to correct them.
The second moral of the story is to pay attention to the difference between decimal and hex. Last year when I had my birthday I told everybody I was 29. After a lot of strange faces, I continued to insist I was 29, in hex!
Have a Merry Christmas and Happy New Year.