Understanding Hello World – on Raspberry Pi

Like so many other people, over the last few weeks of lockdown I’ve been using the time that I’m no longer spending commuting to learn some new skills that I’ve always meant to get around to. In particular, I’ve been teaching myself ARM assembly language programming.

I’ve always been more interested in the low-level side of computer programming than I have been in applications. I’ve always (even as a child) wanted to know how the magic box that is the computer actually works. I’m not new to assembly programming: I’ve done all kinds of odds and ends of assembly over the years, from (emulated) 6502 on the KIM-Uno, to the real thing on a BBC Model B; as well as more recently taking a course looking at the (also emulated) Atari 2600, and following along with Ben Eater’s fabulous breadboard 6502 series. The observant amongst you will note a common thread there – the iconic 6502 processor (yes; I was an ‘80s kid). In addition to this, I have also done some assembly programming for PIC microcontrollers, and (rather more years ago than I’d care to admit!) back when I was doing my A-Levels in college I learnt a little x86 too; but I’ve never done anything much with ARM CPUs. Given my professional interests in the Internet of Things & cybersecurity, I thought that learning ARM assembler would be fun (and potentially useful in the future too, perhaps).

Interacting with the Operating System

Unlike some of the older computer systems that I mentioned earlier, the vast majority of microprocessors (and even some microcontrollers) today run some sort of operating system. The point of an operating system is easily missed today because of their ubiquity; but they essentially exist to provide services to user programs. Without an OS, every application would have to include its own code to drive all of the peripheral devices: vital functionality such as reading a keyboard, displaying things on a screen, and providing an abstraction of storage devices to enable users to work with the concept of files.

In 2020 there are basically two types of OS in regular use: those that comply with the POSIX (Portable Operating System Interface) standard (as implemented in Linux, macOS and BSD, and to some extent in iOS and Android too); and Windows. Given that I am not a Windows user (apart from Office stuff at work), the choice here was easy. So for my learning environment, I pulled out a Raspberry Pi (running Raspbian Linux), with its ARM Cortex-A72 processor.

I’ve heard it said that in order to learn to program in C, you need to understand everything before you can do anything; I’m not sure that’s really true for C, but I’d argue that it’s undeniably true for assembly language. When learning to program in a new language, the canonical introductory program is Hello World; but in many assembler books it’s often one of the very last things you’ll see. There are some good reasons for that: not least of which is that there’s almost no abstraction from the underlying OS when it comes to printing a message on the screen. Since we don’t have a handy print() function, we have to set about directly requesting that the OS shows our message to the user.

Hello, World; and down the rabbit-hole…

Since ultimately it’s the OS that’s going to be showing our message (regardless of the language we’re programming in) – we can start with a slightly easier problem – and work backwards to where we want to be… so let’s start with a simple Hello World program in C.

// hello.c
#include <stdio.h>

int main()
{
	printf("Hello World!\n");
	return 0;
}

If you’ve ever written any C code before – you’ll immediately see that this is probably the simplest (useful) program that you can write. Now, of course, when we write a program in C we have to compile it – turning it from the nicely human-readable C-code, into actual machine-code: the binary 1s and 0s that the processor itself can actually execute.

Given that there is (by definition) a one-to-one mapping between assembler mnemonics, and machine-code instructions – we can turn any executable programme into its assembly code equivalent by simply disassembling it. So let’s have a go at that…

If we build the code with gcc hello.c -o hello, we get an executable which does what we’d expect; and we can disassemble it using objdump -d hello.

If we look at that, we’ll see that there is a lot of assembly code: far more than we perhaps might expect for a simple program – and most of which (unhelpfully for us) isn’t really anything to do with our Hello World.

Regardless of the processor architecture – there are a few things that any CPU has in common. There will be one or more (usually more!) registers (tiny memory locations inside the CPU itself) which can be used to store data and perform operations on it; there’ll be some instructions to load data into those registers (either directly from the code, other registers, or from memory), and to store the data back into memory; and some control instructions to allow for things like branches and loops.

With this in mind, we can find the section corresponding to our main() function (<main> in the disassembly): we’ll see that (amongst other things) it looks like it calls (branches to – using the bl instruction) something called puts (which might ring a bell if we’ve ever used the C library function puts()).

00010404 <main>:
   10404:       e92d4800        push    {fp, lr}
   10408:       e28db004        add     fp, sp, #4
   1040c:       e59f000c        ldr     r0, [pc, #12]   ; 10420 <main+0x1c>
   10410:       ebffffb3        bl      102e4 <puts@plt>
   10414:       e3a03000        mov     r3, #0
   10418:       e1a00003        mov     r0, r3
   1041c:       e8bd8800        pop     {fp, pc}
   10420:       00010494        .word   0x00010494

Since we’re expecting the code to ask the OS to print the message for us, let’s now use the Linux strace command to see the system call that our code ends up making: using strace ./hello.

Again there’s lots of output which isn’t really relevant; but right at the bottom we’ll see something that finally looks like it might be starting to be useful…

write(1, "Hello World!\n", 13Hello World!)          = 13

(As an aside, if we build our program with the -static compiler flag – e.g. gcc hello.c -static -o hello – then we’ll have a much shorter output, as the program has fewer external library calls to make. And we’ll come back to this in a moment).

So our original printf() function call has eventually been mapped down to a write() system call (with the printing happening as it’s called, in the middle of the strace line: hence the slightly messy output). Tidied up, the call looks like this:

write(1, "Hello World!\n",13)=13

Let’s have a look at the man page for the write() call.

I should note that, somewhat unhelpfully for us, there’s also a write command – which is what gets presented by default – so to find what we’re actually looking for we need to use: man 2 write.

The man page for the write command is in section 1; whereas the one we want is in section 2.

Once we have the right page (in the Linux Programmer’s Manual), we see the write() function – which is C’s wrapper for the underpinning system call.

       write - write to a file descriptor
       #include <unistd.h>
       ssize_t write(int fd, const void *buf, size_t count);
       write()  writes  up to count bytes from the buffer 
       starting at buf to the file referred to by the file
       descriptor fd.

Now that we have this knowledge, we can drop a bit further down the metaphorical stack – and try calling this function directly.

This function is from unistd.h, and that’s properly into POSIX-only territory. The unistd.h header is at the heart of the POSIX interface – essentially defining the C wrapper to the underpinning OS’s built-in routines to actually do stuff.

So here’s our new C code example:

// hello2.c
#include <unistd.h>

int main()
{
	write(1, "Hello World!\n",13);
	return 0;
}

Note that we have to pass a couple of additional parameters to the write() function compared with printf(): the file descriptor for the file we want to write to; and the length of the string we want to write. In Linux pretty much everything is a file: including the display or stdout as it’s known; and stdout has the file descriptor of 1.

If we build and run this – we see that we get the same output as before… So far, so good. So let’s have another look with objdump: objdump -d ./hello2

This time the resulting output is smaller – but there’s still plenty of it.

Our <main> section this time contains a branch to something else – write@plt.

00010404 <main>:
   10404:       e92d4800        push    {fp, lr}
   10408:       e28db004        add     fp, sp, #4
   1040c:       e3a0200d        mov     r2, #13
   10410:       e59f1010        ldr     r1, [pc, #16]   ; 10428 <main+0x24>
   10414:       e3a00001        mov     r0, #1
   10418:       ebffffb7        bl      102fc <write@plt>
   1041c:       e3a03000        mov     r3, #0
   10420:       e1a00003        mov     r0, r3
   10424:       e8bd8800        pop     {fp, pc}
   10428:       0001049c        .word   0x0001049c

Similarly; if we use strace we see another call to that write() function.

So let’s build a statically linked version of the program, so that we can see more of what’s going on. As I mentioned previously, this will result in a version of the program being built which contains its own copy of all of the library code it needs – this will be much larger than the default, dynamically linked, version – but it should show us more of what’s going on. To do this we use: gcc hello2.c -static -o hello2.

As you can see the new version is nearly 62x larger – which is one reason why static linking isn’t the default!

If we run it, it still gives the same output as before – but if we look at it in objdump we see that we are nearing our destination – as we’re now seeing the write function within the core C library itself.

000104cc <main>:
   104cc:       e92d4800        push    {fp, lr}
   104d0:       e28db004        add     fp, sp, #4
   104d4:       e3a0200d        mov     r2, #13
   104d8:       e59f1010        ldr     r1, [pc, #16]   ; 104f0 <main+0x24>
   104dc:       e3a00001        mov     r0, #1
   104e0:       eb005b84        bl      272f8 <__libc_write>
   104e4:       e3a03000        mov     r3, #0
   104e8:       e1a00003        mov     r0, r3
   104ec:       e8bd8800        pop     {fp, pc}
   104f0:       0005e6d8        .word   0x0005e6d8

If we really wanted to keep going, we could dig into the disassembly of libc (which on this Raspberry Pi lives in /lib/arm-linux-gnueabihf/libc.so.6). If we do that we eventually see <__write@@GLIBC_2.4> defined, which contains the line of assembler that shows we’re at the bottom of our rabbit hole. We see the assembly instruction svc 0x0000...; this is the supervisor call where we actually hand control from our program to the OS to do the printing.

Linux System Calls

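The POSIX standard defines the interfaces (such as write()) that we can use; the system calls (or syscalls) that sit underneath them, and the numbers used to identify them, are specific to the kernel and the processor architecture. The latest version of the standard can be found here: https://pubs.opengroup.org/onlinepubs/9699919799/; but it’s not exactly easy reading; we can see a far easier list of the available system calls for Linux here: https://syscalls.kernelgrok.com/.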

Right near the top of the list we see #4 – sys_write; the parameters for which correspond to our write() function (as we’d expect, given that write() is just a wrapper for the syscall).

So let’s now go back to C and skip the middle-man, and call the system directly. The C library provides us with a syscall() function to do this – so let’s take a look at it: man syscall

SYSCALL(2)         Linux Programmer's Manual         SYSCALL(2)
       syscall - indirect system call
       #define _GNU_SOURCE      /* See feature_test_macros(7) */
       #include <unistd.h>
       #include <sys/syscall.h> /* For SYS_xxx definitions */
       long syscall(long number, ...);
       syscall() is a small library function that invokes the 
       system call whose assembly language interface has the 
       specified number with the specified arguments.  
       Employing syscall() is useful, for example, when invoking
       a system call that has no wrapper function in the C 

As we saw from the syscall table, there are four things we need to provide: the number of the syscall we want to use, the file descriptor of the place we’re writing to (here, the number 1 corresponding to stdout), the string itself, and the length of that string.

#include <unistd.h>
#include <sys/syscall.h>

int main()
{
	syscall(SYS_write, 1, "Hello World!\n",13);
	return 0;
}

Compiling and running it, we see (once again) the same output.

Incidentally, if you could be bothered to track down sys/syscall.h and follow the chain of headers that it includes (it’s quite a journey, so I’ll spare you the trouble), you’d eventually see that SYS_write is defined as having a value of 4 (which isn’t surprising, given that it is the syscall number shown in the table we saw earlier).

So we can (if we want to write really non-portable code) use:

#include <unistd.h>
#include <sys/syscall.h>

int main()
{
	syscall(4, 1, "Hello World!\n",13);
	return 0;
}

Which will again do exactly the same thing.

Turning it back into assembly

If we build using the -static flag again, and use objdump -d one more time, we’ll finally see what is going on, and how the information gets passed to the OS; and finally get to see the actual syscall in the disassembly.

000104cc <main>:
   104cc:       e92d4800        push    {fp, lr}
   104d0:       e28db004        add     fp, sp, #4
   104d4:       e3a0300d        mov     r3, #13
   104d8:       e59f2014        ldr     r2, [pc, #20]   ; 104f4 <main+0x28>
   104dc:       e3a01001        mov     r1, #1
   104e0:       e3a00004        mov     r0, #4
   104e4:       eb005ee9        bl      28090 <syscall>
   104e8:       e3a03000        mov     r3, #0
   104ec:       e1a00003        mov     r0, r3
   104f0:       e8bd8800        pop     {fp, pc}
   104f4:       0005e718        .word   0x0005e718


00028090 <syscall>:
   28090:       e1a0c00d        mov     ip, sp
   28094:       e92d00f0        push    {r4, r5, r6, r7}
   28098:       e1a07000        mov     r7, r0
   2809c:       e1a00001        mov     r0, r1
   280a0:       e1a01002        mov     r1, r2
   280a4:       e1a02003        mov     r2, r3
   280a8:       e89c0078        ldm     ip, {r3, r4, r5, r6}
   280ac:       ef000000        svc     0x00000000
   280b0:       e8bd00f0        pop     {r4, r5, r6, r7}
   280b4:       e3700a01        cmn     r0, #4096       ; 0x1000
   280b8:       312fff1e        bxcc    lr
   280bc:       ea000ac7        b       2abe0 <__syscall_error>

From here we have nearly everything we need to do this in assembler from scratch.

Since we’re trying to build the minimal working example, we’ll start by stripping out all of the parts that we don’t absolutely need out of the disassembly…

As I said before, all assembly language programming, regardless of the architecture, is about getting data from one place or another and storing it in one of our available registers. In ARM we have registers numbered from r0 to r15 (although some of these have special purposes). Looking at our code, we can see the move instruction mov is used a lot – with the destination first, then the source. We can see that we’re putting the length of the message into r3, the file descriptor for stdout into r1; and the syscall number into r0. We’re also loading (ldr) the address of the message into r2.

000104cc <main>:
   mov     r3, #13
   ldr     r2, [pc, #20]   ; 104f4 <main+0x28>
   mov     r1, #1
   mov     r0, #4
   bl      28090 <syscall>

00028090 <syscall>:
   mov     r7, r0
   mov     r0, r1
   mov     r1, r2
   mov     r2, r3
   svc     0x00000000

We can also see that most of the <syscall> section is just moving data from one register to another before it’s used. So, given that, we can just put the data into the right registers to start with…

   mov     r2, #13
   ldr     r1, [pc, #20]   ; 104f4 <main+0x28>
   mov     r0, #1
   mov     r7, #4
   svc     0x00000000

Now this is more like it! We’re down to just five lines of assembler: but to actually build it (and to actually define the message we want to print!) – we need to add just a little more code around it.

@ hello_a.s

	.section .rodata
msg:
	.ascii "Hello World!\n"
	.align 2

	.text
	.global _start
_start:
	mov r0, #1 		@ stdout
	ldr r1, =msg	@ location of msg
	mov r2, #13		@ length of msg
	mov r7, #4 		@ write
	svc #0

	mov r7, #1		@ exit syscall
	mov r0, #0		@ return value
	svc #0

And so, now, we finish up with what is (I think) the minimum working example of a Hello World program in ARM assembler.

We start by defining our message (here labelled as msg) in a “read-only data” section (followed by an .align 2 directive, which on ARM pads to the next 2^2 = 4-byte boundary), then we define our entry-point _start, which contains our short program. I’ve re-ordered the instructions into a more sensible order too. Lastly, to finish, we make a second syscall (#1) to exit the program cleanly. This latter syscall (also known as sys_exit) has just one parameter, which we set in r0 – and that’s the return value of the program. As with the standard way of working on POSIX systems, a 0 denotes a successful execution, and anything else represents an error condition.

And that’s it. Hello World in ARM assembler. Next time, we’ll look at some other syscalls, and have a go at writing a useful (if wildly impractical) program in assembler.
