on 01-May-2020 01:34
That's a question someone asked me a while ago. While I already had a pretty good idea of the exec(), read() and write() system calls, I decided to investigate further and publish an article.
In this article, I'm going through what happens when we type the following command:
I'll be using the strace debugging tool to capture the system calls that a simple command such as this triggers.
For reference, the process ID (PID) of my bash shell process above is the following:
Also, if you don't know what system calls are, please refer to the Appendix 1 section.
It may seem silly, but this is the kind of knowledge that actually made me better at troubleshooting and contributed tremendously to my understanding of the Linux OS, as this pattern repeats over and over.
As you may know, BIG-IP runs on top of Linux.
First off, I'm using the strace command, which intercepts and prints the system calls made by a process¹ and the signals² received by that process.
If I didn't add the redirection 2>&1, the egrep command wouldn't filter anything, because the pipe only carries file descriptor (FD) 1 (stdout) while strace writes to FD 2 (stderr).
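The effect of 2>&1 can be reproduced in a few lines of Python. This is only a sketch: the child below is a hypothetical stand-in for strace that writes one line to stderr, and `captured_stdout` is a name made up for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for strace: a child that writes a single line to stderr (FD 2).
CHILD = [sys.executable, "-c", "import sys; sys.stderr.write('this line goes to stderr\\n')"]

def captured_stdout(merge_stderr):
    """Return what a reader of the child's FD 1 (the pipe) would see."""
    result = subprocess.run(
        CHILD,
        stdout=subprocess.PIPE,
        # stderr=STDOUT plays the role of the shell's 2>&1.
        # Without it we discard stderr here; in a real shell it would go to the terminal.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.DEVNULL,
    )
    return result.stdout.decode()

print(repr(captured_stdout(merge_stderr=False)))  # nothing reaches the pipe
print(repr(captured_stdout(merge_stderr=True)))   # the stderr line now flows through FD 1
```

Anything reading from the pipe, egrep in our case, only sees the line in the second call.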
Note that I'm attaching strace to my bash shell process ID (4716).
For this reason, I added the -f option to capture the shell's behaviour of creating a new child sub-shell process in order to execute ls.
It does that because if bash were to execute ls directly, the ls program would take over the resources of the current process (the bash shell), and we would not be able to go back to the shell once ls finished, because bash would no longer be there: it would have just been overwritten.
Instead, the bash shell creates a copy of itself by calling the clone() system call and then executes ls in that copy, so that ls is loaded into this new process and not into the parent process.
In fact, this new cloned process becomes ls.
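The same "copy myself, then become the command" pattern can be sketched in Python. This is an illustration rather than what bash does internally, and `run_in_child` is a hypothetical helper name. On a typical glibc-based Linux, os.fork() is expected to appear as clone() in strace.

```python
import os
import sys

def run_in_child(argv):
    """Fork a child, replace it with argv, and return the child's exit status."""
    pid = os.fork()                    # parent receives the child's PID, the child receives 0
    if pid == 0:
        try:
            os.execvp(argv[0], argv)   # on success this call never returns
        finally:
            os._exit(127)              # only reached if the exec failed
    _, status = os.waitpid(pid, 0)     # the parent (our "shell") waits for the child
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

if __name__ == "__main__":
    code = run_in_child(["ls", "."])
    print("child exited with", code, "and the parent is still here")
```

Because only the child is overwritten, the parent survives to print the last line, just as bash survives to show the prompt again.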
If you don't know what file descriptors and system calls are, please refer to Appendix 1 and 2 at the end.
I can't show its raw output because I've filtered it, but this is what I'm going to explain:
In fact, let's work on the output without the shared libraries:
By default, the shell prompt is written to FD 2 (standard error), which also prints to the terminal, just like standard output.
When I hit the letter l on my keyboard, the shell reads it from my keyboard and writes it back to the terminal:
Both read() and write() system calls receive:
What we see here is that read() reads from FD 0 (standard input, our keyboard) and write() then writes to FD 2 (standard error), which ends up printing the letter "l" in our terminal.
The return value is what's after the equals sign and for both read() and write() it's the number of bytes read/written.
If there were an error, the return value would be -1 instead.
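The return values are easy to observe with a pipe standing in for the keyboard and the terminal. This is a sketch with made-up helper names (`echo_one_key`, `write_error_errno`). Note one difference from C: Python's os.read() and os.write() raise OSError where the raw system call would return -1.

```python
import errno
import os

def echo_one_key(key):
    """Write one byte into a pipe and read it back, returning both byte counts."""
    read_end, write_end = os.pipe()          # stand-in for keyboard and terminal
    try:
        written = os.write(write_end, key)   # returns the number of bytes written
        data = os.read(read_end, 1)          # asks for at most 1 byte
        return written, len(data), data
    finally:
        os.close(read_end)
        os.close(write_end)

def write_error_errno():
    """Write to an already closed FD and return the errno Python reports."""
    read_end, write_end = os.pipe()
    os.close(read_end)
    os.close(write_end)
    try:
        os.write(write_end, b"l")            # the underlying write() fails with -1
    except OSError as exc:
        return exc.errno
    return None

print(echo_one_key(b"l"))                    # (1, 1, b'l')
print(write_error_errno() == errno.EBADF)    # True
```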
The clone() system call is used instead of fork() because fork() doesn't allow the child process to share parts of its execution context with the calling process, i.e. the one calling fork().
Modern Linux now mostly uses clone() because there are certain resources (such as the file descriptor table, the virtual memory address space, etc.) that are perfectly fine to be shared between parent ↔ child, so clone() ends up being more efficient in most cases.
So here, my Debian 10.x uses the clone() system call:
Up to this point, the new process is almost an exact replica of our bash shell, except that it now has stack memory of its own, as stack memory cannot be shared¹.
flags contains what is shared between the parent process (bash shell) and the new process (the sub-shell that will turn into "ls" command shortly).
The CLONE_CHILD_CLEARTID flag tells the kernel to clear (zero) the child's thread ID stored at child_tidptr when the child exits, and to wake up anything waiting on that address.
For this reason, the call also has to pass a memory address in child_tidptr=0x7f3ce765ba10 (this 0x.. value is the location in the child's memory where its thread ID is kept, not the address of the ls program itself).
The CLONE_CHILD_SETTID flag stores the child's thread ID (here the same as its PID) in the memory location referenced by child_tidptr.
Lastly, SIGCHLD is the signal that the parent process (the bash shell) will receive once the "ls" process terminates.
I had to filter out other system calls to reduce the complexity of this article.
There are other things that happen like memory mappings (using mmap() system call), retrieval of process pid (using getpid() system call), etc.
Except for the last 2 lines, which literally read a blank character from the terminal and then close it, I'd just ignore this bit, as it refers to file descriptors that were filtered out:
The important line here is this one:
In reality, execve() doesn't return upon success so I believe the 0 here is just strace signalling there was no error.
What happens here is that execve() replaces the current virtual address space (inherited from the parent process) with a new address space to be used independently by the ls program.
We now finally have "ls" as we know it loaded into memory!
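One way to convince yourself that exec replaces the program but keeps the process is to compare PIDs before and after. The sketch below uses a second Python interpreter as a stand-in for ls, and `pid_before_and_after_exec` is a name invented for this example.

```python
import os
import sys

def pid_before_and_after_exec():
    """Return (child PID seen by the parent, PID reported by the exec'd program)."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_end)
        os.dup2(write_end, 1)          # the child's stdout now feeds the pipe
        try:
            # Replace this process image with a fresh interpreter that prints its own PID.
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getpid())"])
        finally:
            os._exit(127)              # reached only if execv failed
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        reported = int(pipe.read())    # what the new program says its PID is
    os.waitpid(pid, 0)
    return pid, reported

if __name__ == "__main__":
    before, after = pid_before_and_after_exec()
    print("same process after exec:", before == after)
```

The program running inside the child changed completely, yet the PID did not, which is why the sub-shell can be said to become ls.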
The next step is for the ls command to list the contents of the directory we asked for.
In this case, we're listing the contents of the current directory, which is represented by a dot:
The openat() system call returns a new file descriptor (number 3) that refers to the current directory; ls reads the directory's contents through it and then closes it.
Contents are then written to our terminal using write() system call as shown above.
Note that strace truncates the printed list of entries, but it displays the correct number of bytes written (62 bytes).
If you're wondering why FD 3 is closed before ls writes its contents to FD 1 (stdout), keep in mind that strace output is not the actual ls code!
It's just the system calls, i.e. the points where the code needs the kernel to perform a privileged operation.
This snippet from ls.c, from the GNU coreutils package, shows that the ls code has a function called print_dir and that, inside this function, it uses the C library function opendir() to store the contents of the directory into a variable called dirp.
In reality, it's not the directory's content but a pointer to it.
The openat() system call is triggered when print_dir function executes opendir() as seen below:
The bottom line is that strace will only show us what is going on from the point of view of system calls.
It doesn't give us a complete picture of everything that's going on in ls code.
So to answer our question, the opendir() function only uses the openat() system call to get access to the contents of the current directory.
The program can then copy those contents into a variable and close the file descriptor immediately.
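The same open, copy, close, then print ordering can be sketched in Python. This is not how ls is implemented; `list_directory` is a hypothetical helper, and os.open() is assumed to show up as openat() in strace on a typical Linux system.

```python
import os

def list_directory(path="."):
    """Open a directory, copy its entry names into a list, close the FD, return the names."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)  # a new FD referring to the directory
    try:
        names = sorted(os.listdir(fd))                # entries are copied into memory here
    finally:
        os.close(fd)                                  # the FD is no longer needed
    return names

if __name__ == "__main__":
    names = list_directory(".")
    # Only now, with the directory FD already closed, is anything written to FD 1.
    os.write(1, ("  ".join(names) + "\n").encode())
```

Traced with strace, the close of the directory FD would be expected to appear before the write() to FD 1, matching the ordering seen for ls.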
After the program exits, the shell prints our terminal prompt back to us:
The Linux OS is responsible for managing devices, processes, memory and the file system.
It won't, or at least tries hard not to, let anything coming from user space disrupt the health of our system.
Therefore, for the most part, tasks like allocating memory or reading from and writing to files use the kernel as an intermediary.
So, even printing a hello world in C can trigger a write() system call to write "Hello World" to our terminal.
This is all I did:
And this is the output of strace filtering only write() system calls:
So think of system calls as a safe API that Linux uses to protect your computer's resources from programs and from end users such as us.
Every program comes with 3 standard file descriptors: 0 (standard input), 1 (standard output) and 2 (standard error).
These file descriptors live in a table called the file descriptor table, which tracks the open files of each process.
When our "Hello World" was printed above, the write() system call "wrote" it to file descriptor 1 (standard output).
By default, file descriptor 1 prints to terminal.
On the other hand, file descriptor 0 is used by read() system call.
I didn't hit enter here, but I just wanted to prove that read() takes file descriptor 0 as input:
It's reading from standard input (0), i.e. whatever we type on keyboard.
Standard error (2) is reserved for errors.
From FD 3 onwards, file descriptors are free for programs to use as they need.
When we open a file, it is assigned the lowest file descriptor number available, which might be 3 for the first file, 4 for the second file, and so on.
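The "lowest available number" rule can be checked with a short experiment. This is a sketch, with `reopen_gets_lowest_fd` being a name made up for the example. The actual numbers depend on what the process already has open, so none are shown here.

```python
import os

def reopen_gets_lowest_fd(path=os.devnull):
    """Open two files, close the first, and open again: the freed number is reused."""
    first = os.open(path, os.O_RDONLY)
    second = os.open(path, os.O_RDONLY)   # gets a higher number than first
    os.close(first)                       # frees the lower number
    third = os.open(path, os.O_RDONLY)    # the kernel hands out the lowest free FD
    os.close(second)
    os.close(third)
    return first, second, third

first, second, third = reopen_gets_lowest_fd()
print("second > first:", second > first)
print("third reuses first:", third == first)
```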