systemcalls
1 TopicWhat really happens under the hood when we type 'ls' on Linux?
Quick Intro That's a question someone asked me a while ago and while I had a pretty good idea of exec(), read() and write() system calls, I decided to investigate further and to publish an article. In this article, I'm going through what happens when we type the following command: I'll be usingstracedebugging tool to capture the system calls a simple command such as this triggers. For reference, the process ID (PID) of my bash shell process above is the following: Also, if you don't know what system calls are, please refer to the Appendix 1 section. It may seem silly but this is the kind of knowledge that actually made me better at troubleshooting and contributed tremendously to my understanding of Linux OS as this pattern repeats over and over. As you may know, BIG-IP runs on top of Linux. The strace command First off, I'm usingstracecommand whichinterceptsand prints system callscalledby a process¹and the signals²receivedby a process. If I didn't add the redirection2>&1, the egrep command wouldn't work because it filters the file descriptor (FD) 1 (stdout) butstracewrites to FD 2 (stderr). Note that I'm attachingstraceto my bash shell process ID (4716). Fir this reason, I added the-foption to capture the shell's behaviour of creating a new child sub-shell process in order to executels. It does that because if Linux were to executelsdirectly,lsprogram would take over current process (bash shell) resources and we would not be able to go back to it oncelsexecuted becauselswould be no longer available as it's just been overwritten. Instead, bash shell creates an exact copy of itself by callingclone()system call and then executeslsso that ls resources are written to this new process and not to parent process. In fact, this new cloned processbecomesls. Interesting eh? ¹A process is a running instance of a program we execute. ²signals are software interrupts that one process can send to another or to a group of processes. A well known example is kill signal when a process hangs and we want to force a termination of a program. If you don't know what file descriptors and system calls are, please refer to Appendix 1 and 2 at the end. Strace output I can't it's raw output because I've filtered it out but this is what I'm going to explain: In fact, let's work on the output without the shared libraries: Typing "ls" and hitting Enter By default, Linux prompt writes to FD 2 (standard error) which also prints to terminal, just like standard output. When I hit the letterlon my keyboard, Linux reads from my keyboard and writes back to terminal: Both read() and write() system calls receive: file descriptor they're reading from/writing to as first argument the character as second argument the size in bytes of the character What we see here isread()reads from FD 0 (standard input - our keyboard) and writes using write() to FD 2 (standard error) and that ends up printing letter "l" in our terminal. The return value is what's after the equals sign and for both read() and write() it's the number of bytes read/written. If there was an error somehow, the return value would be -1. Bash shell creates a new process of itself The clone() system call is used instead offork()becausefork()doesn't allow child process to share parts of its execution context with the calling process, i.e. the one calling fork(). Modern Linux now mostly usesclone()because there are certain resources (such as file descriptor table, virtual memory address space, etc) that are perfectly fine to be shared between parent↔ child soclone() ends up being more efficient for most cases. So here, my Debian 10.x usesclone()system call: Up to this point, the above process is almost an exact replica of our bash shell except that it now has a memory address (stack) of its own as stack memory cannot be shared¹. flags contains what is shared between the parent process (bash shell) and the new process (the sub-shell that will turn into "ls" command shortly). The flagCLONE_CHILD_CLEARTIDis there to allow another function in the ls code to be able to clean up its memory address. For this reason, we also have to reference the memory address inchild_tidptr=0x7f3ce765ba10(this 0x.. is the actual memory address of ourlscommand). TheCLONE_CHILD_SETTIDstores the child's PID in memory location referenced bychild_tidpt. Lastly,SIGCHLDis the signal that "ls" process will send to parent process (bash shell) once it terminates. ¹A stack is the memory region allocated to a running program that contains objects that are statically allocated such as functions and local variables. There's another region of memory called the heap that store dynamic objects such as pointers. Stack memory is fast and automatically frees memory. Heap memory requires manual allocation using malloc() or calloc() and freeing using free() function. For more details, please refer tothis article here. Execution of ls finally starts I had to filter out other system calls to reduce the complexity of this article. There are other things that happen like memory mappings (using mmap() system call), retrieval of process pid (using getpid() system call),etc. Except for last 2 lines which is literally reading a blank character from terminal and then closing it, I'd just ignore this bit as it's referring to file descriptors that were filtered: The important line here is this one: In reality,execve()doesn't return upon success soI believe the 0 here is juststracesignalling there was no error. What happens here is execve() replaces current virtual address space (from parent process) with a new address space to used independently bylsprogram. We now finally have "ls" as we know it loaded into memory! ls looks for content in current directory The next step is forlscommand to list the contents of the directory we asked. In this case, we're listing the contents of current directory which is represented by a dot: Theopenat()system call creates a new file descriptor (number 3) withthe contents of current directory that we listed and then closes it. Contents are then written to our terminal using write() system call as shown above. Note that strace truncates the full list of directories but it displays the correct amount of bytes written (62 bytes). If you're wondering why FD 3 is closed before ls writes its contents to FD 1 (stdout), keep in mind thatstraceoutput is not the actuallscode! It's just the system calls, i.e. when code needs access to a privileged kernel operation. This snippet from ls.cfrom Linuxcoreutilspackage, shows thatlscode has a function calledprint_dirand inside such function, it uses a native C library functionopendir() to store the contents ofthe directory into a variable calleddirp. In reality, it's not the directory's content but a pointer to it. Theopenat()system call is triggered whenprint_dirfunction executesopendir()as seen below: The bottom line is thatstracewill only show us what is going on from the point of view of system calls. It doesn't give us a complete picture of everything that's going on inlscode. So to answer our question,opendir()function only usesopenat() system call to have access to the contents of current directory. It can then copy it to variable and close it immediately. Terminal prompt gets printed back to us After program closes, Linux prints our terminal prompt back to us: Appendix 1 -What are System Calls? The Linux OS is responsible for management devices, processes, memory and file system. It won't or at least try hard not to let anything coming from the user space to disrupt the health of our system. Therefore, for the most part, tasks like allocating memory, reading/writing from/to files use the kernel as intermediate. So, even printing a hello world in C can trigger a write() system call to write "Hello World" to our terminal. This is all I did: And this is the output of strace filtering onlywrite()system calls: So think of it as Linux trying to protect your computer resources from programs and the end user such as us using a safe API. Appendix 2 - What are File Descriptors? Every program comes with 3 standard file descriptors: 0 (standard input),1 (standard output) and 2 (standard error). These file descriptors are present in a table called file descriptor table that tracks open files for all our programs. When our "Hello World" was printed above, the write() system call "wrote" it to file descriptor 1 (standard output). By default, file descriptor 1 prints to terminal. On the other hand, file descriptor 0 is used byread()system call. I didn't hit enter here, but I just wanted to prove thatread()takes file descriptor 0 as input: It's reading from standard input (0), i.e. whatever we type on keyboard. Standard error (2) is reserved for errors. From FD 3 onwards, programs are free to use if they need to. When we open a file, such file is assigned the next lowest file descriptor number available, which might be 3 for first file, 4 for second file, and so on.4.4KViews2likes0Comments