
Linux i386 Boot Code HOWTO

Feiyun Wang

2004-01-23

Revision History
Revision 1.0    2004-02-19  Revised by: FW
    Initial release, reviewed by LDP
Revision 0.3.3  2004-01-23  Revised by: fyw
    Add decompress_kernel() details; Fix bugs reported in TLDP final review.
Revision 0.3    2003-12-07  Revised by: fyw
    Add contents on SMP, GRUB and LILO; Fix and enhance.
Revision 0.2    2003-08-17  Revised by: fyw
    Adapt to Linux 2.4.20.
Revision 0.1    2003-04-20  Revised by: fyw
    Change to DocBook XML format.

This document describes Linux i386 boot code, serving as a study guide and source commentary. In addition to C-like pseudocode source commentary, it also presents keynotes of toolchains and specs related to kernel development. It is designed to help:

  • kernel newbies to understand Linux i386 boot code, and

  • kernel veterans to recall Linux boot procedure.


1. Introduction

This document serves as a study guide and source commentary for Linux i386 boot code. In addition to C-like pseudocode source commentary, it also presents keynotes of toolchains and specs related to kernel development. It is designed to help:

  • kernel newbies to understand Linux i386 boot code, and

  • kernel veterans to recall Linux boot procedure.

Current release is based on Linux 2.4.20.

The project homepage for this document is hosted by China Linux Forum. Working documents may also be found at the author's personal webpage at Yahoo! GeoCities.


1.1. Copyright and License

This document, Linux i386 Boot Code HOWTO, is copyrighted (c) 2003, 2004 by Feiyun Wang. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is available at http://www.gnu.org/copyleft/fdl.html.

Linux is a registered trademark of Linus Torvalds.


1.2. Disclaimer

No liability for the contents of this document can be accepted. Use the concepts, examples and information at your own risk. There may be errors and inaccuracies that could damage your system. Proceed with caution; although damage is highly unlikely, the author(s) do not take any responsibility for it.

Owners hold all copyrights, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark. Naming of particular products or brands should not be seen as endorsements.


1.3. Credits / Contributors

In this document, I have the pleasure of acknowledging:

Names will remain on this list for a year.


1.4. Feedback

Feedback is most certainly welcome for this document. Send your additions, comments and criticisms to the following email address:


1.5. Translations

English is the only version available now.


2. Linux Makefiles

Before perusing Linux code, we should get a basic idea of how Linux is composed, compiled and linked. A straightforward way to achieve this is to understand the Linux makefiles. Check Cross-Referencing Linux if you prefer online source browsing.


2.1. linux/Makefile

Here are some well-known targets in this top-level makefile:

  • xconfig, menuconfig, config, oldconfig: generate kernel configuration file linux/.config;

  • depend, dep: generate dependency files, like linux/.depend, linux/.hdepend and .depend in subdirectories;

  • vmlinux: generate resident kernel image linux/vmlinux, the most important target;

  • modules, modules_install: generate and install modules in /lib/modules/$(KERNELRELEASE);

  • tags: generate tag file linux/tags, for source browsing with vim.

An overview of linux/Makefile is outlined below:
include .depend
include .config
include arch/i386/Makefile

vmlinux: generate linux/vmlinux
        /* entry point "stext" defined in arch/i386/kernel/head.S */
        $(LD) -T $(TOPDIR)/arch/i386/vmlinux.lds -e stext
        /* $(HEAD) */
        + from arch/i386/Makefile
                arch/i386/kernel/head.o
                arch/i386/kernel/init_task.o
        init/main.o
        init/version.o
        init/do_mounts.o
        --start-group
        /* $(CORE_FILES) */
        + from arch/i386/Makefile
                arch/i386/kernel/kernel.o
                arch/i386/mm/mm.o
        kernel/kernel.o
        mm/mm.o
        fs/fs.o
        ipc/ipc.o
        /* $(DRIVERS) */
        drivers/...
                char/char.o
                block/block.o
                misc/misc.o
                net/net.o
                media/media.o
                cdrom/driver.o
                and other static linked drivers
                + from arch/i386/Makefile
                        arch/i386/math-emu/math.o (ifdef CONFIG_MATH_EMULATION)
        /* $(NETWORKS) */
        net/network.o
        /* $(LIBS) */
        + from arch/i386/Makefile
                arch/i386/lib/lib.a
        lib/lib.a
        --end-group
        -o vmlinux
        $(NM) vmlinux | grep ... | sort > System.map
tags: generate linux/tags for vim
modules: generate modules
modules_install: install modules
clean mrproper distclean: clean up build directory
psdocs pdfdocs htmldocs mandocs: generate kernel documents

include Rules.make

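rpm: generate an rpm
"--start-group" and "--end-group" are ld command line options that resolve circular symbol references: the archives listed between them are searched repeatedly until no new undefined references can be resolved. Refer to Using LD, the GNU linker: Command Line Options for details.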

Rules.make contains rules which are shared between multiple Makefiles.
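
The effect of "--start-group" and "--end-group" can be illustrated with a small Python model. It is a simplified sketch, not ld's actual algorithm: it assumes that an archive member is pulled in only when it defines a currently undefined symbol, and that a group is rescanned until a pass pulls in nothing new. The archives and the symbol names (fa, fb, fa2) are made up for illustration.

```python
def unresolved_after_link(archives, wanted, group=False):
    """Return the symbols still undefined after scanning the archives.

    archives: list of dicts, member name -> (defined symbols, needed symbols)
    wanted:   symbols that are undefined before the archives are scanned
    group:    rescan the whole list until a pass pulls in nothing new
    """
    undefined = set(wanted)
    defined = set()
    pulled = set()
    while True:
        progress = False
        for index, archive in enumerate(archives):
            changed = True
            while changed:              # rescan one archive until it is stable
                changed = False
                for member, (defs, needs) in archive.items():
                    key = (index, member)
                    if key in pulled or not undefined & set(defs):
                        continue
                    pulled.add(key)
                    defined |= set(defs)
                    undefined = (undefined | set(needs)) - defined
                    changed = progress = True
        if not group or not progress:
            return undefined


# Hypothetical archives with a circular dependency: a.o needs fb from
# lib_b, while b.o needs fa2 from lib_a.
lib_a = {"a.o": (["fa"], ["fb"]), "a2.o": (["fa2"], [])}
lib_b = {"b.o": (["fb"], ["fa2"])}

print(unresolved_after_link([lib_a, lib_b], ["fa"]))
print(unresolved_after_link([lib_a, lib_b], ["fa"], group=True))
```

In this model a single left-to-right pass leaves fa2 unresolved, because lib_a has already been scanned when b.o asks for it; the grouped scan goes back and picks it up.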


2.2. linux/arch/i386/vmlinux.lds

After compilation, ld combines a number of object and archive files, relocates their data and ties up symbol references. linux/arch/i386/vmlinux.lds is designated by linux/Makefile as the linker script used in linking the resident kernel image linux/vmlinux.

/* ld script to make i386 Linux kernel
 * Written by Martin Mares <mj@atrey.karlin.mff.cuni.cz>;
 */
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
/* "ENTRY" is overridden by command line option "-e stext" in linux/Makefile */
ENTRY(_start)
/* Output file (linux/vmlinux) layout.
 * Refer to Using LD, the GNU linker: Specifying Output Sections */
SECTIONS
{
/* Output section .text starts at address 3G+1M.
 * Refer to Using LD, the GNU linker: The Location Counter */
  . = 0xC0000000 + 0x100000;
  _text = .;                    /* Text and read-only data */
  .text : {
        *(.text)
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
/* Unallocated holes filled with 0x9090, i.e. opcode for "NOP NOP".
 * Refer to Using LD, the GNU linker: Optional Section Attributes */

  _etext = .;                   /* End of text section */

  .rodata : { *(.rodata) *(.rodata.*) }
  .kstrtab : { *(.kstrtab) }

/* Aligned to next 16-bytes boundary.
 * Refer to Using LD, the GNU linker: Arithmetic Functions */
  . = ALIGN(16);                /* Exception table */
  __start___ex_table = .;
  __ex_table : { *(__ex_table) }
  __stop___ex_table = .;

  __start___ksymtab = .;        /* Kernel symbol table */
  __ksymtab : { *(__ksymtab) }
  __stop___ksymtab = .;

  .data : {                     /* Data */
        *(.data)
        CONSTRUCTORS
        }
/* For "CONSTRUCTORS", refer to
 * Using LD, the GNU linker: Option Commands */

  _edata = .;                   /* End of data section */

  . = ALIGN(8192);              /* init_task */
  .data.init_task : { *(.data.init_task) }

  . = ALIGN(4096);              /* Init code and data */
  __init_begin = .;
  .text.init : { *(.text.init) }
  .data.init : { *(.data.init) }
  . = ALIGN(16);
  __setup_start = .;
  .setup.init : { *(.setup.init) }
  __setup_end = .;
  __initcall_start = .;
  .initcall.init : { *(.initcall.init) }
  __initcall_end = .;
  . = ALIGN(4096);
  __init_end = .;

  . = ALIGN(4096);
  .data.page_aligned : { *(.data.idt) }

  . = ALIGN(32);
  .data.cacheline_aligned : { *(.data.cacheline_aligned) }

  __bss_start = .;              /* BSS */
  .bss : {
        *(.bss)
        }
  _end = . ;

/* Output section /DISCARD/ will not be included in the final link output.
 * Refer to Using LD, the GNU linker: Section Definitions */
  /* Sections to be discarded */
  /DISCARD/ : {
        *(.text.exit)
        *(.data.exit)
        *(.exitcall.exit)
        }

/* The following output sections are addressed at memory location 0.
 * Refer to Using LD, the GNU linker: Optional Section Attributes */
  /* Stabs debugging sections.  */
  .stab 0 : { *(.stab) }
  .stabstr 0 : { *(.stabstr) }
  .stab.excl 0 : { *(.stab.excl) }
  .stab.exclstr 0 : { *(.stab.exclstr) }
  .stab.index 0 : { *(.stab.index) }
  .stab.indexstr 0 : { *(.stab.indexstr) }
  .comment 0 : { *(.comment) }
}
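
The address arithmetic used by this script can be checked with a short Python sketch. The helper align() models ld's ALIGN(n) for power-of-two boundaries; its name, the constant names and the sample location counter value are illustrative assumptions and do not come from the kernel source.

```python
BASE_3G = 0xC0000000            # start address used by the script ("3G")
OFFSET_1M = 0x100000            # "1M"


def align(location, boundary):
    """Round location up to the next multiple of boundary (a power of two)."""
    return (location + boundary - 1) & ~(boundary - 1)


text_start = BASE_3G + OFFSET_1M
print(hex(text_start))

# A hypothetical location counter somewhere after the .data section:
dot = 0xC0251234
print(hex(align(dot, 16)), hex(align(dot, 4096)), hex(align(dot, 8192)))
```

An address that already sits on the boundary is returned unchanged, which matches how ". = ALIGN(4096);" behaves when it appears twice in a row in the script.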


2.3. linux/arch/i386/Makefile

linux/arch/i386/Makefile is included by linux/Makefile to provide i386 specific items and terms.

All the following targets depend on target vmlinux of linux/Makefile. They are accomplished by making corresponding targets in linux/arch/i386/boot/Makefile with some options.

Table 1. Targets in linux/arch/i386/Makefile

Target       Command
zImage [a]   @$(MAKE) -C arch/i386/boot zImage [b]
bzImage      @$(MAKE) -C arch/i386/boot bzImage
zlilo        @$(MAKE) -C arch/i386/boot BOOTIMAGE=zImage zlilo
bzlilo       @$(MAKE) -C arch/i386/boot BOOTIMAGE=bzImage zlilo
zdisk        @$(MAKE) -C arch/i386/boot BOOTIMAGE=zImage zdisk
bzdisk       @$(MAKE) -C arch/i386/boot BOOTIMAGE=bzImage zdisk
install      @$(MAKE) -C arch/i386/boot BOOTIMAGE=bzImage install
Notes:
a. zImage alias: compressed;
b. "-C" is a MAKE command line option to change directory before reading makefiles;
   Refer to GNU make: Summary of Options and GNU make: Recursive Use of make.

It is worth noticing that this makefile redefines some environment variables which are exported by linux/Makefile, specifically:
OBJCOPY=$(CROSS_COMPILE)objcopy -O binary -R .note -R .comment -S
The effect will be passed to subdirectory makefiles and will change the tool's behavior. Refer to GNU Binary Utilities: objcopy for objcopy command line option details.

Not sure why $(LIBS) includes "$(TOPDIR)/arch/i386/lib/lib.a" twice:
LIBS := $(TOPDIR)/arch/i386/lib/lib.a $(LIBS) $(TOPDIR)/arch/i386/lib/lib.a
It may be employed to work around linking problems with some toolchains.


2.4. linux/arch/i386/boot/Makefile

linux/arch/i386/boot/Makefile is somewhat independent, as it is not included by either linux/arch/i386/Makefile or linux/Makefile.

However, they do have some relationship:

  • linux/Makefile: provides resident kernel image linux/vmlinux;

  • linux/arch/i386/boot/Makefile: provides bootstrap;

  • linux/arch/i386/Makefile: makes sure linux/vmlinux is ready before the bootstrap is constructed, and exports targets (like bzImage) to linux/Makefile.

$(BOOTIMAGE) value, which is used by targets zdisk, zlilo and install, comes from linux/arch/i386/Makefile.

Table 2. Targets in linux/arch/i386/boot/Makefile

Target    Command
zImage    $(OBJCOPY) compressed/vmlinux compressed/vmlinux.out
          tools/build bootsect setup compressed/vmlinux.out $(ROOT_DEV) > zImage
bzImage   $(OBJCOPY) compressed/bvmlinux compressed/bvmlinux.out
          tools/build -b bbootsect bsetup compressed/bvmlinux.out $(ROOT_DEV) \
                  > bzImage
zdisk     dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0
zlilo     if [ -f $(INSTALL_PATH)/vmlinuz ]; then mv $(INSTALL_PATH)/vmlinuz
                  $(INSTALL_PATH)/vmlinuz.old; fi
          if [ -f $(INSTALL_PATH)/System.map ]; then mv $(INSTALL_PATH)/System.map
                  $(INSTALL_PATH)/System.old; fi
          cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz
          cp $(TOPDIR)/System.map $(INSTALL_PATH)/
          if [ -x /sbin/lilo ]; then /sbin/lilo; else /etc/lilo/install; fi
install   sh -x ./install.sh $(KERNELRELEASE) $(BOOTIMAGE) $(TOPDIR)/System.map
                  "$(INSTALL_PATH)"
tools/build builds boot image zImage from {bootsect, setup, compressed/vmlinux.out}, or bzImage from {bbootsect, bsetup, compressed/bvmlinux.out}. $(ROOT_DEV) comes from linux/Makefile, which has "export ROOT_DEV = CURRENT". Note that $(OBJCOPY) has been redefined by linux/arch/i386/Makefile in Section 2.3.

Table 3. Supporting targets in linux/arch/i386/boot/Makefile

Target: Prerequisites                Command
compressed/vmlinux: linux/vmlinux    @$(MAKE) -C compressed vmlinux
compressed/bvmlinux: linux/vmlinux   @$(MAKE) -C compressed bvmlinux
tools/build: tools/build.c           $(HOSTCC) $(HOSTCFLAGS) -o $@ $< -I$(TOPDIR)/include [a]
bootsect: bootsect.o                 $(LD) -Ttext 0x0 -s --oformat binary bootsect.o [b]
bootsect.o: bootsect.s               $(AS) -o $@ $<
bootsect.s: bootsect.S ...           $(CPP) $(CPPFLAGS) -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
bbootsect: bbootsect.o               $(LD) -Ttext 0x0 -s --oformat binary $< -o $@
bbootsect.o: bbootsect.s             $(AS) -o $@ $<
bbootsect.s: bootsect.S ...          $(CPP) $(CPPFLAGS) -D__BIG_KERNEL__ -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
setup: setup.o                       $(LD) -Ttext 0x0 -s --oformat binary -e begtext -o $@ $<
setup.o: setup.s                     $(AS) -o $@ $<
setup.s: setup.S video.S ...         $(CPP) $(CPPFLAGS) -D__ASSEMBLY__ -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
bsetup: bsetup.o                     $(LD) -Ttext 0x0 -s --oformat binary -e begtext -o $@ $<
bsetup.o: bsetup.s                   $(AS) -o $@ $<
bsetup.s: setup.S video.S ...        $(CPP) $(CPPFLAGS) -D__BIG_KERNEL__ -D__ASSEMBLY__ -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
Notes:
a. "$@" means target, "$<" means first prerequisite; Refer to GNU make: Automatic Variables;
b. "--oformat binary" asks for raw binary output, which is identical to the memory dump of the executable; Refer to Using LD, the GNU linker: Command Line Options.
Note that "-D__BIG_KERNEL__" is defined when preprocessing bootsect.S to bbootsect.s and setup.S to bsetup.s. Both must be Position Independent Code (PIC), so the value given to the "-Ttext" option does not matter.


2.5. linux/arch/i386/boot/compressed/Makefile

This makefile handles the image (de)compression mechanism.

It is good to separate (de)compression from bootstrap. This divide-and-conquer solution allows us to easily improve the (de)compression mechanism or to adopt a new bootstrap method.

Directory linux/arch/i386/boot/compressed/ contains two source files: head.S and misc.c.

Table 4. Targets in linux/arch/i386/boot/compressed/Makefile

Target        Command
vmlinux [a]   $(LD) -Ttext 0x1000 -e startup_32 -o vmlinux head.o misc.o piggy.o
bvmlinux      $(LD) -Ttext 0x100000 -e startup_32 -o bvmlinux head.o misc.o piggy.o
head.o        $(CC) $(AFLAGS) -traditional -c head.S
misc.o        $(CC) $(CFLAGS) -DKBUILD_BASENAME=$(subst $(comma),_,$(subst -,_,$(*F)))
                      -c misc.c [b]
piggy.o       tmppiggy=_tmp_$$$$piggy; \
              rm -f $$tmppiggy $$tmppiggy.gz $$tmppiggy.lnk; \
              $(OBJCOPY) $(SYSTEM) $$tmppiggy; \
              gzip -f -9 < $$tmppiggy > $$tmppiggy.gz; \
              echo "SECTIONS { .data : { input_len = .; \
                      LONG(input_data_end - input_data) input_data = .; \
                      *(.data) input_data_end = .; }}" > $$tmppiggy.lnk; \
              $(LD) -r -o piggy.o -b binary $$tmppiggy.gz -b elf32-i386 \
                      -T $$tmppiggy.lnk; \
              rm -f $$tmppiggy $$tmppiggy.gz $$tmppiggy.lnk
Notes:
a. Target vmlinux here is different from that defined in linux/Makefile;
b. "subst" is a MAKE function; Refer to GNU make: Functions for String Substitution and Analysis.

piggy.o contains variable input_len and gzipped linux/vmlinux. input_len is placed at the beginning of the .data section of piggy.o and holds the size of the gzipped data that follows it (input_data_end - input_data). Refer to Using LD, the GNU linker: Section Data Expressions for "LONG(expression)" in piggy.o linker script.
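
The data layout produced by the piggy.o linker script can be mimicked in a few lines of Python. This sketch assumes a 4-byte little-endian length (the natural result of LONG() on an i386 target) followed directly by the gzipped image. make_piggy_data() and split_piggy_data() are hypothetical helpers, not part of the kernel build, and the input bytes are a stand-in for the real objcopy output.

```python
import gzip
import struct


def make_piggy_data(raw_image):
    """Return input_len (4 bytes, little-endian) followed by the gzipped image."""
    compressed = gzip.compress(raw_image, compresslevel=9)
    return struct.pack("<I", len(compressed)) + compressed


def split_piggy_data(blob):
    """Return (input_len, input_data) from a blob built by make_piggy_data()."""
    (input_len,) = struct.unpack_from("<I", blob, 0)
    return input_len, blob[4:4 + input_len]


raw = b"\x90" * 4096                    # stand-in for the objcopy output
blob = make_piggy_data(raw)
input_len, input_data = split_piggy_data(blob)
print(input_len, len(blob))
```

A decompressor that knows the address of input_len therefore also knows where the compressed data starts and how many bytes to feed to the inflate routine.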

To be exact, it is not linux/vmlinux itself (in ELF format) that is gzipped but its binary image, which is generated by objcopy command. Note that $(OBJCOPY) has been redefined by linux/arch/i386/Makefile in Section 2.3 to output raw binary using "-O binary" option.

When linking {bootsect, setup} or {bbootsect, bsetup}, $(LD) specifies "--oformat binary" option to output them in binary format. When making zImage (or bzImage), $(OBJCOPY) generates an intermediate binary output from compressed/vmlinux (or compressed/bvmlinux) too. It is vital that all components in zImage or bzImage are in raw binary format, so that the image can run by itself without asking a loader to load and relocate it.

Both vmlinux and bvmlinux prepend head.o and misc.o before piggy.o, but they are linked against different start addresses (0x1000 vs 0x100000).


2.6. linux/arch/i386/boot/tools/build.c

linux/arch/i386/boot/tools/build.c is a host utility to generate zImage or bzImage.

In linux/arch/i386/boot/Makefile:
tools/build bootsect setup compressed/vmlinux.out $(ROOT_DEV) > zImage

tools/build -b bbootsect bsetup compressed/bvmlinux.out $(ROOT_DEV) > bzImage
"-b" means is_big_kernel, used to check whether system image is too big.

tools/build outputs the following components to stdout, which is redirected to zImage or bzImage:

  1. bootsect or bbootsect: from linux/arch/i386/boot/bootsect.S, 512 bytes;

  2. setup or bsetup: from linux/arch/i386/boot/setup.S, 4 sectors or more, sector aligned;

  3. compressed/vmlinux.out or compressed/bvmlinux.out, including:

    1. head.o: from linux/arch/i386/boot/compressed/head.S;

    2. misc.o: from linux/arch/i386/boot/compressed/misc.c;

    3. piggy.o: from input_len and gzipped linux/vmlinux.

tools/build will change some contents of bootsect or bbootsect when outputting to stdout:

Table 5. Modification made by tools/build

Offset      Bytes  Variable       Comment
1F1 (497)   1      setup_sectors  number of setup sectors, >=4
1F4 (500)   2      sys_size       system size in 16-byte units, little-endian
1FC (508)   1      minor_root     root dev minor
1FD (509)   1      major_root     root dev major
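
The patching listed in Table 5 can be sketched as follows. This is a Python model rather than a translation of build.c: it assumes the system size is rounded up to whole 16-byte units and the setup size to whole sectors with a minimum of 4. patch_bootsect() and the sample sizes are hypothetical.

```python
def patch_bootsect(bootsect, setup_size, system_size, major_root, minor_root):
    """Return a copy of the 512-byte boot sector with the Table 5 fields set."""
    if len(bootsect) != 512:
        raise ValueError("boot sector must be exactly 512 bytes")
    setup_sectors = max(4, (setup_size + 511) // 512)   # whole sectors, >= 4
    sys_size = (system_size + 15) // 16                 # 16-byte units
    if setup_sectors > 0xFF or sys_size > 0xFFFF:
        raise ValueError("value does not fit in its header field")
    patched = bytearray(bootsect)
    patched[497] = setup_sectors
    patched[500:502] = sys_size.to_bytes(2, "little")
    patched[508] = minor_root
    patched[509] = major_root
    return bytes(patched)


sector = patch_bootsect(bytes(512), setup_size=2500, system_size=700001,
                        major_root=3, minor_root=1)
print(sector[497], sector[500:502].hex(), sector[508], sector[509])
```

Because minor_root sits at the lower address, reading offsets 508 and 509 as one little-endian word gives the device number with the major in the high byte.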

In the following chapters, compressed/vmlinux is referred to as vmlinux and compressed/bvmlinux as bvmlinux, where this causes no confusion.


2.7. Reference


3. linux/arch/i386/boot/bootsect.S

Assume that we are booting bzImage, which is composed of bbootsect, bsetup and bvmlinux (head.o, misc.o, piggy.o). The first floppy sector, bbootsect (512 bytes), which is compiled from linux/arch/i386/boot/bootsect.S, is loaded by BIOS to 07C0:0. The rest of bzImage (bsetup and bvmlinux) has not been loaded yet.


3.1. Move Bootsect

SETUPSECTS      = 4                     /* default nr of setup-sectors */
BOOTSEG         = 0x07C0                /* original address of boot-sector */
INITSEG         = DEF_INITSEG  (0x9000) /* we move boot here - out of the way */
SETUPSEG        = DEF_SETUPSEG (0x9020) /* setup starts here */
SYSSEG          = DEF_SYSSEG   (0x1000) /* system loaded at 0x10000 (65536) */
SYSSIZE         = DEF_SYSSIZE  (0x7F00) /* system size: # of 16-byte clicks */
                                        /* to be loaded */
ROOT_DEV        = 0                     /* ROOT_DEV is now written by "build" */
SWAP_DEV        = 0                     /* SWAP_DEV is now written by "build" */

.code16
.text

///////////////////////////////////////////////////////////////////////////////
_start:
{
        // move ourself from 0x7C00 to 0x90000 and jump there.
        move BOOTSEG:0 to INITSEG:0 (512 bytes);
        goto INITSEG:go;
}
bbootsect has been moved to INITSEG:0 (0x9000:0). Now we can forget BOOTSEG.
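
The segment:offset pairs used throughout this chapter map to linear addresses as segment * 16 + offset. The following Python sketch only restates that rule with the constants defined above; linear() is an illustrative helper name.

```python
def linear(segment, offset):
    """Real-mode address translation: segment * 16 + offset."""
    return (segment << 4) + offset


BOOTSEG, INITSEG, SETUPSEG, SYSSEG = 0x07C0, 0x9000, 0x9020, 0x1000

for name, seg in [("BOOTSEG", BOOTSEG), ("INITSEG", INITSEG),
                  ("SETUPSEG", SETUPSEG), ("SYSSEG", SYSSEG)]:
    print(f"{name}:0 -> {linear(seg, 0):#x}")
```

The same rule shows that INITSEG:0200 and SETUPSEG:0 name the same byte, which is why setup can be loaded "right after" the moved boot sector.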


3.2. Get Disk Parameters

///////////////////////////////////////////////////////////////////////////////
// prepare stack and disk parameter table
go:
{
        SS:SP = INITSEG:3FF4;   // put stack at INITSEG:0x4000-12
        /* 0x4000 is an arbitrary value >=
         *   length of bootsect + length of setup + room for stack;
         * 12 is disk parm size. */
        copy disk parameter (pointer in 0:0078) to INITSEG:3FF4 (12 bytes);
        // int1E: SYSTEM DATA - DISKETTE PARAMETERS
        patch sector count to 36 (offset 4 in parameter table, 1 byte);
        set disk parameter table pointer (0:0078, int1E) to INITSEG:3FF4;
}
Make sure SP is initialized immediately after SS register. The recommended method of modifying SS is to use "lss" instruction according to IA-32 Intel Architecture Software Developer's Manual (Vol.3. Ch.5.8.3. Masking Exceptions and Interrupts When Switching Stacks).

Stack operations, such as push and pop, will be OK now. First 12 bytes of disk parameter have been copied to INITSEG:3FF4.
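
The relocation of the disk parameter table can be modelled in Python on a fake 1 MB memory image. This is a sketch under stated assumptions: the int1E vector at 0:0078 is stored as offset followed by segment, and the source table address (F000:EFC7) and its contents are made up for the demonstration. relocate_disk_params() is a hypothetical name.

```python
import struct

INITSEG = 0x9000
PARM_OFFSET = 0x4000 - 12       # 0x3FF4, also the initial SP
VECTOR_1E = 0x1E * 4            # 0x0078, far pointer stored as offset, segment


def relocate_disk_params(mem):
    """Copy the 12-byte diskette parameter table and repoint int 1Eh at it."""
    offset, segment = struct.unpack_from("<HH", mem, VECTOR_1E)
    src = (segment << 4) + offset
    dst = (INITSEG << 4) + PARM_OFFSET
    mem[dst:dst + 12] = mem[src:src + 12]
    mem[dst + 4] = 36                           # patch sector count
    struct.pack_into("<HH", mem, VECTOR_1E, PARM_OFFSET, INITSEG)
    return dst


# Fake 1 MB of real-mode memory with a made-up table at F000:EFC7.
mem = bytearray(1 << 20)
struct.pack_into("<HH", mem, VECTOR_1E, 0xEFC7, 0xF000)
mem[0xFEFC7:0xFEFC7 + 12] = bytes(range(1, 13))
new_table = relocate_disk_params(mem)
print(hex(new_table), mem[new_table:new_table + 12].hex())
```

Only the byte at offset 4 differs between the original table and the copy; the copy occupies the 12 bytes directly above the initial stack pointer.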

///////////////////////////////////////////////////////////////////////////////
// get disk drive parameters, specifically number of sectors/track.
        char disksizes[] = {36, 18, 15, 9};
        int sectors;
{
        SI = disksizes;                         // i = 0;
        do {
probe_loop:
                sectors = DS:[SI++];            // sectors = disksizes[i++];
                if (SI>=disksizes+4) break;     // if (i>=4) break;
                int13/AH=02h(AL=1, ES:BX=INITSEG:0200, CX=sectors, DX=0);
                // int13/AH=02h: DISK - READ SECTOR(S) INTO MEMORY
        } while (failed to read sectors);
}
"lodsb" loads a byte from DS:[SI] to AL and increases SI automatically.

The number of sectors per track has been saved in variable sectors.
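
The probing order can be reproduced with a small Python function. It follows the pseudocode above; can_read_sector is a hypothetical callback standing in for the int13/AH=02h read of one sector on track 0. Note that the last table entry (9) is assumed without a read attempt.

```python
DISKSIZES = (36, 18, 15, 9)


def probe_sectors(can_read_sector):
    """Return sectors per track, mirroring probe_loop.

    can_read_sector(n) stands in for int13/AH=02h reading sector n of
    track 0; the last table entry is assumed without a read attempt.
    """
    i = 0
    while True:
        sectors = DISKSIZES[i]
        i += 1
        if i >= len(DISKSIZES):
            return sectors
        if can_read_sector(sectors):
            return sectors


# A hypothetical 1.44 MB floppy has 18 sectors per track.
print(probe_sectors(lambda n: n <= 18))
```

Probing from the largest value downwards works because a read of sector n only succeeds when the track has at least n sectors.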


3.3. Load Setup Code

bsetup (setup_sects sectors) will be loaded right after bbootsect, i.e. SETUPSEG:0. Note that INITSEG:0200==SETUPSEG:0 and setup_sects has been changed by tools/build to match bsetup size in Section 2.6.

///////////////////////////////////////////////////////////////////////////////
got_sectors:
        word sread;             // sectors read for current track
        char setup_sects;       // overwritten by tools/build
{
        print out "Loading";
        /* int10/AH=03h(BH=0): VIDEO - GET CURSOR POSITION AND SIZE
         * int10/AH=13h(AL=1, BH=0, BL=7, CX=9, DH=DL=0, ES:BP=INITSEG:$msg1):
         *   VIDEO - WRITE STRING */

        // load setup-sectors directly after the moved bootblock (at 0x90200).
        SI = &sread;            // using SI to index sread, head and track
        sread = 1;              // the boot sector has already been read

        int13/AH=00h(DL=0);     // reset FDC

        BX = 0x0200;            // read bsetup right after bbootsect (512 bytes)
        do {
next_step:
                /* to prevent cylinder crossing reading,
                 *   calculate how many sectors to read this time */
                uint16 pushw_ax = AX = MIN(sectors-sread, setup_sects);
no_cyl_crossing:
                read_track(AL, ES:BX);          // AX is not modified
                // set ES:BX, sread, head and track for next read_track()
                set_next(AX);
                setup_sects -= pushw_ax;        // rest - for next step
        } while (setup_sects);
}
SI is set to the address of sread to index variables sread, head and track, as they are contiguous in memory. Check Section 3.6 for read_track() and set_next() details.


3.4. Load Compressed Image

bvmlinux (head.o, misc.o, piggy.o) will be loaded at 0x100000, syssize*16 bytes.

///////////////////////////////////////////////////////////////////////////////
// load vmlinux/bvmlinux (head.o, misc.o, piggy.o)
{
        read_it(ES=SYSSEG);
        kill_motor();                           // turn off floppy drive motor
        print_nl();                             // print CR LF
}
Check Section 3.6 for read_it() details. If we are booting up zImage, vmlinux is loaded at 0x10000 (SYSSEG:0).

bzImage (bbootsect, bsetup, bvmlinux) is in the memory as a whole now.
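
The constants defined in Section 3.1 suggest why a zImage system must stay small: loaded at SYSSEG:0, it has to end below the relocated boot sector at INITSEG:0. The following Python arithmetic is an interpretation of those constants, not something stated by the source; fits_low() is a hypothetical helper.

```python
SYSSEG, INITSEG = 0x1000, 0x9000
DEF_SYSSIZE = 0x7F00                    # in 16-byte units

zimage_start = SYSSEG << 4              # 0x10000
zimage_limit = zimage_start + DEF_SYSSIZE * 16
bootsect_start = INITSEG << 4           # 0x90000


def fits_low(system_bytes):
    """True if an image of this size loaded at SYSSEG:0 ends below INITSEG:0."""
    return zimage_start + system_bytes <= bootsect_start


print(hex(zimage_limit), hex(bootsect_start), DEF_SYSSIZE * 16 // 1024)
```

bvmlinux avoids this ceiling because it is moved to 0x100000, above the first megabyte.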


3.5. Go Setup

///////////////////////////////////////////////////////////////////////////////
// check which root-device to use and jump to setup.S
        int root_dev;                           // overwritten by tools/build
{
        if (!root_dev) {
                switch (sectors) {
                case 15: root_dev = 0x0208;     // /dev/ps0 - 1.2Mb
                        break;
                case 18: root_dev = 0x021C;     // /dev/PS0 - 1.44Mb
                        break;
                case 36: root_dev = 0x0220;     // /dev/fd0H2880 - 2.88Mb
                        break;
                default: root_dev = 0x0200;     // /dev/fd0 - auto detect
                        break;
                }
        }

        // jump to the setup-routine loaded directly after the bootblock
        goto SETUPSEG:0;
}
It passes control to bsetup. See linux/arch/i386/boot/setup.S:start in Section 4.
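
The default root device values above can be decoded with a short Python sketch. It assumes the usual split of the 16-bit device number into a major (high byte) and minor (low byte), which matches the byte order shown in Table 5; both function names are illustrative.

```python
def default_root_dev(sectors):
    """Pick the floppy root device number from the probed sectors per track."""
    return {15: 0x0208, 18: 0x021C, 36: 0x0220}.get(sectors, 0x0200)


def split_dev(root_dev):
    """Split a 16-bit device number into (major, minor)."""
    return root_dev >> 8, root_dev & 0xFF


for sectors in (9, 15, 18, 36):
    print(sectors, split_dev(default_root_dev(sectors)))
```

All four values share major number 2; only the minor number changes with the detected geometry.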


3.6. Read Disk

The following functions are used to load bsetup and bvmlinux from disk. Note that syssize has been changed by tools/build in Section 2.6 too.
sread:  .word 0                         # sectors read of current track
head:   .word 0                         # current head
track:  .word 0                         # current track
///////////////////////////////////////////////////////////////////////////////
// load the system image at address SYSSEG:0
read_it(ES=SYSSEG)
        int syssize;                    /* system size in 16-bytes,
                                         *   overwritten by tools/build */
{
        if (ES & 0x0fff) die;           // not 64KB aligned

        BX = 0;
        for (;;) {
rp_read:
#ifdef __BIG_KERNEL__
                bootsect_helper(ES:BX);
                /* INITSEG:0220==SETUPSEG:0020 is bootsect_kludge,
                 *   which contains pointer SETUPSEG:bootsect_helper().
                 * This function initializes some data structures
                 *   when it is called for the first time,
                 *   and moves SYSSEG:0 to 0x100000, 64KB each time,
                 *   in the following calls.
                 * See Section 3.7. */
#else
                AX = ES - SYSSEG + ( BX >> 4);  // how many 16-bytes read
#endif
                if (AX > syssize) return;       // everything loaded
ok1_read:
                /* Get proper AL (sectors to read) for this time
                 *   to prevent cylinder crossing reading and BX overflow. */
                AX = sectors - sread;
                CX = BX + (AX << 9);            // 1 sector = 2^9 bytes
                if (CX overflow && CX!=0) {     // > 64KB
                        AX = (-BX) >> 9;
                }
ok2_read:
                read_track(AL, ES:BX);
                set_next(AX);
        }
}

///////////////////////////////////////////////////////////////////////////////
// read disk with parameters (sread, track, head)
read_track(AL sectors, ES:BX destination)
{
        for (;;) {
                printf(".");
                // int10/AH=0Eh: VIDEO - TELETYPE OUTPUT

                // set CX, DX according to (sread, track, head)
                DX = track;
                CX = sread + 1;
                CH = DL;

                DX = head;
                DH = DL;
                DX &= 0x0100;

                int13/AH=02h(AL, ES:BX, CX, DX);
                // int13/AH=02h: DISK - READ SECTOR(S) INTO MEMORY
                if (read disk success) return;
                // "addw $8, %sp" is to cancel previous 4 "pushw" operations.
bad_rt:
                print_all();            // print error code, AX, BX, CX and DX
                int13/AH=00h(DL=0);     // reset FDC
        }
}

///////////////////////////////////////////////////////////////////////////////
// set ES:BX, sread, head and track for next read_track()
set_next(AX sectors_read)
{
        CX = AX;                        // sectors read
        AX += sread;
        if (AX==sectors) {
                head = 1 ^ head;        // flip head between 0 and 1
                if (head==0) track++;
ok4_set:
                AX = 0;
        }
ok3_set:
        sread = AX;
        BX += CX << 9;                  // 1 sector = 2^9 bytes
        if (BX overflow) {              // > 64KB
                ES += 0x1000;
                BX = 0;
        }
set_next_fin:
}
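
The interplay of read_it() and set_next() is easier to follow with concrete numbers. The Python sketch below models only the non-__BIG_KERNEL__ bookkeeping (no disk access, no bootsect_helper()), treating BX as a 16-bit register. The Loader class and the chosen geometry are assumptions made for illustration.

```python
class Loader:
    """Models sread/head/track and ES:BX for the non-__BIG_KERNEL__ path."""

    def __init__(self, sectors, sread, es=0x1000):
        self.sectors, self.sread = sectors, sread
        self.head = self.track = 0
        self.es, self.bx = es, 0

    def next_count(self):
        """ok1_read: sectors to read without crossing a track or 64KB."""
        count = self.sectors - self.sread
        end = self.bx + (count << 9)
        if end > 0xFFFF and (end & 0xFFFF) != 0:    # carry set, result != 0
            count = ((-self.bx) & 0xFFFF) >> 9
        return count

    def set_next(self, count):
        """Advance sread/head/track and ES:BX after reading count sectors."""
        sread = self.sread + count
        if sread == self.sectors:
            self.head ^= 1
            if self.head == 0:
                self.track += 1
            sread = 0
        self.sread = sread
        self.bx += count << 9
        if self.bx > 0xFFFF:                        # 64KB segment is full
            self.es += 0x1000
            self.bx = 0


# Hypothetical run: 18 sectors/track, boot sector plus 4 setup sectors read.
loader = Loader(sectors=18, sread=5)
counts = []
while loader.es == 0x1000:
    counts.append(loader.next_count())
    loader.set_next(counts[-1])
print(counts, loader.head, loader.track, loader.sread)
```

The first read finishes the current track, the following reads take whole tracks, and the last one is shortened so that the buffer ends exactly on the 64KB boundary before ES advances.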


3.7. Bootsect Helper

setup.S:bootsect_helper() is only used by bootsect.S:read_it().

Because bbootsect and bsetup are linked separately, each uses offsets relative to its own code/data segment. bootsect.S must therefore reach bootsect_helper() with a far call (lcall), and the helper must return with a far return (lret). The far call changes CS, so CS!=DS inside the helper, and setup.S has to use a segment override prefix to access its own variables.

///////////////////////////////////////////////////////////////////////////////
// called by bootsect loader when loading bzImage
bootsect_helper(ES:BX)
        bootsect_es = 0;                // defined in setup.S
        type_of_loader = 0;             // defined in setup.S
{
        if (!bootsect_es) {             // called for the first time
                type_of_loader = 0x20;  // bootsect-loader, version 0
                AX = ES >> 4;
                *(byte*)(&bootsect_src_base+2) = AH;
                bootsect_es = ES;
                AX = ES - SYSSEG;
                return;
        }
bootsect_second:
        if (!BX) {                      // 64KB full
                // move from SYSSEG:0 to destination, 64KB each time
                int15/AH=87h(CX=0x8000, ES:SI=CS:bootsect_gdt);
                // int15/AH=87h: SYSTEM - COPY EXTENDED MEMORY
                if (failed to copy) {
                        bootsect_panic() {
                                prtstr("INT15 refuses to access high mem, "
                                        "giving up.");
bootsect_panic_loop:            goto bootsect_panic_loop;   // never return
                        }
                }
                ES = bootsect_es;       // reset ES to always point to 0x10000
                *(byte*)(&bootsect_dst_base+2)++;
        }
bootsect_ex:
        // have the number of moved frames (16-bytes) in AX
        AH = *(byte*)(&bootsect_dst_base+2) << 4;
        AL = 0;
}

///////////////////////////////////////////////////////////////////////////////
// data used by bootsect_helper()
bootsect_gdt:
        .word   0, 0, 0, 0
        .word   0, 0, 0, 0

bootsect_src:
        .word   0xffff

bootsect_src_base:
        .byte   0x00, 0x00, 0x01                # base = 0x010000
        .byte   0x93                            # typbyte
        .word   0                               # limit16,base24 =0

bootsect_dst:
        .word   0xffff

bootsect_dst_base:
        .byte   0x00, 0x00, 0x10                # base = 0x100000
        .byte   0x93                            # typbyte
        .word   0                               # limit16,base24 =0
        .word   0, 0, 0, 0                      # BIOS CS
        .word   0, 0, 0, 0                      # BIOS DS

bootsect_es:
        .word   0

bootsect_panic_mess:
        .string "INT15 refuses to access high mem, giving up."
Note that type_of_loader value is changed. It will be referenced in Section 4.3.
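
The way bootsect_helper() steers the copy destination can be shown on the descriptor bytes alone. The Python sketch below rebuilds the 8-byte layout of bootsect_src/bootsect_dst (limit word, 24-bit base, type byte, zero word) and bumps the same byte that the pseudocode increments. The helper names are hypothetical and no BIOS call is modelled.

```python
def descriptor(base, limit=0xFFFF, access=0x93):
    """Build the 8-byte layout used by bootsect_src and bootsect_dst."""
    return (limit.to_bytes(2, "little") + base.to_bytes(3, "little")
            + bytes([access]) + bytes(2))


def descriptor_base(desc):
    """Read back the 24-bit base stored in bytes 2..4."""
    return int.from_bytes(desc[2:5], "little")


def moved_clicks(dst_base_byte2):
    """AX at bootsect_ex: AH = (byte << 4) truncated to 8 bits, AL = 0."""
    return ((dst_base_byte2 << 4) & 0xFF) << 8


dst = bytearray(descriptor(0x100000))
for _ in range(3):              # three 64KB blocks copied to high memory
    dst[4] += 1                 # same byte as bootsect_dst_base+2
print(hex(descriptor_base(dst)), hex(moved_clicks(dst[4])))
```

Incrementing the top base byte advances the destination by 0x10000, so each int15/AH=87h call lands one 64KB block higher, and the value returned in AX grows by 0x1000 16-byte units per block.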


3.8. Miscellaneous

The rest are supporting functions, variables and part of "real-mode kernel header". Note that data is in .text segment as code, thus it can be properly initialized when loaded.
///////////////////////////////////////////////////////////////////////////////
// some small functions
print_all();  /* print error code, AX, BX, CX and DX */
print_nl();   /* print CR LF */
print_hex();  /* print the word pointed to by SS:BP in hexadecimal */
kill_motor()  /* turn off floppy drive motor */
{
#if 1
        int13/AH=00h(DL=0);     // reset FDC
#else
        outb(0, 0x3F2);         // outb(val, port)
#endif
}

///////////////////////////////////////////////////////////////////////////////
sectors:        .word 0
disksizes:      .byte 36, 18, 15, 9
msg1:           .byte 13, 10
                .ascii "Loading"

Bootsect trailer, which is a part of "real-mode kernel header", begins at offset 497.
.org 497
setup_sects:    .byte SETUPSECTS        // overwritten by tools/build
root_flags:     .word ROOT_RDONLY
syssize:        .word SYSSIZE           // overwritten by tools/build
swap_dev:       .word SWAP_DEV
ram_size:       .word RAMDISK
vid_mode:       .word SVGA_MODE
root_dev:       .word ROOT_DEV          // overwritten by tools/build
boot_flag:      .word 0xAA55

This "header" must conform to the layout pattern in linux/Documentation/i386/boot.txt:
Offset  Proto   Name            Meaning
/Size
01F1/1  ALL     setup_sects     The size of the setup in sectors
01F2/2  ALL     root_flags      If set, the root is mounted readonly
01F4/2  ALL     syssize         DO NOT USE - for bootsect.S use only
01F6/2  ALL     swap_dev        DO NOT USE - obsolete
01F8/2  ALL     ram_size        DO NOT USE - for bootsect.S use only
01FA/2  ALL     vid_mode        Video mode control
01FC/2  ALL     root_dev        Default root device number
01FE/2  ALL     boot_flag       0xAA55 magic number
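
The trailer layout above can be checked mechanically. The following is a minimal C sketch, not kernel code: the helper names read_le16(), bootsect_trailer_ok() and bootsect_setup_sects() are hypothetical, and the sketch assumes only the offsets listed in the table and a 512-byte copy of the boot sector in memory.

```c
#include <stdint.h>

/* Offsets taken from the table above, relative to the start of bootsect. */
enum {
    OFF_SETUP_SECTS = 0x1F1,    /* decimal 497, matches ".org 497" */
    OFF_BOOT_FLAG   = 0x1FE
};

/* Read a little-endian 16-bit word, as the i386 stores it. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Return 1 if the 512-byte sector ends with the 0xAA55 boot flag. */
static int bootsect_trailer_ok(const uint8_t sector[512])
{
    return read_le16(sector + OFF_BOOT_FLAG) == 0xAA55;
}

/* Return setup_sects, the size of setup in 512-byte sectors. */
static unsigned bootsect_setup_sects(const uint8_t sector[512])
{
    return sector[OFF_SETUP_SECTS];
}
```

Reading the fields byte by byte avoids any dependence on compiler structure packing, which is why the sketch does not declare a C struct for the trailer.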


3.9. Reference

As <IA-32 Intel Architecture Software Developer's Manual> is widely referenced in this document, I will call it "IA-32 Manual" for short.


4. linux/arch/i386/boot/setup.S

setup.S is responsible for getting the system data from the BIOS and putting them into appropriate places in system memory.

Other boot loaders, like GNU GRUB and LILO, can load bzImage too. Such boot loaders should load bzImage into memory and set up the "real-mode kernel header", especially type_of_loader, then pass control to bsetup directly. setup.S assumes:

  • bsetup or setup may not be loaded at SETUPSEG:0, i.e. CS may not be equal to SETUPSEG when control is passed to setup.S;

  • The first 4 sectors of setup are loaded right after bootsect. The rest may be loaded at SYSSEG:0, preceding vmlinux. This assumption does not apply to bsetup.


4.1. Header

/* Signature words to ensure LILO loaded us right */
#define SIG1    0xAA55
#define SIG2    0x5A5A

INITSEG  = DEF_INITSEG          # 0x9000, we move boot here, out of the way
SYSSEG   = DEF_SYSSEG           # 0x1000, system loaded at 0x10000 (65536).
SETUPSEG = DEF_SETUPSEG         # 0x9020, this is the current segment
                                # ... and the former contents of CS

DELTA_INITSEG = SETUPSEG - INITSEG      # 0x0020

.code16
.text

///////////////////////////////////////////////////////////////////////////////
start:
{
        goto trampoline();              // skip the following header
}

# This is the setup header, and it must start at %cs:2 (old 0x9020:2)
                .ascii  "HdrS"          # header signature
                .word   0x0203          # header version number (>= 0x0105)
                                        # or else old loadlin-1.5 will fail)
realmode_swtch: .word   0, 0            # default_switch, SETUPSEG
start_sys_seg:  .word   SYSSEG
                .word   kernel_version  # pointing to kernel version string
                                        # above section of header is compatible
                                        # with loadlin-1.5 (header v1.5). Don't
                                        # change it.
// kernel_version defined below
type_of_loader: .byte   0               # = 0, old one (LILO, Loadlin,
                                        #      Bootlin, SYSLX, bootsect...)
                                        # See Documentation/i386/boot.txt for
                                        # assigned ids
# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH     = 1                     # If set, the kernel is loaded high
CAN_USE_HEAP    = 0x80                  # If set, the loader also has set
                                        # heap_end_ptr to tell how much
                                        # space behind setup.S can be used for
                                        # heap purposes.
                                        # Only the loader knows what is free
#ifndef __BIG_KERNEL__
                .byte   0
#else
                .byte   LOADED_HIGH
#endif
setup_move_size: .word  0x8000          # size to move, when setup is not
                                        # loaded at 0x90000. We will move setup
                                        # to 0x90000 then just before jumping
                                        # into the kernel. However, only the
                                        # loader knows how much data behind
                                        # us also needs to be loaded.
code32_start:                           # here loaders can put a different
                                        # start address for 32-bit code.
#ifndef __BIG_KERNEL__
                .long   0x1000          #   0x1000 = default for zImage
#else
                .long   0x100000        # 0x100000 = default for big kernel
#endif
ramdisk_image:  .long   0               # address of loaded ramdisk image
                                        # Here the loader puts the 32-bit
                                        # address where it loaded the image.
                                        # This only will be read by the kernel.
ramdisk_size:   .long   0               # its size in bytes
bootsect_kludge:
                .word  bootsect_helper, SETUPSEG
heap_end_ptr:   .word   modelist+1024   # (Header version 0x0201 or later)
                                        # space from here (exclusive) down to
                                        # end of setup code can be used by setup
                                        # for local heap purposes.
// modelist is at the end of .text section
pad1:           .word   0
cmd_line_ptr:   .long 0                 # (Header version 0x0202 or later)
                                        # If nonzero, a 32-bit pointer
                                        # to the kernel command line.
                                        # The command line should be
                                        # located between the start of
                                        # setup and the end of low
                                        # memory (0xa0000), or it may
                                        # get overwritten before it
                                        # gets read.  If this field is
                                        # used, there is no longer
                                        # anything magical about the
                                        # 0x90000 segment; the setup
                                        # can be located anywhere in
                                        # low memory 0x10000 or higher.
ramdisk_max:    .long __MAXMEM-1        # (Header version 0x0203 or later)
                                        # The highest safe address for
                                        # the contents of an initrd

The __MAXMEM definition in linux/asm-i386/page.h:
/*
 * A __PAGE_OFFSET of 0xC0000000 means that the kernel has
 * a virtual address space of one gigabyte, which limits the
 * amount of physical memory you can use to about 950MB.
 */
#define __PAGE_OFFSET           (0xC0000000)

/*
 * This much address space is reserved for vmalloc() and iomap()
 * as well as fixmap mappings.
 */
#define __VMALLOC_RESERVE       (128 << 20)

#define __MAXMEM                (-__PAGE_OFFSET-__VMALLOC_RESERVE)
It gives __MAXMEM = 1G - 128M.
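
The arithmetic relies on 32-bit wraparound: negating 0xC0000000 modulo 2^32 gives 0x40000000 (1G). A small C sketch of the same computation follows; the names with a _SKETCH or _sketch suffix are hypothetical stand-ins for the kernel macros, used here only for illustration.

```c
#include <stdint.h>

#define PAGE_OFFSET_SKETCH      UINT32_C(0xC0000000)
#define VMALLOC_RESERVE_SKETCH  (UINT32_C(128) << 20)

/* Unsigned 32-bit subtraction wraps modulo 2^32,
 * so 0 - 0xC0000000 == 0x40000000 (1G). */
static uint32_t maxmem_sketch(void)
{
    return (uint32_t)(0u - PAGE_OFFSET_SKETCH - VMALLOC_RESERVE_SKETCH);
}

/* Value stored in the ramdisk_max header field (__MAXMEM-1). */
static uint32_t ramdisk_max_sketch(void)
{
    return maxmem_sketch() - 1u;
}
```

With these definitions maxmem_sketch() evaluates to 0x38000000, i.e. 896M, and ramdisk_max_sketch() to 0x37FFFFFF.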

The setup header must follow the layout defined in linux/Documentation/i386/boot.txt:
Offset  Proto   Name            Meaning
/Size
0200/2  2.00+   jump            Jump instruction
0202/4  2.00+   header          Magic signature "HdrS"
0206/2  2.00+   version         Boot protocol version supported
0208/4  2.00+   realmode_swtch  Boot loader hook
020C/2  2.00+   start_sys       The load-low segment (0x1000) (obsolete)
020E/2  2.00+   kernel_version  Pointer to kernel version string
0210/1  2.00+   type_of_loader  Boot loader identifier
0211/1  2.00+   loadflags       Boot protocol option flags
0212/2  2.00+   setup_move_size Move to high memory size (used with hooks)
0214/4  2.00+   code32_start    Boot loader hook
0218/4  2.00+   ramdisk_image   initrd load address (set by boot loader)
021C/4  2.00+   ramdisk_size    initrd size (set by boot loader)
0220/4  2.00+   bootsect_kludge DO NOT USE - for bootsect.S use only
0224/2  2.01+   heap_end_ptr    Free memory after setup end
0226/2  N/A     pad1            Unused
0228/4  2.02+   cmd_line_ptr    32-bit pointer to the kernel command line
022C/4  2.03+   initrd_addr_max Highest legal initrd address


4.2. Check Code Integrity

As setup code may not be contiguous, we should check code integrity first.
///////////////////////////////////////////////////////////////////////////////
trampoline()
{
        start_of_setup();       // never return
        .space 1024;
}

///////////////////////////////////////////////////////////////////////////////
// check signature to see if all code loaded
start_of_setup()
{
        // Bootlin depends on this being done early, check bootlin:technic.doc
        int13/AH=15h(AL=0, DL=0x81);
        // int13/AH=15h: DISK - GET DISK TYPE

#ifdef SAFE_RESET_DISK_CONTROLLER
        int13/AH=0(AL=0, DL=0x80);
        // int13/AH=00h: DISK - RESET DISK SYSTEM
#endif

        DS = CS;
        // check signature at end of setup
        if (setup_sig1!=SIG1 || setup_sig2!=SIG2) {
                goto bad_sig;
        }
        goto good_sig1;
}

///////////////////////////////////////////////////////////////////////////////
// some small functions
prtstr();  /* print asciiz string at DS:SI */
prtsp2();  /* print double space */
prtspc();  /* print single space */
prtchr();  /* print ascii in AL */
beep();    /* print CTRL-G, i.e. beep */
The signature is checked to verify code integrity.

If the signature is not found, the rest of the setup code may precede vmlinux at SYSSEG:0.
no_sig_mess: .string "No setup signature found ..."

good_sig1:
        goto good_sig;                          // make near jump

///////////////////////////////////////////////////////////////////////////////
// move the rest of the setup code from SYSSEG:0 to CS:0800
bad_sig()
        DELTA_INITSEG = 0x0020 (= SETUPSEG - INITSEG)
        SYSSEG = 0x1000
        word start_sys_seg = SYSSEG;            // defined in setup header
{
        DS = CS - DELTA_INITSEG;                // aka INITSEG
        BX = (byte)(DS:[497]);                  // i.e. setup_sects

        // first 4 sectors already loaded
        CX = (BX - 4) << 8;                     // rest code in word (2-bytes)
        start_sys_seg = (CX >> 3) + SYSSEG;     // real system code start
        move SYSSEG:0 to CS:0800 (CX*2 bytes);

        if (setup_sig1!=SIG1 || setup_sig2!=SIG2) {
no_sig:
                prtstr("No setup signature found ...");
no_sig_loop:
                hlt;
                goto no_sig_loop;
        }
}
"hlt" instruction stops instruction execution and places the processor in halt state. The processor generates a special bus cycle to indicate that halt mode has been entered. When an enabled interrupt (including NMI) is issued, the processor will resume execution after the "hlt" instruction, and the instruction pointer (CS:EIP), pointing to the instruction following the "hlt", will be saved to stack before the interrupt handler is called. Thus we need a "jmp" instruction after the "hlt" to put the processor back to halt state again.

The setup code has now been moved to the correct place. Variable start_sys_seg points to where the real system code starts. If "bad_sig" does not happen, start_sys_seg remains SYSSEG.
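
The arithmetic in bad_sig() can be followed with concrete numbers. This is a minimal sketch with hypothetical function names; it assumes setup_sects is at least 4, since the first 4 sectors are already in place.

```c
#include <stdint.h>

#define SYSSEG_SKETCH 0x1000u

/* Words of setup code still sitting at SYSSEG:0.
 * One 512-byte sector holds 256 words, hence the shift by 8. */
static uint16_t rest_setup_words(uint8_t setup_sects)
{
    return (uint16_t)((setup_sects - 4u) << 8);
}

/* Segment where the real system code starts.
 * words * 2 bytes / 16 bytes per paragraph == words >> 3. */
static uint16_t real_start_sys_seg(uint8_t setup_sects)
{
    return (uint16_t)((rest_setup_words(setup_sects) >> 3) + SYSSEG_SKETCH);
}
```

For example, with setup_sects = 10 there are 6 sectors (1536 words, 3072 bytes) to move, and the system code starts 0xC0 paragraphs later, at segment 0x10C0.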


4.3. Check Loader Type

Check if the loader is compatible with the image.
///////////////////////////////////////////////////////////////////////////////
good_sig()
        char loadflags;                 // in setup header
        char type_of_loader;            // in setup header
        LOADED_HIGH = 1
{
        DS = CS - DELTA_INITSEG;        // aka INITSEG
        if ( (loadflags & LOADED_HIGH) && !type_of_loader ) {
                // Nope, old loader tries to load big-kernel
                prtstr("Wrong loader, giving up...");
                goto no_sig_loop;       // defined above in bad_sig()
        }
}

loader_panic_mess: .string "Wrong loader, giving up..."
Note that type_of_loader has been changed to 0x20 by bootsect_helper() when it loads bvmlinux.
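
The compatibility rule reduces to a single predicate. The sketch below restates it in C; loader_is_compatible() is a hypothetical name, and the flag value is the one given in the setup header of Section 4.1.

```c
#include <stdint.h>

#define LOADED_HIGH 0x01    /* bit 0 of loadflags */

/* Return 1 if booting may continue,
 * 0 for the "Wrong loader, giving up..." case. */
static int loader_is_compatible(uint8_t loadflags, uint8_t type_of_loader)
{
    /* A big kernel (loaded high) needs a loader that identified itself. */
    return !((loadflags & LOADED_HIGH) && type_of_loader == 0);
}
```

So a zImage passes with any loader, while a bzImage passes only when type_of_loader is nonzero, for instance the 0x20 written by bootsect_helper().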


4.4. Get Memory Size

Try three different memory detection schemes to get the extended memory size (above 1M) in KB.

First, try e820h, which lets us assemble a memory map; then try e801h, which returns a 32-bit memory size; and finally 88h, which returns 0-64M.
///////////////////////////////////////////////////////////////////////////////
// get memory size
loader_ok()
        E820NR  = 0x1E8
        E820MAP = 0x2D0
{
        // when entering this function, DS = CS-DELTA_INITSEG, aka INITSEG
        (long)DS:[0x1E0] = 0;

#ifndef STANDARD_MEMORY_BIOS_CALL
        (byte)DS:[0x1E8] = 0;                   // E820NR

        /* method E820H: see ACPI spec
         * the memory map from hell.  e820h returns memory classified into
         * a whole bunch of different types, and allows memory holes and
         * everything.  We scan through this memory map and build a list
         * of the first 32 memory areas, which we return at [E820MAP]. */
meme820:
        EBX = 0;
        DI = 0x02D0;                            // E820MAP
        do {
jmpe820:
                int15/EAX=E820h(EDX='SMAP', EBX, ECX=20, ES:DI=DS:DI);
                // int15/AX=E820h: GET SYSTEM MEMORY MAP
                if (failed || 'SMAP'!=EAX) break;
                // if (1!=DS:[DI+16]) continue; // not usable
good820:
                if (DS:[0x1E8]>=32) break;      // entry# >= E820MAX
                DS:[0x1E8]++;                   // entry# ++;
                DI += 20;                       // adjust buffer for next
again820:
        } while (EBX);                          // EBX!=0: more entries follow
bail820:

        /* method E801H:
         * memory size is in 1k chunksizes, to avoid confusing loadlin.
         * we store the 0xe801 memory size in a completely different place,
         * because it will most likely be longer than 16 bits.
         * (use 1e0 because that's what Larry Augustine uses in his
         * alternative new memory detection scheme, and it's sensible
         * to write everything into the same place.) */
meme801:
        stc;            // to work around buggy BIOSes
        CX = DX = 0;
        int15/AX=E801h;
        /* int15/AX=E801h: GET MEMORY SIZE FOR >64M CONFIGURATIONS
         *   AX = extended memory between 1M and 16M, in K (max 3C00 = 15MB)
         *   BX = extended memory above 16M, in 64K blocks
         *   CX = configured memory 1M to 16M, in K
         *   DX = configured memory above 16M, in 64K blocks */
        if (failed) goto mem88;
        if (!CX && !DX) {
                CX = AX;
                DX = BX;
        }
e801usecxdx:
        (long)DS:[0x1E0] = ((EDX & 0xFFFF) << 6) + (ECX & 0xFFFF);      // in K
#endif

mem88:  // old traditional method
        int15/AH=88h;
        /* int15/AH=88h: SYSTEM - GET EXTENDED MEMORY SIZE
         *   AX = number of contiguous KB starting at absolute address 100000h */
        DS:[2] = AX;
}
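
To make the stored results concrete, here is a hedged C sketch of how a later consumer might interpret them. The function names are hypothetical. It assumes the 20-byte E820 entry layout requested above (ECX=20): a 64-bit base, a 64-bit length, then a 32-bit type at offset 16, with type 1 meaning usable memory as hinted by the commented-out check in the listing.

```c
#include <stdint.h>

/* Read a little-endian 64-bit value byte by byte. */
static uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Sum the lengths of all usable (type 1) entries in an E820 map
 * that holds nr entries of 20 bytes each. */
static uint64_t e820_usable_bytes(const uint8_t *map, unsigned nr)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < nr; i++) {
        const uint8_t *e = map + 20u * i;
        uint32_t type = (uint32_t)e[16] | ((uint32_t)e[17] << 8) |
                        ((uint32_t)e[18] << 16) | ((uint32_t)e[19] << 24);
        if (type == 1)
            total += read_le64(e + 8);  /* length field */
    }
    return total;
}

/* E801 result in KB: CX counts 1K units, DX counts 64K blocks. */
static uint32_t e801_size_kb(uint16_t cx, uint16_t dx)
{
    return ((uint32_t)dx << 6) + cx;
}
```

For instance, CX = 0x3C00 (15M below 16M) and DX = 0x0100 (256 blocks of 64K) combine to 15360 + 16384 = 31744 KB.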


4.5. Hardware Support

Check hardware support, like keyboard, video adapter, hard disk, MCA bus and pointing device.
{
        // set the keyboard repeat rate to the max
        int16/AX=0305h(BX=0);
        // int16/AH=03h: KEYBOARD - SET TYPEMATIC RATE AND DELAY

        /* Check for video adapter and its parameters and
         *   allow the user to browse video modes. */
        video();                        // see video.S

        // get hd0 and hd1 data
        copy hd0 data (*int41) to CS-DELTA_INITSEG:0080 (16 bytes);
        // int41: SYSTEM DATA - HARD DISK 0 PARAMETER TABLE ADDRESS
        copy hd1 data (*int46) to CS-DELTA_INITSEG:0090 (16 bytes);
        // int46: SYSTEM DATA - HARD DISK 1 PARAMETER TABLE ADDRESS
        // check if hd1 exists
        int13/AH=15h(AL=0, DL=0x81);
        // int13/AH=15h: DISK - GET DISK TYPE
        if (failed || AH!=03h) {        // AH==03h if it is a hard disk
no_disk1:
                clear CS-DELTA_INITSEG:0090 (16 bytes);
        }
is_disk1:

        // check for Micro Channel (MCA) bus
        CS-DELTA_INITSEG:[0xA0] = 0;    // set table length to 0
        int15/AH=C0h;
        /* int15/AH=C0h: SYSTEM - GET CONFIGURATION
         *   ES:BX = ROM configuration table */
        if (failed) goto no_mca;
        move ROM configuration table (ES:BX) to CS-DELTA_INITSEG:00A0;
        // copy length CX = min(table length + 2, 16); first 16 bytes only
no_mca:

        // check for PS/2 pointing device
        CS-DELTA_INITSEG:[0x1FF] = 0;   // default is no pointing device
        int11h();
        // int11h: BIOS - GET EQUIPMENT LIST
        if (AL & 0x04) {                // mouse installed
                DS:[0x1FF] = 0xAA;
        }
}


4.6. APM Support

Check BIOS APM support.
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
{
        DS:[0x40] = 0;                  // version = 0 means no APM BIOS
        int15/AX=5300h(BX=0);
        // int15/AX=5300h: Advanced Power Management v1.0+ - INSTALLATION CHECK
        if (failed || 'PM'!=BX || !(CX & 0x02)) goto done_apm_bios;
        // (CX & 0x02) means 32 bit is supported
        int15/AX=5304h(BX=0);
        // int15/AX=5304h: Advanced Power Management v1.0+ - DISCONNECT INTERFACE
        EBX = CX = DX = ESI = DI = 0;
        int15/AX=5303h(BX=0);
        /* int15/AX=5303h: Advanced Power Management v1.0+
         *   - CONNECT 32-BIT PROTMODE INTERFACE */
        if (failed) {
no_32_apm_bios:                         // I moved label no_32_apm_bios here
                DS:[0x4C] &= ~0x0002;   // remove 32 bit support bit
                goto done_apm_bios;
        }
        DS:[0x42] = AX, 32-bit code segment base address;
        DS:[0x44] = EBX, offset of entry point;
        DS:[0x48] = CX, 16-bit code segment base address;
        DS:[0x4A] = DX, 16-bit data segment base address;
        DS:[0x4E] = ESI, APM BIOS code segment length;
        DS:[0x52] = DI, APM BIOS data segment length;
        int15/AX=5300h(BX=0);           // check again
        // int15/AX=5300h: Advanced Power Management v1.0+ - INSTALLATION CHECK
        if (success &&  'PM'==BX) {
                DS:[0x40] = AX, APM version;
                DS:[0x4C] = CX, APM flags;
        } else {
apm_disconnect:
                int15/AX=5304h(BX=0);
                /* int15/AX=5304h: Advanced Power Management v1.0+
                 *   - DISCONNECT INTERFACE */
        }
done_apm_bios:
}
#endif


4.7. Prepare for Protected Mode

// call mode switch
{
        if (realmode_swtch) {
                realmode_swtch();               // mode switch hook
        } else {
rmodeswtch_normal:
                default_switch() {
                        cli;                    // no interrupts allowed
                        outb(0x80, 0x70);       // disable NMI
                }
        }
rmodeswtch_end:
}

// relocate code if necessary
{
        (long)code32 = code32_start;
        if (!(loadflags & LOADED_HIGH)) {       // low loaded zImage
                // 0x0100 <= start_sys_seg < CS-DELTA_INITSEG
do_move0:
                AX = 0x100;
                BP = CS - DELTA_INITSEG;        // aka INITSEG
                BX = start_sys_seg;
do_move:
                move system image from (start_sys_seg:0 .. CS-DELTA_INITSEG:0)
                        to 0100:0;              // move 0x1000 bytes each time
        }
end_move:
Note that code32_start is initialized to 0x1000 for zImage, or 0x100000 for bzImage. The code32 value is used when control is passed to linux/arch/i386/boot/compressed/head.S in Section 4.9. When booting a zImage, vmlinux is relocated to 0100:0; when booting a bzImage, bvmlinux remains at start_sys_seg:0. The relocation address must match the "-Ttext" option in linux/arch/i386/boot/compressed/Makefile. See Section 2.5.
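
The segment:offset pairs used here map to linear addresses by the usual real-mode rule. A minimal sketch (seg_to_linear() is a hypothetical helper) shows why the zImage destination 0100:0 matches the default code32_start of 0x1000:

```c
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset. */
static uint32_t seg_to_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

With this rule, 0100:0 is linear 0x1000, SYSSEG:0 (1000:0) is 0x10000, INITSEG:0 (9000:0) is 0x90000 and SETUPSEG:0 (9020:0) is 0x90200.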

Then it will relocate code from CS-DELTA_INITSEG:0 (bbootsect and bsetup) to INITSEG:0, if necessary.
        DS = CS;                // aka SETUPSEG
        // Check whether we need to be downward compatible with version <=201
        if (!cmd_line_ptr && 0x20!=type_of_loader && SETUPSEG!=CS) {
                cli;            // as interrupt may use stack when we are moving
                // store new SS in DX
                AX = CS - DELTA_INITSEG;
                DX = SS;
                if (DX>=AX) {   // stack frame will be moved together
                        DX = DX + INITSEG - AX; // i.e. SS-CS+SETUPSEG
                }
move_self_1:
                /* move CS-DELTA_INITSEG:0 to INITSEG:0 (setup_move_size bytes)
                 *   in two steps in order not to overwrite code on CS:IP
                 * move up (src < dest) but downward ("std") */
                move CS-DELTA_INITSEG:move_self_here+0x200
                  to INITSEG:move_self_here+0x200,
                  setup_move_size-(move_self_here+0x200) bytes;
                // INITSEG:move_self_here+0x200 == SETUPSEG:move_self_here
                goto SETUPSEG:move_self_here;   // CS=SETUPSEG now
move_self_here:
                move CS-DELTA_INITSEG:0 to INITSEG:0,
                  move_self_here+0x200 bytes;   // I mean old CS before goto
                DS = SETUPSEG;
                SS = DX;
        }
end_move_self:
}
Note again, type_of_loader has been changed to 0x20 by bootsect_helper() when it loads bvmlinux.
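
The stack segment adjustment in the listing above can be isolated as follows. This is a sketch under the assumption that the block being moved starts at CS-DELTA_INITSEG; relocated_ss() is a hypothetical name, and the constants mirror INITSEG and DELTA_INITSEG from Section 4.1.

```c
#include <stdint.h>

#define INITSEG_SKETCH        0x9000u
#define DELTA_INITSEG_SKETCH  0x0020u

/* Compute the SS value to use after the code has been moved to INITSEG:0. */
static uint16_t relocated_ss(uint16_t cs, uint16_t ss)
{
    uint16_t old_initseg = (uint16_t)(cs - DELTA_INITSEG_SKETCH);

    if (ss >= old_initseg)      /* stack frame moves together with the code */
        return (uint16_t)(ss + INITSEG_SKETCH - old_initseg);
    return ss;                  /* stack lies below the block: keep it */
}
```

For a loader that placed setup at 8020:0 with SS = 0x8000, the new SS is 0x9000, i.e. SS-CS+SETUPSEG; a stack at SS = 0x7000 is left untouched.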


4.8. Enable A20

For the A20 problem and its solutions, refer to "A20 - a pain from the past".
        A20_TEST_LOOPS          =  32   # Iterations per wait
        A20_ENABLE_LOOPS        = 255   # Total loops to try
{
#if defined(CONFIG_MELAN)
        // Enable A20. AMD Elan bug fix.
        outb(0x02, 0x92);               // outb(val, port)
a20_elan_wait:
        while (!a20_test());            // test not passed
        goto a20_done;
#endif

a20_try_loop:
        // First, see if we are on a system with no A20 gate.
a20_none:
        if (a20_test()) goto a20_done;  // test passed

        // Next, try the BIOS (INT 0x15, AX=0x2401)
a20_bios:
        int15/AX=2401h;
        // Int15/AX=2401h: SYSTEM - later PS/2s - ENABLE A20 GATE
        if (a20_test()) goto a20_done;  // test passed

        // Try enabling A20 through the keyboard controller
a20_kbc:
        empty_8042();
        if (a20_test()) goto a20_done;  // test again in case BIOS delayed
        outb(0xD1, 0x64);               // command write
        empty_8042();
        outb(0xDF, 0x60);               // A20 on
        empty_8042();
        // wait until a20 really *is* enabled
a20_kbc_wait:
        CX = 0;
a20_kbc_wait_loop:
        do {
                if (a20_test()) goto a20_done;  // test passed
        } while (--CX)

        // Final attempt: use "configuration port A"
        outb((inb(0x92) | 0x02) & 0xFE, 0x92);
        // wait for configuration port A to take effect
a20_fast_wait:
        CX = 0;
a20_fast_wait_loop:
        do {
                if (a20_test()) goto a20_done;  // test passed
        } while (--CX)

        // A20 is still not responding. Try frobbing it again.
        if (--a20_tries) goto a20_try_loop;
        prtstr("linux: fatal error: A20 gate not responding!");
a20_die:
        hlt;
        goto a20_die;
}

a20_tries:
        .byte   A20_ENABLE_LOOPS                // i.e. 255
a20_err_msg:
        .ascii  "linux: fatal error: A20 gate not responding!"
        .byte   13, 10, 0
For I/O port operations, take a look at related reference materials in Section 4.11.
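
The test performed by a20_test() (FS:[0x200] against GS:[0x210] with FS=0 and GS=0xFFFF) works because the two addresses alias each other only while address line 20 is forced low. The following C sketch models that aliasing; the function names are hypothetical and the model ignores everything except bit 20 of the address.

```c
#include <stdint.h>

/* Physical address seen on the bus for seg:off, with or without A20. */
static uint32_t phys_addr_a20(uint16_t seg, uint16_t off, int a20_enabled)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;

    if (!a20_enabled)
        linear &= ~(UINT32_C(1) << 20);     /* address line 20 forced to 0 */
    return linear;
}

/* Value written to port 0x92 in the "configuration port A" attempt:
 * set bit 1 (A20) and keep bit 0 (fast reset) clear. */
static uint8_t fast_a20_value(uint8_t old)
{
    return (uint8_t)((old | 0x02) & 0xFE);
}
```

With A20 disabled, FFFF:0210 wraps to 0x000200, the same location as 0000:0200, so a write through FS shows up through GS; with A20 enabled it refers to 0x100200 and the two reads differ.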


4.9. Switch to Protected Mode

To ensure code compatibility with all 32-bit IA-32 processors, perform the following steps to switch to protected mode:

  1. Prepare GDT with a null descriptor in the first GDT entry, one code segment descriptor and one data segment descriptor;

  2. Disable interrupts, including maskable hardware interrupts and NMI;

  3. Load the base address and limit of the GDT to GDTR register, using "lgdt" instruction;

  4. Set PE flag in CR0 register, using "mov cr0" (Intel 386 and up) or "lmsw" instruction (for compatibility with Intel 286);

  5. Immediately execute a far "jmp" or a far "call" instruction.

The stack can be placed in a normal read/write data segment, so no dedicated descriptor is required.
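
Step 1 is easier to follow with a decoded descriptor. The sketch below is an illustration, not kernel code: the struct and function names are hypothetical, and the input is a descriptor given as four 16-bit words in the order used by the gdt listing in Section 4.10 (for example 0xFFFF, 0, 0x9A00, 0x00CF for __KERNEL_CS).

```c
#include <stdint.h>

struct seg_desc_sketch {
    uint32_t base;          /* 32-bit segment base */
    uint32_t limit;         /* raw 20-bit limit */
    uint8_t  access;        /* P, DPL, S and type bits */
    int      granular_4k;   /* G flag: limit counted in 4K pages */
    int      is_32bit;      /* D/B flag */
};

/* Decode one 8-byte descriptor given as four little-endian words. */
static struct seg_desc_sketch decode_desc(const uint16_t w[4])
{
    struct seg_desc_sketch d;

    d.limit = w[0] | ((uint32_t)(w[3] & 0x000F) << 16);
    d.base  = w[1] | ((uint32_t)(w[2] & 0x00FF) << 16)
                   | ((uint32_t)(w[3] & 0xFF00) << 16);
    d.access      = (uint8_t)(w[2] >> 8);
    d.granular_4k = (w[3] >> 7) & 1;
    d.is_32bit    = (w[3] >> 6) & 1;
    return d;
}

/* Highest valid offset inside the segment. */
static uint32_t desc_last_offset(const struct seg_desc_sketch *d)
{
    return d->granular_4k ? ((d->limit << 12) | 0xFFFu) : d->limit;
}

/* GDT index selected by a segment selector (bits 3..15). */
static unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}
```

Decoding the __KERNEL_CS words yields base 0, raw limit 0xFFFFF with 4K granularity (last offset 0xFFFFFFFF, i.e. 4G) and access byte 0x9A; selector 0x10 picks GDT entry 2 and 0x18 picks entry 3.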

a20_done:
{
        lidt    idt_48;         // load idt with 0, 0;

        // convert DS:gdt to a linear ptr
        *(long*)(gdt_48+2) = (DS << 4) + &gdt;
        lgdt    gdt_48;

        // reset coprocessor
        outb(0, 0xF0);
        delay();
        outb(0, 0xF1);
        delay();

        // reprogram the interrupts
        outb(0xFF, 0xA1);       // mask all interrupts
        delay();
        outb(0xFB, 0x21);       // mask all irq's but irq2 which is cascaded

        // protected mode!
        AX = 1;
        lmsw ax;                // machine status word, bit 0 thru 15 of CR0
                                // only affects PE, MP, EM & TS flags
        goto flush_instr;

flush_instr:
        BX = 0;                                 // flag to indicate a boot
        ESI = (CS - DELTA_INITSEG) << 4;        // pointer to real-mode code
        /* NOTE: For high loaded big kernels we need a
         * jmpi    0x100000,__KERNEL_CS
         *
         * but we yet haven't reloaded the CS register, so the default size
         * of the target offset still is 16 bit.
         * However, using an operand prefix (0x66), the CPU will properly
         * take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
         * Manual, Mixing 16-bit and 32-bit code, page 16-6) */

        // goto __KERNEL_CS:[(uint32*)code32];
        .byte   0x66, 0xea
code32: .long   0x1000          // overwritten in Section 4.7
        .word   __KERNEL_CS     // segment 0x10
        // see linux/arch/i386/boot/compressed/head.S:startup_32
}
The far "jmp" instruction (0xea) updates the CS register. The contents of the remaining segment registers (DS, SS, ES, FS and GS) should be reloaded later. The operand-size prefix (0x66) forces the "jmp" to take the 32-bit offset stored in code32, even though the code segment is still a 16-bit one at this point. For operand-size prefix details, check IA-32 Manual (Vol.1. Ch.3.6. Operand-size and Address-size Attributes, and Vol.3. Ch.17. Mixing 16-bit and 32-bit Code).
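
The hand-assembled instruction is just eight bytes: the prefix, the opcode, a 32-bit offset and a 16-bit selector, all little-endian. A minimal sketch that builds the same byte sequence follows; encode_jmpi32() is a hypothetical helper written for this illustration.

```c
#include <stdint.h>

/* Emit "0x66 0xEA offset32 selector16", i.e. a far jmp with a
 * 48-bit pointer, as setup.S spells it out with .byte/.long/.word. */
static void encode_jmpi32(uint8_t out[8], uint32_t offset, uint16_t selector)
{
    out[0] = 0x66;                              /* operand-size prefix */
    out[1] = 0xEA;                              /* far jmp opcode */
    out[2] = (uint8_t)(offset & 0xFF);          /* code32, low byte first */
    out[3] = (uint8_t)((offset >> 8) & 0xFF);
    out[4] = (uint8_t)((offset >> 16) & 0xFF);
    out[5] = (uint8_t)((offset >> 24) & 0xFF);
    out[6] = (uint8_t)(selector & 0xFF);        /* __KERNEL_CS */
    out[7] = (uint8_t)(selector >> 8);
}
```

For a bzImage, encode_jmpi32(buf, 0x100000, 0x10) produces the bytes 66 EA 00 00 10 00 10 00.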

Control is passed to linux/arch/i386/boot/compressed/head.S:startup_32. For zImage, it is at address 0x1000; for bzImage, it is at 0x100000. See Section 5.

ESI points to the memory area of collected system data. It is used to pass parameters from the 16-bit real mode code of the kernel to the 32-bit part. See linux/Documentation/i386/zero-page.txt for details.

For mode switching details, refer to IA-32 Manual Vol.3. (Ch.9.8. Software Initialization for Protected-Mode Operation, Ch.9.9.1. Switching to Protected Mode, and Ch.17.4. Transferring Control Among Mixed-Size Code Segments).


4.10. Miscellaneous

The rest are supporting functions and variables.
/* macros created by linux/Makefile targets:
 *   include/linux/compile.h and include/linux/version.h */
kernel_version: .ascii  UTS_RELEASE
                .ascii  " ("
                .ascii  LINUX_COMPILE_BY
                .ascii  "@"
                .ascii  LINUX_COMPILE_HOST
                .ascii  ") "
                .ascii  UTS_VERSION
                .byte   0

///////////////////////////////////////////////////////////////////////////////
default_switch() { cli; outb(0x80, 0x70); } /* disable interrupts and NMI */
bootsect_helper(ES:BX); /* see Section 3.7 */

///////////////////////////////////////////////////////////////////////////////
a20_test()
{
        FS = 0;
        GS = 0xFFFF;
        CX = A20_TEST_LOOPS;                    // i.e. 32
        AX = FS:[0x200];
        do {
a20_test_wait:
                FS:[0x200] = ++AX;
                delay();
        } while (AX==GS:[0x210] && --CX);
        return (AX!=GS:[0x210]);
        // ZF==0 (i.e. NZ/NE, a20_test!=0) means test passed
}

///////////////////////////////////////////////////////////////////////////////
// check that the keyboard command queue is empty
empty_8042()
{
        int timeout = 100000;

        for (;;) {
empty_8042_loop:
                if (!--timeout) return;
                delay();
                inb(0x64, &AL);                 // 8042 status port
                if (AL & 1) {                   // has output
                        delay();
                        inb(0x60, &AL);         // read it
no_output:      } else if (!(AL & 2)) return;   // no input either
        }
}

///////////////////////////////////////////////////////////////////////////////
// read the CMOS clock, return the seconds in AL, used in video.S
gettime()
{
        int1A/AH=02h();
        /* int1A/AH=02h: TIME - GET REAL-TIME CLOCK TIME
         * DH = seconds in BCD */
        AL = DH & 0x0F;
        AH = DH >> 4;
        aad;
}

///////////////////////////////////////////////////////////////////////////////
delay() { outb(AL, 0x80); }                     // needed after doing I/O

// Descriptor table
gdt:
        .word   0, 0, 0, 0                      # dummy
        .word   0, 0, 0, 0                      # unused
        // segment 0x10, __KERNEL_CS
        .word   0xFFFF                          # 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0                               # base address = 0
        .word   0x9A00                          # code read/exec
        .word   0x00CF                          # granularity = 4096, 386
                                                #  (+5th nibble of limit)
        // segment 0x18, __KERNEL_DS
        .word   0xFFFF                          # 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0                               # base address = 0
        .word   0x9200                          # data read/write
        .word   0x00CF                          # granularity = 4096, 386
                                                #  (+5th nibble of limit)
idt_48:
        .word   0                               # idt limit = 0
        .word   0, 0                            # idt base = 0L
/* [gdt_48] should be 0x0800 (2048) to match the comment,
 *   like what Linux 2.2.22 does. */
gdt_48:
        .word   0x8000                          # gdt limit=2048,
                                                #  256 GDT entries
        .word   0, 0                            # gdt base (filled in later)

#include "video.S"

// signature at the end of setup.S:
{
setup_sig1:     .word   SIG1                    // 0xAA55
setup_sig2:     .word   SIG2                    // 0x5A5A
modelist:
}
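
The gettime() routine listed above converts the BCD seconds in DH to binary by splitting the two digits and letting "aad" combine them. An equivalent C sketch, with a hypothetical function name, is:

```c
#include <stdint.h>

/* Convert packed BCD seconds (as returned in DH) to binary. */
static uint8_t bcd_seconds_to_binary(uint8_t dh)
{
    uint8_t al = dh & 0x0F;             /* units digit */
    uint8_t ah = dh >> 4;               /* tens digit */

    return (uint8_t)(ah * 10 + al);     /* what "aad" leaves in AL */
}
```

For example, a BCD value of 0x59 becomes 59 and 0x07 becomes 7.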

Video setup and detection code in video.S:
ASK_VGA = 0xFFFD  // defined in linux/include/asm-i386/boot.h
///////////////////////////////////////////////////////////////////////////////
video()
{
        pushw DS;               // use different segments
        FS = DS;
        DS = ES = CS;
        GS = 0;
        cld;
        basic_detect();         // basic adapter type testing (EGA/VGA/MDA/CGA)
#ifdef CONFIG_VIDEO_SELECT
        if (FS:[0x01FA]!=ASK_VGA) {     // user selected video mode
                mode_set();
                if (failed) {
                        prtstr("You passed an undefined mode number.\n");
                        mode_menu();
                }
        } else {
vid2:           mode_menu();
        }
vid1:
#ifdef CONFIG_VIDEO_RETAIN
        restore_screen();               // restore screen contents
#endif /* CONFIG_VIDEO_RETAIN */
#endif /* CONFIG_VIDEO_SELECT */
        mode_params();                  // store mode parameters
        popw DS;                        // restore original DS
}
/* TODO: video() details */


4.11. Reference


5. linux/arch/i386/boot/compressed/head.S

We are in bvmlinux now! With the help of misc.c:decompress_kernel(), we are going to decompress piggy.o to get the resident kernel image linux/vmlinux.

This file contains pure 32-bit startup code. Unlike the previous two files, it has no ".code16" statement in the source. Refer to Using as: Writing 16-bit Code for details.


5.1. Decompress Kernel

The segment base addresses in segment descriptors (which correspond to segment selector __KERNEL_CS and __KERNEL_DS) are equal to 0; therefore, the logical address offset (in segment:offset format) will be equal to its linear address if either of these segment selectors is used. For zImage, CS:EIP is at logical address 10:1000 (linear address 0x1000) now; for bzImage, 10:100000 (linear address 0x100000).

As paging is not enabled, the linear address is identical to the physical address. Check IA-32 Manual (Vol.1. Ch.3.3. Memory Organization, and Vol.3. Ch.3. Protected-Mode Memory Management) and Linux Device Drivers: Memory Management in Linux for addressing details.

On entry, setup.S has left BX=0 and ESI=INITSEG<<4.

.text
///////////////////////////////////////////////////////////////////////////////
startup_32()
{
        cld;
        cli;
        DS = ES = FS = GS = __KERNEL_DS;
        SS:ESP = *stack_start;  // end of user_stack[], defined in misc.c
        // all segment registers are reloaded after protected mode is enabled

        // check that A20 really IS enabled
        EAX = 0;
        do {
1:              DS:[0] = ++EAX;
        } while (DS:[0x100000]==EAX);

        EFLAGS = 0;
        clear BSS;                              // from _edata to _end

        struct moveparams mp;                   // subl $16,%esp
        if (!decompress_kernel(&mp, ESI)) {     // return value in EAX
                restore ESI from stack;
                EBX = 0;
                goto __KERNEL_CS:100000;
                // see linux/arch/i386/kernel/head.S:startup_32
        }

        /*
         * We come here, if we were loaded high.
         * We need to move the move-in-place routine down to 0x1000
         * and then start it with the buffer addresses in registers,
         * which we got from the stack.
         */
3:      move move_routine_start..move_routine_end to 0x1000;
        // move_routine_start & move_routine_end are defined below

        // prepare move_routine_start() parameters
        EBX = real mode pointer;        // ESI value passed from setup.S
        ESI = mp.low_buffer_start;
        ECX = mp.lcount;
        EDX = mp.high_buffer_start;
        EAX = mp.hcount;
        EDI = 0x100000;
        cli;                    // make sure we don't get interrupted
        goto __KERNEL_CS:1000;  // move_routine_start();
}

/* Routine (template) for moving the decompressed kernel in place,
 * if we were high loaded. This _must_ PIC-code ! */
///////////////////////////////////////////////////////////////////////////////
move_routine_start()
{
        move mp.low_buffer_start to 0x100000, mp.lcount bytes,
          in two steps: (lcount >> 2) words + (lcount & 3) bytes;
        move/append mp.high_buffer_start, ((mp.hcount + 3) >> 2) words
        // 1 word == 4 bytes, as I mean 32-bit code/data.

        ESI = EBX;              // real mode pointer, as that from setup.S
        EBX = 0;
        goto __KERNEL_CS:100000;
        // see linux/arch/i386/kernel/head.S:startup_32()
move_routine_end:
}
For the meaning of "je 1b" and "jnz 3f", refer to Using as: Local Symbol Names.
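The A20 check loop in startup_32() can be modelled in user space. The sketch below is a toy, not kernel code: struct toy_mem, toy_cell() and a20_probe() are hypothetical names, memory is reduced to two dwords, and the loop is bounded by max_tries where the real code spins forever. It only illustrates why addresses 0 and 0x100000 alias while the A20 line is disabled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-cell memory: one dword at 0x000000, one at 0x100000. */
struct toy_mem {
    uint32_t cell[2];
    int a20_enabled;
};

static uint32_t *toy_cell(struct toy_mem *m, uint32_t addr)
{
    if (!m->a20_enabled)
        addr &= ~(UINT32_C(1) << 20);   /* A20 forced low: address bit 20 is dropped */
    assert(addr == 0 || addr == 0x100000);
    return &m->cell[addr >> 20];
}

/* Return the number of probes needed to see the two addresses differ,
 * or 0 if they still matched after max_tries probes. */
static unsigned a20_probe(struct toy_mem *m, unsigned max_tries)
{
    uint32_t eax = 0;
    unsigned tries;

    for (tries = 1; tries <= max_tries; tries++) {
        *toy_cell(m, 0) = ++eax;                 /* DS:[0] = ++EAX */
        if (*toy_cell(m, 0x100000) != eax)       /* DS:[0x100000] differs: A20 is on */
            return tries;
    }
    return 0;
}
```

Incrementing EAX on every pass matters: if the dword at 0x100000 happened to hold the first probe value by coincidence, the next probe value would no longer match it.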

Didn't find _edata and _end definitions? No problem, they are defined in the "internal linker script". Without -T (--script=) option specified, ld uses this builtin script to link compressed/bvmlinux. Use "ld --verbose" to display this script, or check Appendix B. Internal Linker Script.

Refer to Using LD, the GNU linker: Command Line Options for -T (--script=), -L (--library-path=) and --verbose option description. "man ld" and "info ld" may help too.

piggy.o has been unzipped and control is passed to __KERNEL_CS:100000, i.e. linux/arch/i386/kernel/head.S:startup_32(). See Section 6.
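The two-step copy used by move_routine_start() can be sketched in C as follows. This is an illustration under assumptions: the function names are made up, and memcpy() stands in for the 32-bit string moves so that the sketch stays valid C on any alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy count bytes as (count >> 2) 32-bit words followed by (count & 3) bytes. */
static void move_words_then_bytes(unsigned char *dst, const unsigned char *src,
                                  size_t count)
{
    size_t words = count >> 2;
    size_t tail = count & 3;
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < words; i++) {
        uint32_t w;
        memcpy(&w, src + 4 * i, sizeof w);      /* one 4-byte word per step */
        memcpy(dst + 4 * i, &w, sizeof w);
    }
    for (i = 0; i < tail; i++)                  /* remaining 0..3 bytes */
        dst[4 * words + i] = src[4 * words + i];
}

/* Bytes actually touched when a count is rounded up to whole words,
 * as done for the high buffer: ((hcount + 3) >> 2) words. */
static size_t rounded_up_bytes(size_t count)
{
    return ((count + 3) >> 2) << 2;
}
```

The low buffer is copied exactly, while the high buffer copy is rounded up to whole words and may therefore move up to three bytes more than hcount.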

#define LOW_BUFFER_START      0x2000
#define LOW_BUFFER_MAX       0x90000
#define HEAP_SIZE             0x3000
///////////////////////////////////////////////////////////////////////////////
asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
|-- setup real_mode(=rmode), vidmem, vidport, lines and cols;
|-- if (is_zImage) setup_normal_output_buffer() {
|       output_data      = 0x100000;
|       free_mem_end_ptr = real_mode;
|   } else (is_bzImage) setup_output_buffer_if_we_run_high(mv) {
|       output_data      = LOW_BUFFER_START;
|       low_buffer_end   = MIN(real_mode, LOW_BUFFER_MAX) & ~0xfff;
|       low_buffer_size  = low_buffer_end - LOW_BUFFER_START;
|       free_mem_end_ptr = &end + HEAP_SIZE;
|       // get mv->low_buffer_start and mv->high_buffer_start
|       mv->low_buffer_start = LOW_BUFFER_START;
|       /* To make this program work, we must have
|        *   high_buffer_start > &end+HEAP_SIZE;
|        * As we will move low_buffer from LOW_BUFFER_START to 0x100000
|        *   (max low_buffer_size bytes) finally, we should have
|        *   high_buffer_start > 0x100000+low_buffer_size; */
|       mv->high_buffer_start = high_buffer_start
|           = MAX(&end+HEAP_SIZE, 0x100000+low_buffer_size);
|       mv->hcount =  0 if (0x100000+low_buffer_size >  &end+HEAP_SIZE);
|                  = -1 if (0x100000+low_buffer_size <= &end+HEAP_SIZE);
|       /* mv->hcount==0 : we need not move high_buffer later,
|        *   as it is already at 0x100000+low_buffer_size.
|        * Used by close_output_buffer_if_we_run_high() below. */
|   }
|-- makecrc();          // create crc_32_tab[]
|   puts("Uncompressing Linux... ");
|-- gunzip();
|   puts("Ok, booting the kernel.\n");
|-- if (is_bzImage) close_output_buffer_if_we_run_high(mv) {
|       // get mv->lcount and mv->hcount
|       if (bytes_out > low_buffer_size) {
|           mv->lcount = low_buffer_size;
|           if (mv->hcount)
|               mv->hcount = bytes_out - low_buffer_size;
|       } else {
|           mv->lcount = bytes_out;
|           mv->hcount = 0;
|       }
|   }
`-- return is_bzImage;  // return value in EAX
end is defined in the "internal linker script" too.
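The buffer bookkeeping for the bzImage case can be checked with plain integers. The sketch below mirrors the outline above but is not the misc.c source: struct moveparams_model, setup_high() and close_high() are hypothetical stand-ins, addresses are passed as numbers, and the sample values in the note that follows (real_mode at 0x90000, and the values chosen for &end and bytes_out) are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define LOW_BUFFER_START 0x2000u
#define LOW_BUFFER_MAX   0x90000u
#define HEAP_SIZE        0x3000u

struct moveparams_model {           /* integer-only stand-in for struct moveparams */
    uint32_t low_buffer_start;
    int32_t  lcount;
    uint32_t high_buffer_start;
    int32_t  hcount;
};

static uint32_t low_buffer_size;

/* real_mode and end_addr are plain addresses here; in misc.c they are a pointer and &end. */
static void setup_high(struct moveparams_model *mv, uint32_t real_mode, uint32_t end_addr)
{
    uint32_t low_buffer_end =
        (real_mode > LOW_BUFFER_MAX ? LOW_BUFFER_MAX : real_mode) & ~0xfffu;
    uint32_t heap_end = end_addr + HEAP_SIZE;

    assert(mv != 0);
    low_buffer_size = low_buffer_end - LOW_BUFFER_START;
    mv->low_buffer_start = LOW_BUFFER_START;
    if (0x100000u + low_buffer_size > heap_end) {
        mv->high_buffer_start = 0x100000u + low_buffer_size;
        mv->hcount = 0;             /* high buffer is already in its final place */
    } else {
        mv->high_buffer_start = heap_end;
        mv->hcount = -1;            /* high buffer has to be moved later */
    }
}

static void close_high(struct moveparams_model *mv, uint32_t bytes_out)
{
    assert(mv != 0);
    if (bytes_out > low_buffer_size) {
        mv->lcount = (int32_t)low_buffer_size;
        if (mv->hcount)
            mv->hcount = (int32_t)(bytes_out - low_buffer_size);
    } else {
        mv->lcount = (int32_t)bytes_out;
        mv->hcount = 0;
    }
}
```

With real_mode = 0x90000, low_buffer_size is 0x8e000, so the low buffer will end at 0x18e000 once it has been moved to 0x100000. If &end+HEAP_SIZE lies below that address, the high buffer starts at 0x18e000 and hcount stays 0; otherwise it starts at &end+HEAP_SIZE and hcount becomes bytes_out - low_buffer_size.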

decompress_kernel() has an "asmlinkage" modifier. In linux/include/linux/linkage.h:
#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
#define CPP_ASMLINKAGE
#endif

#if defined __i386__
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
#elif defined __ia64__
#define asmlinkage CPP_ASMLINKAGE __attribute__((syscall_linkage))
#else
#define asmlinkage CPP_ASMLINKAGE
#endif
The "asmlinkage" macro forces the compiler to pass all function arguments on the stack, in case some optimization would otherwise change this convention. Check Using the GNU Compiler Collection (GCC): Declaring Attributes of Functions (regparm) and Kernelnewbies FAQ: What is asmlinkage for more details.


5.2. gunzip()

decompress_kernel() calls gunzip() -> inflate(), which are defined in linux/lib/inflate.c, to decompress the resident kernel image into the low buffer (pointed to by output_data) and the high buffer (pointed to by high_buffer_start, for bzImage only).

The gzip file format is specified in RFC 1952.

Table 6. gzip file format

Component          Meaning             Bytes  Comment
ID1                IDentification 1    1      31 (0x1f, \037)
ID2                IDentification 2    1      139 (0x8b, \213) [a]
CM                 Compression Method  1      8 - denotes the "deflate" compression method
FLG                FLaGs               1      0 for most cases
MTIME              Modification TIME   4      modification time of the original file
XFL                eXtra FLags         1      2 - compressor used maximum compression, slowest algorithm [b]
OS                 Operating System    1      3 - Unix
extra fields       -                   -      variable length, field indicated by FLG [c]
compressed blocks  -                   -      variable length
CRC32              -                   4      CRC value of the uncompressed data
ISIZE              Input SIZE          4      the size of the uncompressed input data modulo 2^32
Notes:
a. ID2 value can be 158 (0x9e, \236) for gzip 0.5;
b. XFL value 4 - compressor used fastest algorithm;
c. FLG bit 0, FTEXT, does not indicate any "extra field".

We can use this file format knowledge to find out the beginning of gzipped linux/vmlinux.
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | grep '1f 8b 08 00'
00004c50  1f 8b 08 00 01 f6 e1 3f  02 03 ec 5d 7d 74 14 55  |.......?...]}t.U|
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 -s 0x4c40 -n 64
00004c40  00 80 0b 00 00 fc 21 00  68 00 00 00 1e 01 11 00  |......!.h.......|
00004c50  1f 8b 08 00 01 f6 e1 3f  02 03 ec 5d 7d 74 14 55  |.......?...]}t.U|
00004c60  96 7f d5 a9 d0 1d 4d ac  56 93 35 ac 01 3a 9c 6a  |......M.V.5..:.j|
00004c70  4d 46 5c d3 7b f8 48 36  c9 6c 84 f0 25 88 20 9f  |MF\.{.H6.l..%. .|
00004c80
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | tail -n 4
00114d40  bd 77 66 da ce 6f 3d d6  33 5c 14 a2 9f 7e fa e9  |.wf..o=.3\...~..|
00114d50  a7 9f 7e fa ff 57 3f 00  00 00 00 00 d8 bc ab ea  |..~..W?.........|
00114d60  44 5d 76 d1 fd 03 33 58  c2 f0 00 51 27 00        |D]v...3X...Q'.|
00114d6e
We can see that the gzipped file begins at 0x4c50 in the above example. The four bytes before "1f 8b 08 00" are input_len (0x0011011e, stored in little-endian order), and 0x4c50+0x0011011e=0x114d6e equals the size of the bzImage (/boot/vmlinuz-2.4.20-28.9).
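The same search can be done programmatically. The sketch below is a minimal illustration with made-up helper names; the sample[] array holds the 32 bytes shown at file offset 0x4c40 in the hexdump above. Note that a three-byte match can in principle occur by chance elsewhere in an image, so a real tool should validate the rest of the header as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the offset of the first "1f 8b 08" sequence, or -1 if absent. */
static long find_gzip_magic(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + 3 <= len; i++)
        if (buf[i] == 0x1f && buf[i + 1] == 0x8b && buf[i + 2] == 0x08)
            return (long)i;
    return -1;
}

/* The 32 bytes at file offset 0x4c40 in the hexdump above. */
const unsigned char sample[32] = {
    0x00, 0x80, 0x0b, 0x00, 0x00, 0xfc, 0x21, 0x00,
    0x68, 0x00, 0x00, 0x00, 0x1e, 0x01, 0x11, 0x00,
    0x1f, 0x8b, 0x08, 0x00, 0x01, 0xf6, 0xe1, 0x3f,
    0x02, 0x03, 0xec, 0x5d, 0x7d, 0x74, 0x14, 0x55
};

/* File offset of the gzip stream plus input_len gives the image size. */
static unsigned long image_size_from_sample(void)
{
    long off = find_gzip_magic(sample, sizeof sample);

    assert(off >= 4);               /* input_len sits in the 4 bytes before the magic */
    return 0x4c40ul + (unsigned long)off + read_le32(sample + off - 4);
}
```

For the sample bytes, the magic is found 16 bytes into the array (file offset 0x4c50), input_len reads back as 0x0011011e, and the computed image size is 0x114d6e.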

static uch *inbuf;           /* input buffer */
static unsigned insize = 0;  /* valid bytes in inbuf */
static unsigned inptr = 0;   /* index of next byte to be processed in inbuf */
///////////////////////////////////////////////////////////////////////////////
static int gunzip(void)
{
        Check input buffer for {ID1, ID2, CM}, must be
                {0x1f, 0x8b, 0x08} (normal case), or
                {0x1f, 0x9e, 0x08} (for gzip 0.5);
        Check FLG (flag byte), must not set bit 1, 5, 6 and 7;
        Ignore {MTIME, XFL, OS};
        Handle optional structures, which correspond to FLG bit 2, 3 and 4;
        inflate();              // handle compressed blocks
        Validate {CRC32, ISIZE};
}
When get_byte(), defined in linux/arch/i386/boot/compressed/misc.c, is called for the first time, it calls fill_inbuf() to set up the input buffer with inbuf=input_data and insize=input_len. The symbols input_data and input_len are defined in the piggy.o linker script. See Section 2.5.
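The header checks outlined for gunzip() can be sketched as a small C function. This is an illustration, not the inflate.c code: gzip_header_length() and the flag macro names are chosen for this sketch, errors are reported as -1 instead of through error(), and the optional fields are skipped following the field layout of RFC 1952.

```c
#include <assert.h>
#include <stddef.h>

#define FLG_BIT1    0x02   /* bit 1: must not be set */
#define FLG_EXTRA   0x04   /* bit 2: extra field present */
#define FLG_NAME    0x08   /* bit 3: original file name present */
#define FLG_COMMENT 0x10   /* bit 4: file comment present */
#define FLG_BIT5    0x20   /* bit 5: must not be set */
#define FLG_BITS67  0xC0   /* bits 6 and 7: must not be set */

/* Return the offset of the first compressed block,
 * or -1 if the header is rejected or truncated. */
static long gzip_header_length(const unsigned char *p, size_t len)
{
    size_t pos = 10;        /* ID1 ID2 CM FLG MTIME(4) XFL OS */
    unsigned char flg;

    assert(p != NULL);
    if (len < 10 || p[0] != 0x1f || (p[1] != 0x8b && p[1] != 0x9e) || p[2] != 0x08)
        return -1;
    flg = p[3];
    if (flg & (FLG_BIT1 | FLG_BIT5 | FLG_BITS67))
        return -1;
    if (flg & FLG_EXTRA) {  /* 2-byte little-endian length, then that many bytes */
        size_t xlen;
        if (pos + 2 > len)
            return -1;
        xlen = (size_t)p[pos] | ((size_t)p[pos + 1] << 8);
        pos += 2 + xlen;
    }
    if (flg & FLG_NAME) {   /* zero-terminated file name */
        while (pos < len && p[pos] != 0)
            pos++;
        pos++;
    }
    if (flg & FLG_COMMENT) {/* zero-terminated comment */
        while (pos < len && p[pos] != 0)
            pos++;
        pos++;
    }
    return pos <= len ? (long)pos : -1;
}
```

For the bzImage in the hexdump above, FLG is 0, so the header is exactly 10 bytes long and the first compressed block starts at file offset 0x4c5a.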


5.3. inflate()

// some important definitions in misc.c
#define WSIZE 0x8000            /* Window size must be at least 32k,
                                 * and a power of two */
static uch window[WSIZE];       /* Sliding window buffer */
static unsigned outcnt = 0;     /* bytes in output buffer */

// linux/lib/inflate.c
#define wp outcnt
#define flush_output(w) (wp=(w),flush_window())
STATIC unsigned long bb;        /* bit buffer */
STATIC unsigned bk;             /* bits in bit buffer */
STATIC unsigned hufts;          /* track memory usage */
static long free_mem_ptr = (long)&end;
///////////////////////////////////////////////////////////////////////////////
STATIC int inflate()
{
        int e;                  /* last block flag */
        int r;                  /* result code */
        unsigned h;             /* maximum struct huft's malloc'ed */
        void *ptr;

        wp = bb = bk = 0;
        h = 0;

        // inflate compressed blocks one by one
        do {
                hufts = 0;
                gzip_mark() { ptr = free_mem_ptr; };
                if ((r = inflate_block(&e)) != 0) {
                        gzip_release() { free_mem_ptr = ptr; };
                        return r;
                }
                gzip_release() { free_mem_ptr = ptr; };
                if (hufts > h)
                        h = hufts;
        } while (!e);

        /* Undo too much lookahead. The next read will be byte aligned so we
         * can discard unused bits in the last meaningful byte. */
        while (bk >= 8) {
                bk -= 8;
                inptr--;
        }

        /* write the output window window[0..outcnt-1] to output_data,
         * update output_ptr/output_data, crc and bytes_out accordingly, and
         * reset outcnt to 0. */
        flush_output(wp);

        /* return success */
        return 0;
}
free_mem_ptr is used in misc.c:malloc() for dynamic memory allocation. Before inflating each compressed block, gzip_mark() saves the value of free_mem_ptr; after inflation, gzip_release() restores this value. This is how the memory allocated in inflate_block() gets "freed".
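The mark/release scheme is a simple bump allocator. The sketch below shows the idea in portable C under stated assumptions: arena[], bump_alloc(), heap_mark() and heap_release() are hypothetical names, the heap is a fixed array instead of the memory following &end, and running out of memory returns NULL where misc.c would report an error.

```c
#include <assert.h>
#include <stddef.h>

static unsigned char arena[256];    /* hypothetical heap */
static size_t free_off;             /* plays the role of free_mem_ptr, as an offset */

/* Allocate size bytes by advancing the free offset; nothing is freed individually. */
static void *bump_alloc(size_t size)
{
    size_t start = (free_off + 3) & ~(size_t)3;     /* round the offset up to 4 bytes */

    if (start > sizeof arena || size > sizeof arena - start)
        return NULL;                                /* out of memory */
    free_off = start + size;
    return arena + start;
}

/* Remember the current allocation point. */
static size_t heap_mark(void)
{
    return free_off;
}

/* Drop everything allocated since the matching heap_mark(). */
static void heap_release(size_t mark)
{
    assert(mark <= free_off);
    free_off = mark;
}
```

Because all allocations made while inflating one block are released together, no per-allocation bookkeeping is needed, which keeps the decompressor small.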

Gzip uses Lempel-Ziv coding (LZ77) to compress files. The compressed data format is specified in RFC 1951. inflate_block() will inflate compressed blocks, which can be treated as a bit sequence.

The data structure of each compressed block is outlined below:
BFINAL (1 bit)
    0  - not the last block
    1  - the last block
BTYPE  (2 bits)
    00 - no compression
        remaining bits until the byte boundary;
        LEN      (2 bytes);
        NLEN     (2 bytes, the one's complement of LEN);
        data     (LEN bytes);
    01 - compressed with fixed Huffman codes
        {
        literal  (7-9 bits, represent code 0..287, excluding 256);
                     // See RFC 1951, table in Paragraph 3.2.6.
        length   (0-5 bits if literal > 256, represent length 3..258);
                     // See RFC 1951, 1st alphabet table in Paragraph 3.2.5.
        data     (of literal bytes if literal < 256);
        distance (5 plus 0-13 extra bits if literal == 257..285, represent
                         distance 1..32768);
                     /* See RFC 1951, 2nd alphabet table in Paragraph 3.2.5,
                      *   but note the statement in Paragraph 3.2.6. */
                     /* Move backward "distance" bytes in the output stream,
                      * and copy "length" bytes */
        }*           // can be of multiple instances
        literal  (7 bits, all 0, literal == 256, means end of block);
    10 - compressed with dynamic Huffman codes
        HLIT     (5 bits, # of Literal/Length codes - 257, 257-286);
        HDIST    (5 bits, # of Distance codes - 1,         1-32);
        HCLEN    (4 bits, # of Code Length codes - 4,      4 - 19);
        Code Length sequence    ((HCLEN+4)*3 bits)
        /* The following two alphabet tables will be decoded using
         *   the Huffman decoding table which is generated from
         *   the preceding Code Length sequence. */
        Literal/Length alphabet (HLIT+257 codes)
        Distance alphabet       (HDIST+1 codes)
        // Decoding tables will be built from these alphabet tables.
        /* The following is similar to that of fixed Huffman codes portion,
         *   except that they use different decoding tables. */
        {
        literal/length
                 (variable length, depending on Literal/Length alphabet);
        data     (of literal bytes if literal < 256);
        distance (variable length if literal == 257..285, depending on
                         Distance alphabet);
        }*           // can be of multiple instances
        literal  (literal value 256, which means end of block);
    11 - reserved (error)
Note that data elements are packed into bytes starting from the Least-Significant Bit (LSB) toward the Most-Significant Bit (MSB), while Huffman codes are packed starting with the MSB. Also note that literal values 286-287 and distance codes 30-31 will never actually occur.
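The LSB-first packing of data elements can be tried out on the bzImage from Section 5.2. The sketch below is an illustration with hypothetical names (struct bit_reader, get_bits(), read_dynamic_header()); the fields bb and bk are named after the bit buffer variables of inflate.c. The three input bytes are the ones that follow the 10-byte gzip header in the hexdump, at file offset 0x4c5a.

```c
#include <assert.h>
#include <stddef.h>

struct bit_reader {
    const unsigned char *in;    /* input bytes */
    size_t len;                 /* number of input bytes */
    size_t pos;                 /* index of the next unread byte */
    unsigned long bb;           /* bit buffer, next bit to deliver is bit 0 */
    unsigned bk;                /* number of valid bits in bb */
};

/* Take n bits (n <= 16), least-significant bit first. */
static unsigned get_bits(struct bit_reader *r, unsigned n)
{
    unsigned v;

    assert(n <= 16);
    while (r->bk < n) {
        assert(r->pos < r->len);    /* this sketch simply asserts on truncated input */
        r->bb |= (unsigned long)r->in[r->pos++] << r->bk;
        r->bk += 8;
    }
    v = (unsigned)(r->bb & ((1ul << n) - 1));
    r->bb >>= n;
    r->bk -= n;
    return v;
}

struct block_header { unsigned bfinal, btype, hlit, hdist, hclen; };

/* Read BFINAL, BTYPE and the three counts of a dynamic Huffman block. */
static struct block_header read_dynamic_header(struct bit_reader *r)
{
    struct block_header h;

    h.bfinal = get_bits(r, 1);
    h.btype  = get_bits(r, 2);
    assert(h.btype == 2);       /* HLIT/HDIST/HCLEN exist only for BTYPE 10 */
    h.hlit   = get_bits(r, 5);
    h.hdist  = get_bits(r, 5);
    h.hclen  = get_bits(r, 4);
    return h;
}

/* First three bytes of compressed data in the hexdump of Section 5.2. */
const unsigned char first_bytes[3] = { 0xec, 0x5d, 0x7d };
```

Decoding first_bytes gives BFINAL=0, BTYPE=2 (dynamic Huffman codes), HLIT=29, HDIST=29 and HCLEN=10, i.e. 286 literal/length codes, 30 distance codes and 14 code length codes for the first block.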

With the above data structure in mind and RFC 1951 at hand, it is not too hard to understand inflate_block(). Refer to related paragraphs in RFC 1951 for Huffman coding and alphabet table generation.

For more details, refer to linux/lib/inflate.c, gzip source code (many in-line comments) and related reference materials.


6. linux/arch/i386/kernel/head.S

Resident kernel image linux/vmlinux is in place finally! It requires two inputs:

  • ESI, to indicate where the 16-bit real mode code is located, aka INITSEG<<4;

  • BX, to indicate which CPU is running, 0 means BSP, other values for AP.

ESI points to the parameter area from the 16-bit real mode code, which will be copied to empty_zero_page later. ESI is only valid for BSP.

BSP (BootStrap Processor) and AP (Application Processor) are Intel terms. Check IA-32 Manual (Vol.3. Ch.7.5. Multiple-Processor (MP) Initialization) and MultiProcessor Specification for MP initialization details.

From a software point of view, in a multiprocessor system, the BSP and the APs share physical memory but each uses its own register set. The BSP runs the kernel code first, sets up the OS execution environment, and then triggers the APs to run the kernel code too. An AP sleeps until the BSP kicks it.


6.1. Enable Paging

.text
///////////////////////////////////////////////////////////////////////////////
startup_32()
{
        /* set segments to known values */
        cld;
        DS = ES = FS = GS = __KERNEL_DS;

#ifdef CONFIG_SMP
#define cr4_bits mmu_cr4_features-__PAGE_OFFSET
        /* long mmu_cr4_features defined in linux/arch/i386/kernel/setup.c
         * __PAGE_OFFSET = 0xC0000000, i.e. 3G */

        // AP with CR4 support (> Intel 486) will copy CR4 from BSP
        if (BX && cr4_bits) {
                // turn on paging options (PSE, PAE, ...)
                CR4 |= cr4_bits;
        } else
#endif
        {
                /* only BSP initializes page tables (pg0..empty_zero_page-1)
                 *   pg0 at .org 0x2000
                 *   empty_zero_page at .org 0x4000
                 *   total (0x4000-0x2000)/4 = 0x0800 entries */
                pg0 = {
                        0x00000007,             // 7 = PRESENT + RW + USER
                        0x00001007,             // 0x1000 = 4096 = 4K
                        0x00002007,
                        ...
                pg1:    0x00400007,
                        ...
                        0x007FF007              // total 8M
                empty_zero_page:
                };
        }
Why do we have to add "-__PAGE_OFFSET" when referring to a kernel symbol such as pg0?

In linux/arch/i386/vmlinux.lds, we have:
  . = 0xC0000000 + 0x100000;
  _text = .;                    /* Text and read-only data */
  .text : {
        *(.text)
...
As pg0 is at offset 0x2000 of section .text in linux/arch/i386/kernel/head.o, which is the first file to be linked for linux/vmlinux, it will be at offset 0x2000 in output section .text. Thus it will be located at address 0xC0000000+0x100000+0x2000 after linking.
[root@localhost boot]# nm --defined /boot/vmlinux-2.4.20-28.9 | grep 'startup_32\|mmu_cr4_features\|pg0\|\<empty_zero_page\>' | sort
c0100000 t startup_32
c0102000 T pg0
c0104000 T empty_zero_page
c0376404 B mmu_cr4_features
In protected mode without paging enabled, a linear address maps directly to the same physical address. "movl $pg0-__PAGE_OFFSET,%edi" will set EDI=0x102000, which is equal to the physical address of pg0 (as linux/vmlinux is relocated to 0x100000). Without this "-__PAGE_OFFSET" scheme, it would access physical address 0xC0102000, which is wrong and probably beyond RAM space.

mmu_cr4_features is in .bss section and is located at physical address 0x376404 in the above example.
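Both the address translation and the page table contents can be reproduced with a few lines of C. The sketch is a user-space model only: the array pg[] stands in for the 0x800 entries between pg0 and empty_zero_page, and the macro and function names are hypothetical. The constants come from the listing and the nm output above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_MODEL 0xC0000000u   /* __PAGE_OFFSET, i.e. 3G */
#define PTE_FLAGS         0x007u        /* PRESENT + RW + USER */
#define PG_ENTRIES        0x800u        /* (0x4000 - 0x2000) / 4 */

static uint32_t pg[PG_ENTRIES];         /* stands in for pg0..empty_zero_page-1 */

/* Link-time (virtual) address of a kernel symbol -> physical address. */
static uint32_t virt_to_phys_model(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET_MODEL);
    return vaddr - PAGE_OFFSET_MODEL;
}

/* Entry i points at physical page i, so 0x800 entries cover the first 8M. */
static void fill_page_tables(void)
{
    uint32_t i;

    for (i = 0; i < PG_ENTRIES; i++)
        pg[i] = (i << 12) | PTE_FLAGS;
}
```

After fill_page_tables(), pg[0] is 0x00000007, pg[0x400] (the first entry of pg1) is 0x00400007 and the last entry is 0x007FF007, matching the listing. Subtracting __PAGE_OFFSET maps c0102000 (pg0) to 0x102000 and c0376404 (mmu_cr4_features) to 0x376404.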

After page tables are initialized, paging can be enabled.
        // set page directory base pointer, physical address
        CR3 = swapper_pg_dir - __PAGE_OFFSET;
        // paging enabled!
        CR0 |= 0x80000000;      // set PG bit
        goto 1f;                // flush prefetch-queue
1:
        EAX = &1f;              // address following the next instruction
        goto