How I create high-quality web disassemblies from hand-crafted code
As discussed in the overview, all of my disassembly websites are generated from a set of source code repositories in GitHub by a monstrous Python script called create-disassembly-websites.py.
Before I say anything else, I must warn you that this Python script is a textbook example of a little script that grew into a big script without being properly refactored on the way. There are global variables - lots of them. There are no classes and there are no objects. The entire beast is one massive Python file with more than 8000 lines of code. The main routine is gargantuan. Content, configuration and code are freely mixed. There are practically no comments. The text-processing routines are heavily based around regular expressions. The results of all the code analyses are stored in nested lists and dictionaries when an in-memory database would make a lot more sense. There are if-blocks so long you can see the curvature of the Earth in them. And so it goes.
It is, in short, a bit of a mess. But it works, it's relatively easy to follow (outside of the regular expressions), and although it's not much more than a hacky script, it's my hacky script, and I have grown rather fond of it. As a professional coder I know it's not pretty, but what it produces is a work of art, so bear this in mind if you're tempted to dive into the code.
What the script does
--------------------
The create-disassembly-websites.py script has two main functions.
The first function (and the one we'll be talking about here) is to generate source code pages and indexes for the Aviator, Revs and Lander websites. It does this for each site by ingesting a hand-built repository that contains the fully buildable and fully documented source code for the relevant game, and spitting out a website version as a set of HTML files.
These are the source code repositories for the three websites I just mentioned:
The script reads the files from each repository and creates the code pages for the corresponding website, which are then combined with the static web content from the bbcelite-websites repository to create the final website. You can read more about this process in the overview, but here's a flowchart of how this works - in this example, we're processing the Revs website:
 revs-source-code-bbc-micro                         bbcelite-websites
              |                                             |
              |                                             |
create-disassembly-websites.py                              |
              |                                             |
              |                                             |
              |                                             |
+-------------|---------- revs.bbcelite.com website --------|-------------+
|             |                                             |             |
|             v                                             v             |
|                                                                         |
|         Code pages                                    Homepage          |
|         Indexes                                       About site        |
|         Statistics                                    Deep dives        |
|         Version info                                                    |
|                                                                         |
+-------------------------------------------------------------------------+
In this article we're going to take a look at the left-hand arrow, which converts source code into web pages.
The second function of the script (which we'll look at separately) is to generate content for the more complex Elite website. In this case, not only does the script generate the code pages for nine different versions of Elite, but it also creates the code pages and indexes for the compare section, which contains a line-by-line comparison of all the Acornsoft releases of Elite. The way this works is tied up with the structure of the Elite library repository, and you can read all about it in the articles on generating source code repositories for Elite and generating code comparisons for Elite.
For the rest of this article, we'll concentrate on the simpler process of generating the Aviator, Revs and Lander sites from their respective repositories.
A deeper look at the website generation process
-----------------------------------------------
For our example, let's look at the Revs website, which is generated from the source code repository at revs-source-code-bbc-micro.
The site is updated by running the generate-revs.sh script. If you want to have a go at running this process yourself, the bbcelite-scripts repository contains step-by-step instructions for setting up and running the scripts. The process has been built on a Mac, but it wouldn't take much effort to get it working on Linux or Windows.
The shell script does three things. First it clears down the folder into which we will generate the website. Then it runs the create-disassembly-websites.py Python script to generate up-to-date website pages from the repository source code and save them into this folder. And finally, it syncs the results to the website itself.
We're interested in the middle step, which is a simple one-liner:
python3 create-disassembly-websites.py revs
The "revs" argument tells the script to generate the Revs site; this value gets put into the args.platform variable, which you'll see scattered throughout the script. Valid values for the platform are: cassette, disc, electron, 6502sp, c64, apple, master, nes, elite-a, aviator, revs and lander (the first nine are for the different versions of Elite).
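To make the role of the platform argument concrete, here is a minimal sketch of how a single positional argument could end up in args.platform. It assumes the standard argparse module; the article only confirms the args.platform name and the list of valid values, so the parse_arguments() helper and the PLATFORMS constant are hypothetical names used for illustration.

```python
import argparse

# Platform names as listed above; the first nine are versions of Elite.
PLATFORMS = [
    "cassette", "disc", "electron", "6502sp", "c64", "apple",
    "master", "nes", "elite-a", "aviator", "revs", "lander",
]


def parse_arguments(argv=None):
    """Parse the command line (hypothetical helper, not from the real script)."""
    parser = argparse.ArgumentParser(
        description="Generate a disassembly website from a source code repository"
    )
    # Restricting the argument with choices= means a typo is rejected up front
    # instead of failing somewhere deep inside the generation process.
    parser.add_argument("platform", choices=PLATFORMS, help="which site to generate")
    return parser.parse_args(argv)


# Equivalent to running: python3 create-disassembly-websites.py revs
args = parse_arguments(["revs"])
print(f"Generating site for platform: {args.platform}")
```

With this approach an unknown platform name makes argparse print a usage message and exit, which is a cheap way to guard a long-running script.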
This kicks off the ingestion process. Here's a summary of what the script does, along with links to relevant examples of the output on the Revs website:
- Call create_folder() to create the folders we need to hold the generated website.
- Read the source code files from the repository into memory.
- For Elite only, print "Analysing files for comparison" and call analyse_files_for_compare() to ingest the source code library and populate global variables such as includes_in_versions{} and all_includes{}.
- Print "Extracting popup data" to the terminal.
- Call extract_popup_data() for each of the source code files to ingest the code and populate global variables such as references_library{}, entry_points{} and configuration_variables{}.
- Print "Writing articles" to the terminal.
- Call output_individual_code_pages() for each of the source code files to create the individual code pages (i.e. one routine per page).
- Call output_map_of_source_code() to create the map of the source code.
- Call output_source_code_stats() to create the source code statistics page.
- Call output_source_code_cross_references() to create the source code cross-reference page.
- Print "Writing large source code pages" to the terminal.
- Call output_large_source_code_page() to create the large source code pages (i.e. continuous source code).
- Print "Writing menus" to the terminal.
- Call output_menus() to create the navigation_revs.php include file containing the generated left-hand navigation.
- Print "Writing indexes" to the terminal.
- Call add_workspace_variables_to_indexes() to merge the workspace variables into variables{} for inclusion in the indexes.
- Call add_entry_points_to_indexes() to merge entry_points{} into subroutines{} for inclusion in the indexes.
- Call output_indexes() to create the indexes by category.
- Call output_a_z_index() to create the A-Z index.
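The sequence above is essentially a two-pass process: every file is scanned first to collect labels, and only then are the pages written, so a page can link to a label that is defined in a file processed later. The following is a heavily simplified sketch of that idea. The function names are borrowed from the list above, but the bodies are placeholders of my own, the dot-prefixed label syntax is an assumption about the assembler source, and the link markup is invented for illustration.

```python
# Hypothetical two-pass skeleton; the bodies are placeholders, not the real script.
references_library = {}  # label name -> file that defines it, filled in pass 1


def extract_popup_data(name, lines):
    """Pass 1: record every label so that later pages can link to it."""
    for line in lines:
        if line.startswith("."):  # assumption: labels are written as .Label
            references_library[line[1:].strip()] = name


def output_individual_code_pages(name, lines):
    """Pass 2: mark up each line, linking any word that is a known label."""
    marked_up = []
    for line in lines:
        words = [
            f"<a href='{references_library[w]}'>{w}</a>" if w in references_library else w
            for w in line.split()
        ]
        marked_up.append(" ".join(words))
    return marked_up


def generate_site(source_files):
    print("Extracting popup data")
    for name, lines in source_files.items():
        extract_popup_data(name, lines)

    print("Writing articles")
    return {
        name: output_individual_code_pages(name, lines)
        for name, lines in source_files.items()
    }


# Made-up sample input: main.asm calls a routine that is defined in track.asm.
pages = generate_site({
    "main.asm": [".Start", "JSR DrawTrack"],
    "track.asm": [".DrawTrack", "RTS"],
})
print(pages["main.asm"][1])
```

Because the first loop finishes before the second one starts, the reference to DrawTrack in main.asm resolves even though track.asm is read afterwards.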
This should be enough information to satisfy most people's curiosity, but if you want to know exactly how the source code sections of my disassembly sites are generated, then your next step is to look at how the script works. If you choose to go down this rabbit hole, here is some more information to guide you on your way.
Global variables
----------------
Here's a list of important global variables that are used throughout the create-disassembly-websites.py script. You will find this useful if you want to poke through the code:
- Populated in extract_popup_data():
  - configuration_variables{} = a dictionary of every configuration variable in the source
  - references_library{} = a dictionary of every label in the source (i.e. variable, subroutine etc.)
- Populated in extract_popup_data() > parse_header():
  - all_headers[] = a list of the data in each header in the source code, in the order in which they appear, to be used for generating the map of the source code
  - macro_names[] = a list of macro names from the source
- Populated in extract_popup_data() > parse_header() > add_category():
  - categories{} = a dictionary of all categories in the source (taken from the Category headers)
- Populated in extract_popup_data() > parse_header() > add_mentions():
  - mentions{} = a dictionary of parent routines for each routine in the source, to be used in generating the "More info" section
- Populated in extract_entry_point() > add_article():
  - entry_points{} = a dictionary of entry points and associated data, grouped by category
- Populated in build_individual_code_page() > add_article():
  - variables{} = a dictionary containing the contents of every variable header in the source
  - subroutines{} = a dictionary containing the contents of every subroutine header in the source
  - workspaces{} = a dictionary containing the contents of every workspace header in the source
  - macros{} = a dictionary containing the contents of every macro header in the source
- Populated in build_individual_code_page() > add_source_code_stats():
  - source_code_stats{} = a dictionary that keeps track of the number of data bytes and instructions in each routine and variable in the source
- Populated in tidy_code():
  - references[] = a set that contains all popup references in the current page
The script writes the contents of some of these variables into the following files in the disassembly-website-generator/debug folder: all_headers.txt, entry_points.txt, mentions.txt, references_library.txt, source_code_stats.txt. On top of this, any unparsed text from the top level in the source code files is written into the output_all.txt file.
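If you want to experiment with this kind of debug output yourself, one simple way to dump nested lists and dictionaries into readable text files is the standard pprint module. This is only a sketch of the general technique: the write_debug_files() helper, the sample data and the exact file format are my own assumptions and are not taken from the script.

```python
import pprint
import tempfile
from pathlib import Path


def write_debug_files(debug_folder, structures):
    """Write each named structure to <name>.txt (hypothetical helper)."""
    folder = Path(debug_folder)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for name, value in structures.items():
        path = folder / f"{name}.txt"
        # pformat produces a stable, human-readable layout for nested data
        path.write_text(pprint.pformat(value) + "\n", encoding="utf-8")
        written.append(path.name)
    return sorted(written)


# Made-up contents, purely to show the shape of the output.
sample = {
    "mentions": {"DrawTrack": ["MainLoop"]},
    "all_headers": [{"name": "MainLoop", "type": "Subroutine"}],
}

with tempfile.TemporaryDirectory() as tmp:
    print(write_debug_files(Path(tmp) / "debug", sample))
```

Plain text dumps like these are easy to diff between runs, which helps when a change to a regular expression has unexpected knock-on effects.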
Call hierarchy
--------------
Here's a call hierarchy of the above processes, which will help you orientate yourself if you want to look through the script. This is not a breakdown of each routine's actions; it's just a list of function usage in the script, so it's more a map for your own investigations than a full explanation.
In the following, a + indicates a routine that is called from multiple places, while a - indicates a routine that is only called once in the whole program.
Routine | Details |
---|---|
+ create_folder() | Create skeleton website folders |
+ extract_popup_data() | Populate references_library{}, entry_points{}, configuration_variables{} |
+ add_to_references_library() | Add entry to references_library{} |
+ fetch_header_summary() | Extract multi-line summary from header |
- parse_header() | Called once header has been extracted |
+ add_to_references_library() | Add entry to references_library{} |
+ add_category() | Add category to categories{} and create folder |
- extract_entry_point() | Called when we come across an entry point |
- fetch_header_comments() | Fetch multi-line comments from header |
+ add_to_references_library() | Add entry to references_library{} |
+ add_article() | Add entry to entry_points{} |
+ tidy_code() | Add markup to a source code line |
- markup_operand() | Add markup to operands |
- extract_labels() | Called as a catch-all for body content |
+ fetch_comments() | Fetch multi-line comments from code |
+ add_source_code_stats() | Add counts to source_code_stats{} |
+ add_to_references_library() | Add entry to references_library{} |
+ add_mentions() | Add references to mentions{} for source code cross-reference page |
+ output_individual_code_pages() | Create individual code pages, one routine per page |
+ fetch_header_summary() | Extract multi-line summary from header |
- build_individual_code_page() | Called once header has been extracted |
+ add_category() | Add category to categories{} and create folder |
+ add_article() | Add entry to variables{}, workspaces{}, macros{} or subroutines{} |
+ start_code_html() | Output the start of the HTML code page |
- fetch_next_prev() | Fetch correct next/previous array for this page |
+ output_next_prev() | Output next/previous links from fetched array |
+ routine_extra_data() | Create the 'More info' references list to add to headers |
+ tidy_source_header_line() | Add markup to a source code header line |
+ tidy_code() | Add markup to a source code line |
- markup_operand() | Add markup to operands |
+ add_source_code_stats() | Add counts to source_code_stats{} |
+ add_reference_popups() | Output popup HTML for all references in references[] |
+ add_mentions() | Add references to mentions{} for source code cross-reference page |
+ end_code_html() | Output the end of the HTML code page |
- output_map_of_source_code() | Create source code map page |
+ start_html() | Output the start of the HTML page |
+ output_next_prev() | Output next/previous links for page |
+ end_html() | Output the end of the HTML page |
- output_source_code_stats() | Create source code stats page |
+ start_html_index() | Output the start of the HTML index page |
+ output_next_prev() | Output next/previous links for page |
- percentage() | Display a percentage |
- padding() | Pad a number |
+ end_html() | Output the end of the HTML page |
- output_source_code_cross_references() | Create source code cross-reference page |
+ start_html() | Output the start of the HTML page |
+ output_next_prev() | Output next/previous links for page |
+ fetch_cross_references() | Create a list of cross-references for this entry |
+ end_html() | Output the end of the HTML page |
+ output_large_source_code_page() | Create large source code page (e.g. Elite A, Ship blueprints) |
+ start_html() | Output the start of the HTML page |
+ output_next_prev() | Output next/previous links for page |
- large_source_code_page_contents() | Called once for each large source code that's output |
+ fetch_header_summary() | Extract multi-line summary from header |
- build_large_source_code_page() | Called once header has been extracted |
+ routine_extra_data() | Create the 'More info' references list to add to headers |
+ tidy_source_header_line() | Add markup to a source code header line |
+ tidy_code() | Add markup to a source code line |
- markup_operand() | Add markup to operands |
+ fetch_cross_references() | Create a list of cross-references for this entry |
+ tidy_source_header_line() | Add markup to a source code header line |
+ tidy_code() | Add markup to a source code line |
- markup_operand() | Add markup to operands |
+ add_reference_popups() | Output popup HTML for all references in references[] |
+ end_html() | Output the end of the HTML page |
- output_menus() | Create navigation.php |
- add_workspace_variables_to_indexes() | Merge workspace variables into variables{} for inclusion in indexes |
- add_entry_points_to_indexes() | Merge entry_points{} into subroutines{} for inclusion in indexes |
+ output_indexes() | Create indexes by category |
+ start_html() | Output the start of the HTML page |
+ output_next_prev() | Output next/previous links for page |
+ end_html() | Output the end of the HTML page |
- output_a_z_index() | Create A-Z index |
+ start_html() | Output the start of the HTML page |
+ output_next_prev() | Output next/previous links for page |
+ end_html() | Output the end of the HTML page |
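If you would like to check the + and - markers above against a Python file of your own, the standard ast module can count call sites for you. The sketch below is not how the table above was produced (the article doesn't say). It is one possible approach, and it only counts direct calls by name to functions defined in the same source, so the call_site_markers() helper is a hypothetical starting point.

```python
import ast
from collections import Counter


def call_site_markers(source):
    """Map each called function to '+' (several call sites) or '-' (one)."""
    tree = ast.parse(source)

    # Names of all functions defined anywhere in the source
    defined = {
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    }

    # Count direct calls such as tidy_code(...); method calls are ignored
    counts = Counter(
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in defined
    )

    return {name: "+" if counts[name] > 1 else "-" for name in sorted(counts)}


# Made-up sample that mirrors a small part of the hierarchy above.
sample_source = '''
def markup_operand(line):
    return line

def tidy_code(line):
    return markup_operand(line)

def extract_entry_point(line):
    return tidy_code(line)

def build_individual_code_page(line):
    return tidy_code(line)
'''

markers = call_site_markers(sample_source)
for name, marker in markers.items():
    print(marker, name + "()")
```

In the sample, tidy_code() has two call sites and markup_operand() has one, which matches how those two routines are marked in the table.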
Note that if you do decide to investigate the guts of this script, then you will see a lot of references to "stage". The stage refers to a specific part of the codebase, such as the loader or docked code or text tokens. Typically this refers to an individual stage of the assembly process, hence the name. Stage names are often displayed in brackets after a routine's name, so you might end up with names like "MESS (Docked)" or "DORND (Loader)".
Good luck if you decide to venture in. Here be dragons...