Project 1-1: RISC-V Assembler

Computer Architecture I ShanghaiTech University

IMPORTANT INFO - PLEASE READ

The projects are part of your design project, worth 2 credit points. As such, they run in parallel to the actual course, so be aware that the due dates for projects and homework might be very close to each other! Start early and do not procrastinate.

Getting started

Make sure you read through the entire specification before starting the project.

The whole project is split into two parts, Project 1.1 and Project 1.2. They will be autograded separately and have their own deadlines.

You will be using gitlab to collaborate with your group partner. Autolab will use the files from gitlab. Make sure that you have access to gitlab. In the group CS110_21s_Projects you should have access to your project 1.1 project. Also, in the group CS110_21s, you should have access to the p1.1_framework.

Obtain your files

  1. Clone your p1.1 repository from gitlab. You may want to change http to https.
    git clone https://autolab.sist.shanghaitech.edu.cn/gitlab/cs110_21s_projects/p1.1_xxx_xxx.git (replace xxx_xxx with your project name)
  2. In the repository add a remote repo that contains the framework files:
    git remote add framework https://autolab.sist.shanghaitech.edu.cn/gitlab/cs110_21s/p1.1_framework.git (or change to http)
  3. Go and fetch the files:
    git fetch framework
  4. Now merge those files with your master branch:
    git merge framework/master
  5. The rest of the git commands work as usual.

How to Autolab

  1. Edit the text file autolab.txt. The first line has to be the name of your p1.1 project in gitlab, for example p1.1_email1_email2.
  2. The following lines have to contain a long, random secret. Commit and push to gitlab. We will test the length and randomness of this secret by running tar -cjf size.tar.bz2 autolab.txt.
  3. When you want to run the autograder in autolab, you have to upload your autolab.txt. Autolab will clone, from gitlab, the master branch of the repo specified in the autolab.txt you uploaded and then continue grading only if all of these conditions are met:
    1. The autolab.txt you uploaded and the one in the cloned repo are identical.
    2. The size of the generated size.tar.bz2 is at least 1000B.
    3. Only the files from the framework are present in the cloned repo.

Collaborative Coding and Frequent Pushing

You have to work on this project as a team. We invite you to use all of the features of gitlab for your project, for example branches, issues, wiki, milestones, etc.

We require you to push very frequently to gitlab. In your commits we want to see how the code evolved. We do NOT want to see the working code suddenly appear - this will make us suspicious.

We also require that all group members make substantial contributions to the project. This means that no group member should finish the project alone; distribute the work among all group members!
Gitlab has excellent tools to track that (see "Repository : Contributors"). At the end of Project 1 we will interview all group members and discuss their contributions, to see if we need to modify the score for certain group members.

When your project is done, please submit all the code including the framework to your remote GitLab repo by running the following commands.

  $ git commit -a
  $ git push origin master:master
  

Details of the files that you need to modify, and how to submit can be found in the Submission section.

So What Is This About?

In this part of the project, we will be writing an assembler that translates a subset of the RISC-V instruction set to machine code. Our assembler is a two-pass assembler similar to the one described in lecture. However, we will only assemble the .text segment. At a high level, the functionality of our assembler can be divided as follows:

Pass 1: Reads the input (.s) file. Comments are stripped, pseudoinstructions are expanded, and the address of each label is recorded into the symbol table. Input validation of the labels and pseudoinstructions is performed here. The output is written to an intermediate (.int) file. The symbol table is written independently to a (.symtbl) file.

Pass 2: Reads the intermediate file and translates each instruction to machine code. Instruction syntax and arguments are validated at this step. The instructions and symbol table are written to an object (.out) file.

The Instruction Set

Please consult the RISC-V Green Sheet for register numbers, instruction opcodes, and bitwise formats. Our assembler will support all 32 registers: x0, ra, sp, gp, tp, t0-t6, s0-s11, a0-a7. The name x0 can be used in lieu of zero. Other register numbers (e.g. x1, x2, etc.) are not supported.

We will have 18 instructions and 5 pseudoinstructions to assemble. The instructions are:

Instruction                   Format
Add                           add rd, rs1, rs2
Or                            or rd, rs1, rs2
Set Less Than                 slt rd, rs1, rs2
Set Less Than Unsigned        sltu rd, rs1, rs2
Shift Left Logical            sll rd, rs1, rs2
Add Immediate                 addi rd, rs1, immediate
Or Immediate                  ori rd, rs1, immediate
Load Upper Immediate          lui rd, immediate
Load Byte                     lb rd, offset(rs1)
Load Byte Unsigned            lbu rd, offset(rs1)
Load Word                     lw rd, offset(rs1)
Store Byte                    sb rs2, offset(rs1)
Store Word                    sw rs2, offset(rs1)
Branch on Equal               beq rs1, rs2, label
Branch on Not Equal           bne rs1, rs2, label
Branch on Less Than           blt rs1, rs2, label
Branch on Greater or Equal    bge rs1, rs2, label
Jump and Link                 jal label

The pseudoinstructions are:

Pseudoinstruction             Format
Load Immediate                li rd, immediate
Branch on Equal to Zero       beqz rs1, label
Move                          mv rd, rs1
Jump                          j label
Jump Register                 jr rs1

Hint: You may need to implement jalr before jr.

Implementation Steps

Step 0: Get Started

Follow the instructions above to obtain the framework. You can compile your code by typing make. At first, you may get a bunch of -Wunused-variable and -Wunused-function warnings. The warnings tell you that variables/functions were declared, but were not used in your code. Don't worry, as you complete the assignment the warnings will go away.

The files you need to implement are assembler.c and the source files in src/. Refer to the following steps for the requirements.

You can also test your code by running make check. Refer to Step 6: Testing for more information.

Step 1: Building Blocks

Finish the implementation of translate_reg() and translate_num() in src/translate_utils.c. translate_reg() is incomplete, so you need to fill in the rest of the register translations. You can find register numbers on the RISC-V Green Sheet. Unfortunately, there are no built-in switch statements for strings in C, so an if-else ladder is the way to compare multiple strings.
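As an illustration of the if-else ladder, here is a minimal sketch. The function name reg_number_sketch is hypothetical and is not the framework's translate_reg(); it covers only a handful of registers, and it assumes that both x0 and zero map to register 0 while other xN names are rejected, as described in The Instruction Set.

```c
#include <string.h>

/* Hypothetical helper (not the framework's translate_reg()):
   returns the register number for a few register names, or -1
   if the name is not recognized. */
int reg_number_sketch(const char *name) {
    if (name == NULL)                    return -1;
    if (strcmp(name, "x0") == 0)         return 0;
    else if (strcmp(name, "zero") == 0)  return 0;
    else if (strcmp(name, "ra") == 0)    return 1;
    else if (strcmp(name, "sp") == 0)    return 2;
    else if (strcmp(name, "t0") == 0)    return 5;
    else if (strcmp(name, "a0") == 0)    return 10;
    /* ... the remaining registers follow the same pattern ... */
    return -1;  /* unsupported names such as "x1" end up here */
}
```

With this sketch, reg_number_sketch("sp") yields 2 and reg_number_sketch("x1") yields -1.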

For translate_num(), you should use the library function strtol() (see its documentation). translate_num() should translate a numerical string (either decimal or hexadecimal) into a signed number, and then check to make sure that the result is within the bounds specified. If the string is invalid or outside of the bounds, return -1.
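The following is a minimal sketch of bounds-checked parsing with strtol(). The name parse_bounded_num and its parameter list are assumptions made for illustration; check src/translate_utils.c for the actual signature of translate_num(). Note that base 0 also accepts octal strings with a leading 0, which you may or may not want.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parses a decimal or 0x-prefixed hexadecimal
   string into *out. Returns 0 on success, or -1 if the string is
   invalid or the value lies outside [lower, upper]. */
int parse_bounded_num(long *out, const char *str, long lower, long upper) {
    char *end;
    long value;

    if (out == NULL || str == NULL || *str == '\0') return -1;

    errno = 0;
    value = strtol(str, &end, 0);       /* base 0: decimal, 0x hex, leading-0 octal */
    if (errno == ERANGE) return -1;     /* value does not fit in a long */
    if (end == str || *end != '\0') return -1;  /* no digits, or trailing junk */
    if (value < lower || value > upper) return -1;

    *out = value;
    return 0;
}
```

The end pointer is what lets you reject strings such as "12ab": strtol() stops at the first character it cannot parse, so a valid number must leave end pointing at the terminating '\0'.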

Step 2: SymbolTable

Define and implement a data structure to store symbol name-to-address mappings in src/tables.h and src/tables.c. Multiple SymbolTables may be created at the same time, and each must resize to fit an arbitrary number of entries (so you should use dynamic memory allocation). You may design the data structure in any way you like, as long as you do not change the function definitions. There is an incomplete SymbolTable struct defined in src/tables.h, and you must create your own implementation. Feel free to declare additional helper methods. See src/tables.c for details.

In add_to_table, you cannot simply store the character pointer that was given, as it could point to a temporary array. You must store a copy of that string instead. You should use the helper functions defined in src/tables.c whenever appropriate.

You must make sure to free all memory that you allocate. See the Valgrind section under testing for more information.
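One possible design is a growable array whose entries own copies of their names. The sketch below is only an illustration under that assumption: the types SketchEntry and SketchTable and all sketch_* functions are hypothetical and do not match the declarations in src/tables.h, which you must keep.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout; the framework's SymbolTable may differ. */
typedef struct { char *name; uint32_t addr; } SketchEntry;
typedef struct { SketchEntry *items; size_t len; size_t cap; } SketchTable;

SketchTable *sketch_create(void) {
    return calloc(1, sizeof(SketchTable));  /* empty table: no items yet */
}

/* Appends a copy of name; returns 0 on success, -1 on allocation failure. */
int sketch_add(SketchTable *t, const char *name, uint32_t addr) {
    size_t n;
    char *copy;

    if (t->len == t->cap) {                 /* full: double the capacity */
        size_t new_cap = t->cap ? t->cap * 2 : 4;
        SketchEntry *grown = realloc(t->items, new_cap * sizeof *grown);
        if (grown == NULL) return -1;
        t->items = grown;
        t->cap = new_cap;
    }
    n = strlen(name) + 1;
    copy = malloc(n);                       /* own a copy, not the caller's buffer */
    if (copy == NULL) return -1;
    memcpy(copy, name, n);
    t->items[t->len].name = copy;
    t->items[t->len].addr = addr;
    t->len++;
    return 0;
}

/* Returns the address of name, or -1 if it is not in the table. */
long sketch_lookup(const SketchTable *t, const char *name) {
    size_t i;
    for (i = 0; i < t->len; i++)
        if (strcmp(t->items[i].name, name) == 0) return (long)t->items[i].addr;
    return -1;
}

void sketch_free(SketchTable *t) {
    size_t i;
    if (t == NULL) return;
    for (i = 0; i < t->len; i++) free(t->items[i].name);
    free(t->items);
    free(t);
}
```

Because sketch_add() copies the string, the caller may reuse or overwrite its line buffer afterwards without corrupting the table, and sketch_free() releases every allocation so that Valgrind reports no leaks.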

Step 3: Instruction Translation

Implement translate_inst() in src/translate.c. The RISC-V Green Sheet will again be helpful, and so will bitwise operations.

translate_inst() should translate instructions to hexadecimal. Note that the function is incomplete. You must first fix the funct fields, and then implement the rest of the function. You will find the translate_reg(), translate_num(), and write_inst_hex() functions, all defined in translate_utils.h, helpful in this step. Some instructions may also require the symbol table, which is given to you by the symtbl pointer. This step may require writing a lot of code, but the code should be similar in nature, and therefore not difficult. The more important issue is input validation: you must make sure that all arguments given are valid. If an input is invalid, you should NOT write anything to output but return -1 instead.
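To show how bitwise operations assemble an instruction word, here is a minimal sketch for the R-type format (funct7, rs2, rs1, funct3, rd, opcode, from the most significant bit down). The function name encode_rtype_sketch is hypothetical and is not part of the framework.

```c
#include <stdint.h>

/* Hypothetical helper: packs the six R-type fields into a 32-bit word.
   Each field is masked to its width before being shifted into place. */
uint32_t encode_rtype_sketch(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode) {
    return ((funct7 & 0x7Fu) << 25)   /* bits 31..25 */
         | ((rs2    & 0x1Fu) << 20)   /* bits 24..20 */
         | ((rs1    & 0x1Fu) << 15)   /* bits 19..15 */
         | ((funct3 & 0x07u) << 12)   /* bits 14..12 */
         | ((rd     & 0x1Fu) << 7)    /* bits 11..7  */
         |  (opcode & 0x7Fu);         /* bits  6..0  */
}
```

For example, add t0, a0, a1 has rd = 5, rs1 = 10, rs2 = 11, funct3 = 0, funct7 = 0 and opcode 0x33, so encode_rtype_sketch(0, 11, 10, 0, 5, 0x33) gives 0x00B502B3.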

Use your knowledge about RISC-V instruction formats and think carefully about how inputs could be invalid. You are encouraged to use venus as a resource. Do note that venus has more pseudoinstruction expansions than our assembler, which means that instructions with invalid arguments for our assembler could be treated as a pseudoinstruction by venus. Therefore, you should check the text section after assembling to make sure that the instruction has not been expanded by venus.

If a branch offset cannot fit inside the immediate field, you should treat it as an error.
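A B-type instruction stores a 13-bit signed offset whose lowest bit is always 0, so the representable offsets are the even numbers from -4096 to 4094. The sketch below checks that range and scatters the offset bits into the instruction word; both function names are hypothetical, and how the offset is computed from the symbol table is left to your implementation.

```c
#include <stdint.h>

/* Hypothetical check: 1 if offset is representable in a B-type immediate. */
int branch_offset_fits(long offset) {
    return offset >= -4096 && offset <= 4094 && (offset % 2 == 0);
}

/* Hypothetical helper: packs a B-type instruction (opcode 0x63).
   The caller is expected to have validated offset beforehand. */
uint32_t encode_btype_sketch(uint32_t rs1, uint32_t rs2, uint32_t funct3,
                             long offset) {
    uint32_t imm = (uint32_t)offset;        /* two's-complement bit pattern */
    return (((imm >> 12) & 0x01u) << 31)    /* imm[12]   -> bit 31      */
         | (((imm >> 5)  & 0x3Fu) << 25)    /* imm[10:5] -> bits 30..25 */
         | ((rs2    & 0x1Fu) << 20)
         | ((rs1    & 0x1Fu) << 15)
         | ((funct3 & 0x07u) << 12)
         | (((imm >> 1)  & 0x0Fu) << 8)     /* imm[4:1]  -> bits 11..8  */
         | (((imm >> 11) & 0x01u) << 7)     /* imm[11]   -> bit 7       */
         | 0x63u;
}
```

For instance, bne t1, a2 with an offset of -8 (rs1 = 6, rs2 = 12, funct3 = 1) encodes to 0xFEC31CE3.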

Step 4: Pseudoinstruction Expansion

Implement write_pass_one() in src/translate.c, which should perform pseudoinstruction expansion on the load immediate (li), branch on equal to zero (beqz), move (mv), jump (j) and jump register (jr) instructions. The load immediate instruction normally gets expanded into an lui-addi pair. However, an optimization can be made when the immediate is small. If the immediate can fit inside the imm field of an addi instruction, we will use an addi instruction instead. Other assemblers may implement additional optimizations, but ours will not. For the mv instruction, use the fewest number of instructions possible. Also, make sure that your pseudoinstruction expansions do not produce any unintended side effects. You will also be performing some error checking on the pseudoinstructions (see src/translate.c for details). If there is an error, do NOT write anything to the intermediate file, and return 0 to indicate that 0 lines have been written.
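The li decision can be sketched as a split of the immediate into an upper 20-bit part for lui and a signed 12-bit part for addi. Because addi sign-extends its immediate, the upper part has to be incremented by one whenever bit 11 of the value is set. The helper name split_li_imm_sketch is hypothetical, and the operand format expected in the intermediate file is defined by the comments in src/translate.c and the reference outputs, not by this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: returns how many instructions li needs (1 or 2).
   For 1, *lower is the addi immediate. For 2, *upper is the 20-bit lui
   immediate and *lower is the signed 12-bit addi immediate. */
int split_li_imm_sketch(int32_t imm, uint32_t *upper, int32_t *lower) {
    uint32_t bits, low12;

    if (imm >= -2048 && imm <= 2047) {      /* fits in addi's imm field */
        *upper = 0;
        *lower = imm;
        return 1;
    }
    bits  = (uint32_t)imm;
    low12 = bits & 0xFFFu;
    /* Adding 0x800 rounds the upper part up when bit 11 is set, which
       compensates for addi sign-extending the lower 12 bits. */
    *upper = ((bits + 0x800u) >> 12) & 0xFFFFFu;
    *lower = (low12 >= 0x800u) ? (int32_t)low12 - 4096 : (int32_t)low12;
    return 2;
}
```

For the value 0xABCD this yields upper = 11 and lower = -1075, and indeed (11 << 12) - 1075 = 43981 = 0xABCD.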

Caution: Although jump and link and jump and link register are not pseudoinstructions themselves, the shorthand forms of these two instructions, i.e. jal label and jalr rs1, are pseudoinstructions. You should also expand them to the forms jal rd label and jalr rd rs1 imm.

Step 5: Putting It All Together

Implement pass_one() and pass_two() in assembler.c. In the first pass, the assembler will strip comments, add labels to the symbol table, perform pseudoinstruction expansion, and write assembly code into an intermediate file. The second pass will read the intermediate file, translate the instructions into machine code using the symbol table, and write it to an output file. Afterwards, the symbol table will be written to the output file as well, but that has been handled for you.

Before you begin, make sure you understand the documentation of fgets() and strtok(). It will be easier to implement pass_two() first. The comments in the function will give a more detailed outline of what to do, as well as what assumptions you may make. Your program should not exit if a line contains an error. Instead, keep track of whether any errors have occurred, and if so, return -1 at the end. pass_one() should be structured similarly to pass_two(), except that you will also need to parse out comments and labels. You will find the skip_comment() and add_if_label() functions useful.
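The general shape of such a loop is sketched below: read a line with fgets(), cut off the comment, split the rest with strtok(), and remember errors instead of returning early. Everything here is an assumption for illustration: the function name scan_lines_sketch, the buffer size, the argument limit and the delimiter set are hypothetical, labels are not handled, and the framework's own helpers and constants should be used in your real code.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_BUF_SIZE 1024   /* assumed maximum line length */
#define SKETCH_MAX_ARGS 3      /* assumed maximum number of arguments */

/* Hypothetical loop: returns the number of instruction lines read,
   or -1 if any line had too many arguments. */
int scan_lines_sketch(FILE *in) {
    const char *delims = " \t\r\n,()";     /* assumed token separators */
    char buf[SKETCH_BUF_SIZE];
    int had_error = 0;
    int inst_count = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        char *hash = strchr(buf, '#');
        char *name;
        int nargs = 0;

        if (hash != NULL) *hash = '\0';     /* strip the comment */
        name = strtok(buf, delims);         /* first token: instruction name */
        if (name == NULL) continue;         /* blank or comment-only line */
        while (strtok(NULL, delims) != NULL) nargs++;

        if (nargs > SKETCH_MAX_ARGS) {
            had_error = 1;                  /* record the error and keep going */
            continue;
        }
        inst_count++;
    }
    return had_error ? -1 : inst_count;
}
```

The important design point is the had_error flag: the loop always runs to the end of the input so that every error can be reported, and the failure is signalled only through the final return value.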

As an aside, our parser is much more lenient than an actual RISC-V parser. Building a good parser is outside the scope of this course, but we encourage you to learn about finite state automata if you are interested.

Line Numbers and Byte Offsets

When parsing, you will need to keep track of two numbers: the line number of the input file and the byte offset of the current instruction. Line numbers start at 1 and count every line of the input, including blank and comment-only lines. The byte offset refers to how far away the current instruction is from the first instruction, and is NOT advanced by blank or comment-only lines. You can think of the byte offset as where each instruction will be if the instructions were loaded into memory starting at address 0. See below for an example.

The address of a label is the byte offset of the next instruction. In the example below, L1 has an address of 4 (since the next instruction is lw, whose address is 4) and L2 has an address of 8 (since the next instruction is ori, whose address is 8).

Line #  Input File
1           addi t0 a0 0
2       L1: lw t1 0(t0)
3       # This is a comment
4       L2:
5           ori t1 t1 0xABCD
6           addi t1 t1 3
7
8           bne t1 a2 L2

Output File          Byte Offset
addi t0 a0 0         0
lw t1 0(t0)          4
ori t1 t1 0xABCD     8
addi t1 t1 3         12
bne t1 a2 L2         16
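The bookkeeping in this example can be sketched as follows. The function label_address_sketch is hypothetical and deliberately simplified: it assumes every instruction occupies 4 bytes (so it ignores pseudoinstructions that expand into two instructions) and it does no label validation, which add_if_label() does for you in the framework.

```c
#include <string.h>

/* Hypothetical helper: returns the byte offset assigned to label, or -1
   if the label is not defined. The input line number of lines[i] is i + 1. */
long label_address_sketch(const char *const *lines, size_t nlines,
                          const char *label) {
    size_t label_len = strlen(label);
    long byte_offset = 0;
    size_t i;

    for (i = 0; i < nlines; i++) {
        const char *p = lines[i] + strspn(lines[i], " \t");
        const char *hash = strchr(p, '#');
        const char *colon = strchr(p, ':');

        /* A colon before any comment marks a label at the start of the line. */
        if (colon != NULL && (hash == NULL || colon < hash)) {
            if ((size_t)(colon - p) == label_len &&
                strncmp(p, label, label_len) == 0)
                return byte_offset;         /* address of the next instruction */
            p = colon + 1;
            p += strspn(p, " \t");
        }
        /* Only lines that still contain an instruction advance the offset. */
        if (*p != '\0' && *p != '#' && *p != '\n' && *p != '\r')
            byte_offset += 4;
    }
    return -1;
}
```

Run on the eight input lines above, this sketch returns 4 for L1 and 8 for L2, matching the addresses given in the text.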

Error Handling

If an input file contains an error, we only require that your program print the correct error messages. The contents of your .int and .out files do not matter.

There are two kinds of errors you can get: errors with instructions and errors with labels. Error checking of labels is done for you by add_if_label(). However, you will still need to record that an error has occurred so that pass_one() can return -1.

In pass_one(), errors with instructions can be raised by 1) write_pass_one() or 2) the instruction having too many arguments. In pass_two(), errors with instructions will only be raised by translate_inst(). Both write_pass_one() and translate_inst() should return a special value (0 and -1 respectively) in the event of an error. You will need to detect whether an instruction has too many arguments yourself in pass_one().

Whenever an error is encountered in either pass_one() or pass_two(), record that there is an error and move on. Do not exit the function prematurely. When the function exits, return -1.

For information about testing error messages, please see the "Error Message Testing" section under "Running the Assembler".

Step 6: Testing

You are responsible for testing your code. While we have provided a few test cases, they are by no means comprehensive. Fortunately, you have a variety of testing tools at your service.

Our Testing Script

We have provided you with a testing script. Run make check, and it will check your outputs, run memory leak detection, and print out a list of the results.

In the list, [pass] means you passed the test, [FAIL] means either your output is not correct or your program has memory leaks, [----] means that memory leak detection is not performed because your outputs are not correct.

There are 3 types of tests: testcases under test/in/p1 test your pass one implementation, testcases under test/in/p2 test your pass two implementation, and testcases under test/in/full test your full assembler implementation.

To add your own testcases, for example a testcase test.s for full assembler behavior, you should:

1. Put test.s under test/in/full

2. Put your desired outputs test.int test.symtbl test.out under test/ref/full

3. Modify Makefile under test/, append " test" in variable FULL_TESTS

4. Modify test.py under test/, append 'test' in the list of 'full' in variable TESTS

Valgrind

You should use Valgrind to check whether your code has any memory leaks. The testing script will automatically do memory leak detection for you. For example, if your testcase is test/in/full/simple1.s, the Valgrind output is written to test/out/full/simple1.memcheck.

If you want to run Valgrind manually, use the following command:

valgrind --tool=memcheck --leak-check=full --track-origins=yes <whatever program you want to run>

For example, if you want to see whether running ./assembler -p1 test/in/p1/simple.s test/out/p1/simple.int test/out/p1/simple.symtbl would cause any memory leaks, you should run valgrind --tool=memcheck --leak-check=full --track-origins=yes ./assembler -p1 test/in/p1/simple.s test/out/p1/simple.int test/out/p1/simple.symtbl.

venus

Since you're writing an assembler, why not refer to an existing assembler? venus is a powerful reference for you to use, and you are encouraged to write your own RISC-V files and assemble them using venus.

Warning: in some cases the output of venus will differ from the specifications of this project. You should always follow the specs. This is because venus 1) supports more pseudoinstructions, 2) has slightly different pseudoinstruction expansion rules, and 3) acts as an assembler and linker. You should always examine the assembled instructions carefully when testing with venus.

Diff

diff is a utility for comparing the contents of files. Running the following command will print out the differences between file1 and file2:

diff <file1> <file2>

To see how to interpret diff results, consult the diff documentation. We have provided some sample input-output pairs (again, these are not comprehensive tests) located in the test/in and test/ref directories respectively. For example, to check the output of running test/in/full/simple1.s on your assembler against the expected output, run:

./assembler test/in/full/simple1.s test/out/full/simple1.int test/out/full/simple1.symtbl test/out/full/simple1.out
diff test/out/full/simple1.out test/ref/full/simple1.out

The testing script will also automatically run diff for you.

Running the Assembler

First, make sure your assembler executable is up to date by running make.

By default, the assembler runs two passes. The first pass reads an input file and translates it into an intermediate file. The second pass reads the intermediate file and translates it into an output file. To run both passes, type:

./assembler <input file> <intermediate file> <symbol table file> <output file>

Alternatively, you can run only a single pass, which may be helpful while debugging. To run only the first pass, use the -p1 flag:

./assembler -p1 <input file> <intermediate file> <symbol table file>

To run only the second pass, use the -p2 flag. Note that when running pass two only, your symbol table will be empty since labels were stripped in pass_one(), so it may affect your branch instructions.

./assembler -p2 <intermediate file> <symbol table file> <output file>

When testing cases that should produce error messages, you may want to use the -log flag to log error messages to a text file. The -log flag should be followed with the location of the output file (WARNING: old contents will be overwritten!), and it can be used with any of the three modes above.

Error Message Testing

We have provided two tests for error messages, one for errors that should be raised during pass_one(), and one for errors that should be raised during pass_two().

To check whether your error messages match the desired ones, simply run make check and check the results for the full/p1_errors and full/p2_errors testcases.

Your intermediate and output files (.int, .symtbl and .out files) do NOT need to match the reference output if the input file contains an error.

You can also compare the testcases in test/in/full/p*_errors.s, your output in test/out/full/p*_errors.log and the reference output in test/ref/full/p*_errors.log (* is 1 or 2).

Note that in the reference p2_errors.log, the first line is an error raised in pass one (think about why), and the other errors are raised in pass two.

Notes regarding grading

How much will I need to write

Here is a summary of the solution code. The final row gives total lines inserted and deleted; a changed line counts as both an insertion and a deletion. However, there are many possible solutions and many of them may differ.


      assembler.c           | 111 +++++++++++------
      src/tables.c          |  93 +++++++++++++-
      src/tables.h          |   2 +-
      src/translate.c       | 198 +++++++++++++++++++++++++-----
      src/translate_utils.c | 145 +++++++++++++++++-----
      src/translate_utils.h |  57 ++++++++-
      6 files changed, 494 insertions(+), 112 deletions(-)
  

Submission

You should submit the same autolab.txt that is in your gitlab repo to Autolab.

The directory tree of your gitlab repo should look like the following:

You can leave the test folder and the Makefile. Autolab will replace them with the real testcases.


    |--- src
    |     |-- tables.c
    |     |-- tables.h
    |     |-- translate.c
    |     |-- translate.h
    |     |-- translate_utils.c
    |     |-- translate_utils.h
    |     |-- utils.c
    |     |-- utils.h
    |--- test (optional)
    |--- assembler.c
    |--- assembler.h
    |--- Makefile (optional)
    |--- autolab.txt
    

Autolab Results

Tests 0-x are the results for testcase x, which tests pass one only.

Tests 1-x are the results for testcase x, which tests pass two only.

Tests 2-x are the results for testcase x, which tests the full assembler.

Tests x-y-mem are the results of memory leak detection for testcase x-y.