447 Project Guidelines
Overview
The year is 2046. As a result of many years of peace and cooperation, the international community has invested in a massive expansion of the International Space Station. On a given day, hundreds of people are in orbit. Sophisticated tools have been developed to enable high-speed, private communication among individuals, and also between the station and Earth. You have been hired to create a system that will allow astronauts to send natural language messages without talking, using advanced eye-tracking hardware. (The interface for presenting incoming messages to the astronaut is not your responsibility.)
The system works by supporting, at each time step, the astronaut’s choice of the next character in the message sequence. The system displays three Unicode characters in the astronaut’s visual field; they select one, or, if the next character they need is not among the three, use a more complicated (and slow) secondary procedure to choose another option. Naturally, communication will be faster (and require less work on the part of the astronaut) if the system can accurately predict the next character of the intended message and place it among the top three. (The implementation of the secondary selection procedure is not your responsibility.)
Overall, your system’s goal is to save astronaut time. There are two parts to this goal, which may be in opposition to each other. You should aim to minimize:
- Processing time: after the astronaut chooses the ith character, your system should choose the top three candidates for the (i + 1)th character as quickly as it can.
- Error: as often as possible, the astronaut’s next choice should be among the three candidates. We will count the fraction of times it is not in the top three and refer to it as the error rate.
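Concretely, if the astronaut enters N characters in total and M of those characters were not among your three candidates when they were entered, the error rate is M / N; for example, missing 12 characters out of 200 gives an error rate of 0.06.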
In the simulation test at the end of the quarter, we will measure both the processing speed and accuracy of your system on real data sent by real simulated astronauts.
Additional Notes
- The astronaut will speak at least one human language, but you don’t know which one(s). The secondary selection procedure allows the astronaut to choose any Unicode character; you can imagine that the cost of navigating through it will be very high. Which Unicode characters you allow as candidates is up to you: if you are too restrictive, your system might make more errors on messages in some languages, but if you try to allow all Unicode characters at every time step, your processing time might suffer.
- The specification for your program is quite simple, alternating between two steps:
- At the start of an iteration (including the beginning of execution), your program outputs three characters, representing a prediction that the astronaut will choose one of them.
- The program waits until the astronaut enters the next character of their message, which will arrive on standard input. For the purposes of measuring processing time, the clock starts when this character is entered and stops when your three candidates for the following position are output. (A minimal sketch of this loop appears after this list.)
- How you build your system is entirely up to you. You are free to use any resources (code or data) that are available to you. Obviously, you must not violate any laws or terms of service.
- There is one caveat: you may not simply call pre-existing model APIs for your implementation. You may fine-tune an open source model if you wish, but you cannot use any pre-existing models out of the box, especially if they require a paid subscription that the graders may not have access to. If you have any questions about what this means, please ask the course staff.
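To make the alternation concrete, here is a minimal sketch of the loop, assuming each chosen character arrives on its own line of standard input and the three candidates are written as a single line to standard output (check the course repository for the exact I/O format). The predict_next function is a hypothetical placeholder for your model.

```python
import sys

def predict_next(context):
    # Placeholder ranking: always propose space, 'e', and 't'.
    # Replace this with your model's top-3 next-character prediction.
    return [" ", "e", "t"]

def main():
    context = ""
    while True:
        # Step 1: output three candidate characters for the next position.
        print("".join(predict_next(context)), flush=True)
        # Step 2: wait for the astronaut's actual next character on standard input.
        line = sys.stdin.readline()
        if not line:  # end of the message stream
            break
        context += line.rstrip("\n")

if __name__ == "__main__":
    main()
```

Note that flushing standard output promptly matters here, since the processing-time clock only stops once your three candidates arrive.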
The project is meant to be completed by teams of three people.
We record the time it takes to run the docker run command (which executes your predict.sh file). It should produce a valid pred.txt given a test set similar to the example input in the repository. We have set a timeout of 60 minutes.
To avoid errors, please test your code from scratch on a different system and ensure that everything works as expected; this is required for full credit. We will likely use an NVIDIA L4 virtual machine, so you may use your Google Cloud credits to verify that your prediction code runs on that VM without unexpected errors.
We recommend the following best practices while coding:
- Ensure that your code does not explicitly rely on a specific type of GPU/CPU. Check for GPU availability and use .to(device) for all relevant tensors and models; make sure the GPU is actually being used by checking nvidia-smi.
- Process the test data in mini-batches, if possible.
- Use try/except statements. For example, if a batch throws an error due to something odd in a particular sample, you can predict a random character in the except branch and move on to the next batch. This way you still earn points for your correct predictions instead of losing all points because the code terminated with an error. (A sketch combining these practices appears after this list.)
- Please avoid old versions of PyTorch (1.x) and CUDA (10.x); they often conflict with GPU machines, and we might otherwise have to run your code on a CPU. Use newer versions of PyTorch.
- Ensure that your Docker image does not install unnecessary packages that inflate its size.
- Ensure that your predict.sh file does not call any training-related functions. It should be solely an inference script that loads your model and performs prediction on the test set.
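A compressed sketch that ties these practices together, assuming a trained PyTorch model class MyCharModel, saved weights in model.pt, per-example next-character logits, and an iterable test_batches of preprocessed input tensors (all hypothetical names, not part of the spec):

```python
import torch

# Pick the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = MyCharModel()  # hypothetical: your trained model class
model.load_state_dict(torch.load("model.pt", map_location=device))
model.to(device)
model.eval()

VOCAB = list(" etaoinshrdlu")  # illustrative character inventory; use your own
FALLBACK = [" ", "e", "t"]     # safe default candidates if a batch fails

predictions = []
with torch.no_grad():
    for batch in test_batches:  # assumed: mini-batches of preprocessed input tensors
        try:
            logits = model(batch.to(device))       # shape (batch_size, len(VOCAB)) assumed
            top3 = logits.topk(3, dim=-1).indices  # top-3 character indices per example
            predictions.extend([[VOCAB[i] for i in row] for row in top3.tolist()])
        except Exception:
            # Don't let one odd sample end the run: emit fallback candidates
            # for every example in this batch and move on to the next one.
            predictions.extend([FALLBACK] * len(batch))
```

How these per-position candidate lists get written into pred.txt depends on the format specified in the course repository.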
Deliverables and Deadlines
At each checkpoint, you will turn in all source code and an executable program or script. Please sign up for your groups and submit your checkpoints on Canvas. When submitting your project, please follow the instructions and specifications in this GitHub repo. The checkpoints are as follows; deadlines for each checkpoint are shown on the course calendar.
Checkpoint 1 (Jan 27)
- Program runs to spec (no error or processing speed measurements).
- Graded only on turning in the program on time and running to spec.
- Submit a short document that contains the following:
- Dataset: what kind of data are you going to use to train your model, and how will you obtain this data?
- Method: what kind of method will you use, and how will you implement it (e.g. language, framework)?
Checkpoint 2 (Feb 12)
- Program runs to spec; we’ll measure error and processing time (and report them back to you along with your rank among teams in the class on both measurements).
- Graded only on turning in the program on time and running to spec.
- Your team will get a bonus point if your system is on the Pareto frontier (i.e., there may be systems with lower error or lower processing time than yours, but no system with both).
Checkpoint 3 (Feb 26)
- Program runs to spec; we’ll measure error and processing time (and report them back to you along with your rank among teams in the class on both measurements).
- Must show improvement over Checkpoint 2 (either an increase in success rate or a reduction in processing time, or both).
- We expect that you will not intentionally game the grading system by making random predictions just to reduce processing time. If your processing time decreases, your success rate should remain reasonably close to that of your previous checkpoint (a reduction of no more than ~20%).
- Your team will get a bonus point if your system is on the Pareto frontier (i.e., there may be systems with lower error or lower processing time than yours, but no system with both).
- Submit a short document that contains the following:
- Thoughtful use of AI: If you didn’t know how to code, how would you use existing LLMs to accomplish this task? Write a brief (1-paragraph) description of how this could be done.
- Benchmarking: Implement your ideas on using existing models and write a brief (1-2 paragraph) report on the results. You will need to use more than just the example data given in the GitHub repo for full credit.
Checkpoint 4 (Mar 12)
- Program runs to spec; we’ll measure error and processing time (and report them back to you along with your rank among teams in the class on both measurements).
- The course staff will give up to three bonus points to systems on the Pareto frontier (i.e., there may be systems with lower error or lower processing time than yours, but no system with both).
- At Checkpoint 4, you will also turn in a report on your project. Please note the following:
- The report must be no more than one page (references don’t count against the page, figures/tables do), letter size, 1-inch margins, 11-point Times font, submitted as a pdf.
- Describe your approach, making use of concepts and methods learned in class.
- Describe the data you collected, existing datasets you used, and existing code libraries or packages you used. If there’s any question at all about how to acknowledge the work of others, talk to the course staff and we’ll be happy to help.
- The course staff may offer bonus points to exceptionally well-written reports. The course staff will also award up to three bonus points for well written literature reviews and data statements (more about this below).
Grades
By default, all members of your team will share the same grade, counted out of 50 points.
Checkpoint 1 Rubric (5 points)
Program
| Successful run | Error | No submission |
|---|---|---|
| 1 pt | 0.5 pts | 0 pts |
Dataset
| Full marks | Insufficient details | No submission |
|---|---|---|
| 2 pts | 1 pt | 0 pts |
Method
| Full marks | Insufficient details | No submission |
|---|---|---|
| 2 pts | 1 pt | 0 pts |
Checkpoint 2 Rubric (5 points, 1 bonus point possible for Pareto frontier)
Code outputs valid pred.txt
| Pass | Timeout | Error | No submission |
|---|---|---|---|
| 5 pts | 4 pts | 3.5 pts | 0 pts |
Checkpoint 3 Rubric (8 points, 1 bonus point possible)
Code outputs valid pred.txt
| Improved over Checkpoint 2 | No Improvement | Timeout | Error | No submission |
|---|---|---|---|---|
| 3 pts | 2 pts | 1 pt | 0.5 pt | 0 pts |
Thoughtful AI
| Full marks | Insufficient details | No submission |
|---|---|---|
| 2 pts | 1 pt | 0 pts |
Benchmarking
| Full marks | Detailed, but only used GitHub example data for testing | Used other data, but insufficiently detailed report | Example data and insufficient details | No submission |
|---|---|---|---|---|
| 3 pts | 2 pts | 2 pts | 1 pt | 0 pts |
Checkpoint 4 (final test) Rubric (12 points, 3 bonus points possible)
Code outputs valid pred.txt
| Pass | Timeout | Error | No submission |
|---|---|---|---|
| 12 pts | 8 pts | 5 pts | 0 pts |
Bonus Points
| Pareto Frontier | Above Average | Improved over Checkpoint 3 | Otherwise |
|---|---|---|---|
| 3 pts | 2 pts | 1 pt | 0 pts |
Final Writeup Rubric (20 points, 6 bonus points possible)
Approach
| Strong understanding: clearly describes the approach, using relevant concepts and methods learned in class | Missing details: e.g., the motivation for the method used, which methods were tried, etc. | Limited/unclear: missing key methods or concepts | Not included |
|---|---|---|---|
| 10 pts | 6 pts | 4 pts | 0 pts |
Dataset
| Full description of datasets: including sources, collection methods, or choice of existing datasets | Partial description of datasets: some missing details or clarity | Not included |
|---|---|---|
| 5 pts | 2.5 pts | 0 pts |
Packages & Libraries
| Complete: Lists all relevant libraries/packages used | Incomplete/Ambiguous | Not included |
|---|---|---|
| 5 pts | 2.5 pts | 0 pts |
Bonus Points: Literature Review
| Full effort: Submitted detailed responses and extra effort in technique comparison/writing | Included | No submission |
|---|---|---|
| 2 pts | 1 pt | 0 pts |
Bonus Points: Data Statement
| Full effort: Submitted detailed responses and extra effort in data collection/writing | Included | No submission |
|---|---|---|
| 2 pts | 1 pt | 0 pts |
Bonus Points: Report
| Top report: Clear, well-structured, and insightful writing that reflects research or novel ideas | Strong report: High effort, clear and well-structured writing, with minor areas for improvement | Good report: Well-written but could benefit from improved clarity or depth | No bonus |
|---|---|---|---|
| 2 pts | 1 pt | 0.5 pts | 0 pts |
Individual Report
Students in this course are expected to work together professionally, overcoming the inevitable challenges that arise in the course of a team project. We recognize that, occasionally, team members behave unreasonably. To help us navigate situations where you feel a shared grade would be unfair, we invite you to submit individual updates on your team’s progress at any time during the quarter using this form.
Literature Review (Extra Credit)
For extra credit in the final checkpoint, your team may submit a Literature Review, which will require you to explore current NLP literature. Your literature review will consist of a document that thoughtfully reviews at least one paper published at a major NLP or AI venue (ACL, EMNLP, NAACL, EACL, COLM, NeurIPS, AAAI, etc.) in the past 3 years and includes the following information. We will award bonus points based on effort.
- A brief description of the paper(s) you chose: What is the title, where are the authors from, and when and where was it published?
- Similarities and differences: What was the task studied in the paper? How was it similar to the astronaut task, and how was it different?
- Techniques: How did the authors go about solving the problem they presented? Is there anything they did that will be helpful in your work?
- Was it easy to find a paper on a similar topic? Why do you think that was (or was not) the case? If the restriction to papers from the past 3 years were removed, do you think it would have been easier or harder?
- Any other thoughts you have on the paper?
Data Statement (Extra Credit)
For extra credit in the final checkpoint, your team may submit a Data Statement, which will require you to examine your data sources. Your data statement will consist of a document that answers the following questions. We have included examples in each question below to guide your responses. If you are unable to measure a particular characteristic of your dataset, clearly say so (and explain why it is difficult), and suggest possible way(s) you could measure that characteristic if you had infinite resources. We will award bonus points based on effort.
- Data source: What is the source of your data? (dataset from X, crawled from website Y, curated by research group Z, machine generated, etc.)
- Data size: What is the (total) size of your data? What is the data size for each language?
- Preprocessing procedure: How is the data preprocessed? Please also describe any steps of data cleaning/filtering.
- Curation rationale: Which texts were included and what were the goals in selecting these texts (for each language)? For example, you may write something like “we included books from genres A, B, and C because we believe they will cover a large set of vocabulary.”
- Language variety: What languages do your texts include? Give English descriptions along with ISO 639-1 language codes for each language in your texts. For example, if your text contains English and German, you would write: “English (en), German (de).”
- Speaker demographics: Who produced the texts in your data? Include as much information as you can for each of these items: age, gender, race/ethnicity, native language, socioeconomic status, number of speakers. For example, if your data is crawled from Twitter, you should do your best to ascertain the age, gender, and language distributions for that platform’s users (and of course, cite your sources).
- Speech situation: In what situations are the texts being produced? Details may include: modality (spoken/signed, written), scripted/edited vs. spontaneous, intended audience.
- Text characteristics: What are the genres and topics of the texts? For example, you may write something like “scientific fiction books,” “biomedicine journal articles,” “comedy movie scripts,” etc.
- Ethical considerations: What ethical concerns are associated with your dataset and/or curation procedure? For example, you may mention that you did not receive consent from users to use their social media posts, but that you have curated the data in accordance with the platform’s terms of service (be sure to check!).