517 Project Guidelines
Overview
The project for this course is designed to give you an opportunity (1) to engage with the current state of NLP research and (2) to contribute useful knowledge to members of that community. Specifically, your team will choose one paper from a recent NLP conference, EMNLP 2025, and attempt to reproduce its experiments. (The paper you choose must be from either the main EMNLP conference or the Findings volume. The ACL Anthology URL for a suitable paper will contain the string “main” or “findings”; if in doubt, check with course staff.) In some cases, this may be a straightforward task. In some cases, it may be impossible, for a range of reasons.
Team Composition
Your team should be composed of three people. We recommend you include (a) a CSE PhD student actively researching in NLP, (b) a CSE PhD student from some other research area, and (c) someone who is not a CSE PhD student. Diverse teams are stronger teams; working with people whose perspectives are different from yours (both in that particular dimension and in other dimensions) gives you new opportunities to learn.
Project Outcomes
Your project will do one of the following:
- Reproduce the main experiments in the paper. Your report will assess the ease of reproducibility, with respect to the checklist we provide below. In addition, you should attempt at least one additional experiment that isn’t in the paper, but that you are able to conduct after having successfully reproduced the main results. For example, you could assess the sensitivity of the model to one or more hyperparameters or to the amount of training data, or measure the variance of the evaluation score due to randomness in initial parameters.
- Report on your failed attempt to reproduce the paper’s main experiments. Not all papers are easily reproducible. Failing to reproduce the exact results of the original paper is not necessarily a bad thing, as long as your experiments are rigorous. Your report should identify all the questions that would need to be answered to reproduce the experiments, or discuss how the findings appear to be in error (if that is what you discover).
Both outcomes are acceptable and can earn full credit.
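As an illustration of the seed-variance experiment suggested above, a few lines of code are enough to summarize an evaluation metric across runs that differ only in their random seed. This is a minimal sketch; the function name and the metric values are hypothetical.

```python
import statistics

def summarize_seed_runs(scores):
    """Summarize an evaluation metric measured across runs that differ
    only in their random seed (initial parameters, data order, etc.)."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical dev-set F1 scores from five runs with different seeds
print(summarize_seed_runs([88.1, 87.6, 88.4, 87.9, 88.0]))
```

Reporting the spread (not just the best run) is exactly the kind of detail that makes a reproduction report useful to others.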
Choosing a Paper
Some considerations in choosing a paper to reproduce:
- You should find the problem tackled in the paper interesting.
- You should be able to access the data you will need to reproduce the paper’s experiments.
- In many cases, the authors may have made code available; this may be a blessing or a curse. You should definitely peruse a paper’s codebase before deciding on that paper.
- You should estimate the computational requirements for reproducing the paper and take into account the resources available to you for the project. Some authors will have had access to infrastructure that is way out of your budget; don’t choose such a paper.
- Do not choose a paper you co-authored or whose authors are collaborators of anyone on your team.
Resources
Some potentially useful links:
- Reproducibility checklist from recent NeurIPS, used as part of the reviewing process for that conference
- ML code completeness checklist
- ML reproducibility tools and best practices
Experiment Record Template
Well before the version 1 deadline, it would be useful for your group to put together a template which has fields for all of the information you want to record from every experiment. (This could be in the form of a spreadsheet, or a rough placeholder draft with information that you fill in as you go, etc.) You should design this so that once it’s filled out for all of your proposed experiments, you will meet the overall requirements for the project (described at the end of this document).
For example, a group which is going to show training curves as their additional experiment would want a template which has fields for at least:
- Training curves showing dev.-set evaluations every X training steps
- Learning rate, batch size, dropout, size of their model, etc. (they shouldn’t have “etc.” here; they should be very specific)
- Clear connection showing which hypothesis this experiment will support
- Information on the data
- Computational requirements
In other words, the list that you construct should consist of the items from the numbered list in the project report description that are specific to your project, with one template filled out for each experiment you run, including baselines.
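One way to sketch such a template is as a small data structure with one instance per experiment. The field names below are purely illustrative; replace them with whatever your experiments actually require.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """One record per experiment run, including baselines.
    All field names here are examples, not a required schema."""
    hypothesis: str        # which hypothesis this run supports
    dataset: str           # source, split sizes, preprocessing notes
    learning_rate: float
    batch_size: int
    dropout: float
    model_size: str        # e.g. hidden size / number of parameters
    gpu_hours: float = 0.0             # estimated, then actual, compute cost
    dev_scores: list = field(default_factory=list)  # dev-set metric over training
    notes: str = ""

# Hypothetical example record
record = ExperimentRecord(
    hypothesis="H1: model X outperforms baseline Y on task Z",
    dataset="example-dataset (train/dev/test = 10k/1k/1k)",
    learning_rate=3e-4, batch_size=32, dropout=0.1, model_size="110M params",
)
record.dev_scores.append(84.2)
```

A spreadsheet with one column per field works just as well; the point is that every run, including baselines, gets the same record filled out.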
Deliverables and Deadlines
Detailed instructions for the report are given in the LaTeX template. It is imperative that you follow the instructions in the template carefully for versions 1 and 2; it is strongly recommended that you familiarize yourself with the template before writing the proposal.
The deadlines for each deliverable are shown on the course calendar.
Proposal (Jan 27)
Your proposal should be one page and briefly include:
- A (bibtex) citation of the paper whose experiments you plan to reproduce, with a URL
- The hypotheses in the paper you plan to verify by reproducing experiments
- A short description of whether and how you can access the data used in the paper
- Whether you will use the existing code (in that case, a link to the code) or implement yourself
- A discussion of the feasibility of the computation you will need to do (essentially, an argument that the project will be feasible)
- Your minimal viable action plan
- Your stretch goals
Your estimates do not have to be exact, but you will get more useful feedback if you include specific numbers.
There is no specific template for the proposal.
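For the citation item, a BibTeX entry of roughly the following shape would suffice. Every field value here is a placeholder, not a real paper:

```bibtex
@inproceedings{author2025example,
  title     = {An Example Paper Title},
  author    = {Author, First and Coauthor, Second},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods
               in Natural Language Processing (EMNLP)},
  year      = {2025},
  url       = {https://aclanthology.org/...}
}
```

The ACL Anthology provides ready-made BibTeX entries for published papers, so in practice you can copy the entry rather than write it by hand.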
Versions 1 (Feb 12) and 2 with Final Report (Mar 12)
Fill out each section in the template by replacing the instructions with the actual content. You must follow this template for both versions 1 and 2 (final report). The final report must not exceed 8 pages, excluding references and the one-page summary that comes first.
For version 1 (due Feb 12), you need only complete the following sections, and put a placeholder (TODO) for the other sections:
- Introduction
- Scope of reproducibility
- Methodology
- Model description
- Data description
- Implementation (you only need to say whether you will use existing code or implement yourself)
- Computational requirements (you only need to include an estimate)
For the final report (due at the end of the quarter), all sections in the template must be filled in.
Policies and Submission Notes
- Grades are shared by your team. Students in this course are expected to work together professionally, overcoming the inevitable challenges that arise in the course of a team project. We recognize that, occasionally, team members behave unreasonably. To help us navigate situations where you feel a shared grade would be unfair, we invite you to submit individual updates on your team’s progress at any time during the quarter using this form.
- No late submissions are allowed. Your team will receive zero points for a late submission.
- Instead of submitting code, set it up as a public GitHub repository and add the link to the project report. If you write your own code, make sure it is documented and easy to use (this project is about reproducibility!). Include a link to a GitHub repository that can be installed and run with a few lines of bash on department machines, along with a description of how difficult the algorithms were to implement. If you use public code from the original repository, more of your energy will go into running additional experiments, such as hyperparameter optimization, ablations, or evaluation on new datasets (see below). However, note that it’s not always trivial to get a public code release working!
- You may include an appendix in your final report. However, you should include all the important details in the main paper. The appendix is allowed so that your report will be helpful to future researchers; it will not be read by the course staff.
- Submit your reports through Canvas, as a single pdf per submission, submitted once by a single team member.
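The “few lines of bash” expected in your repository’s README might look like the following. This is a sketch of a typical layout only; the repository URL, script names, and config paths are all hypothetical and should match whatever your project actually provides.

```shell
# Clone and install (repository URL and requirements are hypothetical)
git clone https://github.com/your-team/reproduction-project.git
cd reproduction-project
pip install -r requirements.txt

# Download data, train, and evaluate with the provided scripts
bash scripts/download_data.sh
python train.py --config configs/reproduction.yaml
python evaluate.py --checkpoint checkpoints/best.pt
```

If any of these steps requires manual intervention (accounts, licenses, manual downloads), say so explicitly in the README rather than leaving the reader to discover it.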
Grades
The project is worth 50% of your course grade, allocated as follows:
- The proposal is worth 10% of the project grade, 5% of the course grade.
- Version 1 is worth 25% of the project grade, 12.5% of the course grade.
- Version 2 is worth 65% of the project grade, 32.5% of the course grade.
You can find the grading rubric below.
Rubric
Proposal (5pt)
You need to have all seven items listed in the project instructions (your minimal viable action plan, your stretch goals, a citation to the original paper, the hypotheses to be tested, a description of how you will access the data, whether you will use the existing code or not, and a discussion of the feasibility of the computation).
- -1pt for each item that is missing
- -1pt if it is over 2 pages
Report - Version 1 (10pt)
Follow the final report template and fill out the sections. Some sections must be filled in (listed below; this is to ensure that you are on track), but the rest do not have to be completed. For the sections that are not completed, write a placeholder (e.g. “TODO”) to indicate that you will complete them in the final report.
- Completion of the following sections (8pt)
- Introduction (2pt)
- Scope of reproducibility (2pt)
- Methodology
- Model description (1pt)
- Data description (1pt)
- Implementation - you only need to say whether you will use existing code or implement it yourself (1pt)
- Computational requirements - you only need to include an estimate (1pt)
- A placeholder for all sections that are not completed (2pt)
Report - Final (100pt)
Note: 120 total points are possible, and your final score will be the minimum of 100 and your earned points. In this way, you can get a full score even if you lose some points.
One-page summary (5pt)
- Include the following items (5pt)
- Motivation
- Scope of Reproducibility
- Methodology
- Results
- What was Easy
- What was Difficult
- Communication with Original Authors
- -4pt for going over one page
Introduction (5pt)
- A clear, high-level description of what the original paper is about, what its contributions are, and why it is worthy of a reproducibility attempt (briefly motivate the work). (5pt)
Scope of reproducibility (12pt)
Write the report to be self-contained; assume the reader doesn’t have the original paper fully in their mind when they read your report. Your report needs to give enough of a summary that everything that follows will make sense.
- Formatting (4pt)
- Full score: The hypotheses tested in your report are written as a list, either a list environment (preferably numbered) or a numbered list within a paragraph. (4pt)
- -1pt if written as a paragraph without numbering.
- -2pt if hypotheses are not clear and specific.
- Content (8pt)
- Full score: At least one of the hypotheses you list was a central claim in the original paper, and all hypotheses are supported by experiments in your report. (8pt)
- -4pt if no hypotheses you list were a central claim in the original paper.
- -4pt if in your report, you don’t test all of the hypotheses that you listed.
Methodology (45+5pt)
Note that some of these elements may not be relevant or important for some papers (e.g., some papers aren’t about modeling). Points won’t be taken off for elements that aren’t necessary in the context of your work, but you should make it clear to the reader why these elements are missing.
- Model description (5pt)
- -2pt deducted for any missing items, -1pt deducted for described but unclear items:
- Citation or link to the model
- Model architecture
- Training objective
- # of parameters
- Other important details, such as which pretrained model is used, etc
- Dataset description (5pt)
- -2pt deducted for any missing item:
- Citation or link to the data
- Source of the data (e.g. if they are annotated, brief description of how)
- Statistics (dataset size, dataset split, label distribution, etc.)
- How you split the dataset into training, validation, and test sets (for example, if you do not have a validation set, no points for this item)
- Hyperparameters (5pt)
- Report hyperparameters including learning rate, dropout, hidden size, etc. (Even if you’re using a standard model, it’s still good to summarize the details for the reader). (5pt)
- -3pt for missing crucial hyperparameters from the paper.
- Implementation (20pt)
- Note: code should include everything necessary to reproduce the original paper AND your paper (for experiments you did beyond the original paper, you need to provide the code as well).
- If you wrote your own code entirely (20pt)
- Link to your GitHub repo (5pt)
- [In your GitHub repo] Code is documented, complete, and easy to use. (15pt)
- -2pt deducted for each missing item:
- Dependencies
- Data download instructions
- Preprocessing code + command
- Training code + command
- Evaluation code + command
- Pretrained model (if applicable)
- If at least some existing code from the original authors was used (20pt)
- Link to the original paper’s repo (5pt)
- Link to your GitHub repo as well, if you wrote any extra code
- [In your GitHub repo] Additional instructions to reproduce the original paper and your paper (15pt)
- -2pt deducted for each missing item:
- Dependencies
- Data download instructions
- Preprocessing code + command
- Training code + command
- Evaluation code + command
- Pretrained model (if applicable)
- This means that if some commands are missing from the original paper’s code, you will have to write them.
- If no additional instructions were necessary, you must state that.
- Computational requirements (10 + 5pt)
- Report on the computational requirements: both your estimate before running the experiments, and the actual resources that it took (10pt)
- -3pts if relevant requirements are not sufficiently documented. Relevant requirements vary among different papers, but might include GPU/CPU hours, wall clock time, type of hardware, average runtime for each epoch, number of trials and training epochs, RAM usage, disk memory, or other factors that have a significant impact. This is not a checklist. You should describe the computational requirements in a way that would be helpful if someone else wanted to reproduce the paper. What should they know?
- -3pts if you don’t estimate the requirements based on the original paper, or don’t explain how you obtained your estimates.
- -4pts if you don’t report on the actual requirements after running your experiments.
- If computational requirements were higher than estimated, discuss why, and what efforts you made to reduce the requirements. (+5pt)
Results (35pt)
- Reproducibility results (15pt)
- Organization (5pt)
- You should start with an overview, logically group related results into sections, and relate each result to a claim.
- Use tables and figures as appropriate to communicate results effectively.
- Think carefully about how to make it easy for the reader to see the differences between the original findings and your experiments, while also making it clear what the original findings were.
- Content (10pt)
- Report results for all experiments testing the hypotheses you stated. (5pt)
- -4pt if specific numbers are not included.
- State how your results compare to the original paper’s results. (5pt)
- If your experiments differ in some way from the original paper’s, you must explain how they are comparable.
- If there are negative or inconclusive results, discuss in more depth what might explain them and what further steps would clear things up.
- Negative and inconclusive results can be useful, but you should explain why they matter (e.g., “we tried to use method X to do task Y, and it didn’t work” is not particularly useful).
- Be cautious about drawing conclusions. If there are still more experiments required to discern the reason for an observed difference between methods, say so, and consider suggesting what experiments might help clarify the situation, even if you don’t have time to do them.
- Experiments beyond the original paper (max 20pt):
- You can earn full points by doing additional experiments in more than one category. A single category can be sufficient if the experiments are very comprehensive.
- Additional datasets
- Additional data may be for the same task or for a different task.
- Explore different methods
- Methods could include model architectures, training objectives, new ways of probing the model, etc.
- For each exploration, include discussions on what it indicates.
- Add new ablations
- Ablations could include varying the size of the training data, including/excluding some component of the model to see its effect, etc.
- For each new ablation, include discussions on what it indicates.
- Hyperparameter tuning
- For each hyperparameter tuning experiment, include discussions on what it indicates.
- Any other reasonable ablations/analyses are also eligible for credit
Discussion (10pt)
- Include larger implications of the experimental results, whether the original paper was reproducible, and if it wasn’t, what factors made it irreproducible. (5pt)
- -2pt if one of “What was easy” or “What was difficult” is missing.
- -5pt if both of “What was easy” and “What was difficult” are missing.
- Discuss, with justification, whether the evidence from your experiments supports the original hypotheses, and discuss the strengths and weaknesses of your approach. (3pt)
- Provide a set of recommendations to the original authors or others who work in this area for improving reproducibility. (2pt)
References (3pt)
- References are well-formatted using bibtex, and appear in a references section at the end of your paper. (1pt)
- Each citation must include at least the following items: authors, year, title, and venue (which can be an arXiv code).
- References are properly cited through your paper. (1pt)
- References are provided for all previous work cited in your paper, including metrics, models, datasets, etc. (1pt)
Appendix (Optional - 0pt)
- For completeness, you can include supplemental content in an appendix, but don’t expect us to read it.
- The main body of your paper must be complete and meet all rubric requirements without the appendix. For example, you might show a table of results for a main experiment in your paper, and put tables for additional experiments in an appendix. But you should still qualitatively and comprehensively report on all results that you intend to include in your paper, e.g., “a similar trend holds for the other datasets (full results in §A.1).”
- Appendices must be well-organized: separate appendices (with descriptive titles) should be created for separate topics or datasets, and be properly grouped and numbered (e.g., A.1, A.2, A.3, B.1, B.2…). When you reference an appendix, it should be clear what it’s about, and you should link to it (e.g., “detailed prompts can be found in §A.2”).
- Up to 10pts deducted if we need to read it to understand your paper.
Other
- Up to 10pts deducted if you have no visualizations (tables, figures) or if they are ineffective. Visualizations should be appropriate and clear, and make it easy to draw conclusions. Any time you include a table or figure, make sure it’s referenced in the text.
- Up to 10pts deducted if the terminology and notation used in your paper are not clear. You should explain terminology and notation to the reader when appropriate.
- Up to 10pts deducted if related work is not properly incorporated into your paper. In most cases, this kind of project probably won’t need a lot of discussion of related work, assuming you did a good job of presenting information from the original paper in a self-contained way. If there is related work that doesn’t fit the flow of the rest of your paper, you could have a separate section, but it’s not the only option. In any case, most related work should have been cited in relation to your paper’s ideas when relevant, and the relevance should be explained (don’t assume your reader is familiar with the work you cite).
- Up to 10pts deducted for excessive spelling, grammar, or formatting errors, or for poor organization of writing. Organize your paper logically, and separate parts into smaller subsections if appropriate.
- -10pt if the report exceeds the page limit (8), excluding references, appendices, and the one-page summary that comes first.