Short paper involving replication
The big picture
You have about 3 to 4 weeks to write a short paper which replicates a study. In our context, a replication may mean either directly reproducing or updating the results of a selected number of tables and findings in a study.
Because of the time frame, you are not expected to produce a perfect project or report. In addition, because this is not a thesis, you are not expected to create something entirely new. But you are expected to use everything you have learned in the course, along with whatever you bring as part of your own experience and your own understanding, to produce an informative report.
There are seven studies available for replication. Each of these seven studies has project tasks. You are to choose one of the many available project tasks. Let me know your choice by filling out the Form for project choice found in the Modules page of the ECOMETR’s Animospace, on or before 2024-03-18. After a choice is made, you are no longer allowed to change projects. The deliverables are to be submitted on or before 2024-04-12.
You may be asked questions about the study you have chosen to replicate in the final exam.
The deliverables
There are two deliverables:
- A Quarto document (qmd file), along with the rendered HTML containing the calculations and tables done using R
- Supplemental materials needed to render your Quarto document on top of the qmd file itself, specifically the datasets used and the bib file. The qmd file should embed the R code, so that the results are generated automatically when the document is rendered.
The report will be written in Quarto, documenting and discussing what you have done and found, based on the chosen project task. You are required to use R for computations. All processing, cleaning, and analysis have to be done in R and built into your Quarto document. This means that anyone starting from the rawest data and executing your explicit commands can trace the processing from the rawest data, to the data actually used in the analysis, to the analysis and presentation of results.
In the end, the Quarto document should start from the loading of the datasets to the generation of results up to the final tables to be reported. The Quarto document should also include short writeups and comments (especially for the code) so that I could understand and assess what you have done. The Quarto document along with the supplemental files when rendered should produce the same HTML file you include as part of your deliverables.
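As a rough sketch only, the flow from rawest data to final table could look like the R code below. Everything in it is made up for illustration: the small data frame stands in for a downloaded extract, the variable names, the -1 missing code, and the age restriction are hypothetical, and the regression is not taken from any of the studies. Only base R is used.

```r
# Step 1: load the rawest data. A small made-up data frame stands in here for
# what would normally be read from the downloaded extract file.
raw <- data.frame(
  id   = 1:8,
  age  = c(25, 40, 33, 51, 29, 46, 38, 60),
  educ = c(12, 16, 14, 12, 18, 16, 12, 14),
  wage = c(15.0, 32.5, 21.0, -1, 28.0, 35.5, 19.5, 24.0)  # -1 is a hypothetical missing code
)

# Step 2: clean, with every sample restriction written as an explicit command.
clean <- raw
clean$wage[clean$wage < 0] <- NA                      # recode the hypothetical missing code
clean <- clean[!is.na(clean$wage), ]                  # drop observations without a wage
clean <- clean[clean$age >= 25 & clean$age <= 55, ]   # hypothetical age restriction
clean$log_wage <- log(clean$wage)                     # construct the analysis variable

# Step 3: run the analysis on the cleaned data.
fit <- lm(log_wage ~ educ + age, data = clean)

# Step 4: build the final table from the fitted model.
results <- round(summary(fit)$coefficients, 3)
print(results)
```

Writing each restriction as its own commented line makes it easy for a reader to see how many observations are dropped at each step and why.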
Because it is a formal report, references and citations are required. Furthermore, you are to write for an audience who will be taking an econometrics course or is currently taking an econometrics course, just like you a couple of months back.
The language of the report, along with any supplemental material, has to be in English. You are not expected to have perfect grammar and diction, but you are expected to do your best to communicate your own thoughts and understanding to the reader.
A template you can use for the project is available at Animospace as a qmd file template-final-report.qmd (a text file you have to render in RStudio, after installing RStudio and Quarto) and as an HTML file template-final-report.html. You also need a bib file references.bib, which is also a text file containing the references used for the template. Both the bib and qmd files have to be in the same directory when you do the rendering. Make sure to render your document every once in a while to check what it looks like and if there are errors.
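If you want to check that the template files are in the same directory before rendering, a minimal sketch in base R is given below. The helper name check_project_files is hypothetical and is not part of the template; only the two file names come from the description above.

```r
# Hypothetical helper: returns the names of the required files that are
# NOT found in the given directory (an empty result means all are present).
check_project_files <- function(dir,
                                files = c("template-final-report.qmd",
                                          "references.bib")) {
  present <- file.exists(file.path(dir, files))
  files[!present]
}

# Usage: check the current working directory before rendering.
missing_files <- check_project_files(".")
if (length(missing_files) == 0) {
  message("All required files are in the same directory.")
} else {
  message("Missing: ", paste(missing_files, collapse = ", "))
}
```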
The two deliverables will be uploaded to Animospace or some other location which will be set up later on. You will be notified once the uploading area is ready. It is very likely to be a separate uploading area which allows for syncing, so that you have automatic backups and time stamps.
The assessment
The assessment will be on an individual basis.
You will be applying everything that you have learned in the course and perhaps more, depending on your interests and learning goals.
At the minimum, if I render your qmd file (along with your bib and data) on my computer, I should be able to reproduce your rendered HTML file with minimal adjustments.
As for the tables or visualizations, you can use any R package (say, stargazer, modelsummary, sjPlot, and others) to produce your tables. These tables do NOT have to be literally the same as the tables produced by the authors of your chosen article. It helps if they are, but you have some liberty to present a table which allows you to communicate your findings as directly as possible.
You are graded for (in no particular order):
- Documentation of how you have constructed the final dataset used in your analysis
- The compatibility of your documentation, findings, and tables with your own R code (ensure readability and make sure to have enough comments)
- Whether the process by which you have obtained the results is compatible with how the authors of the original study have obtained theirs and whether you have interpreted your findings correctly
- Cohesiveness and conciseness of your entire individual paper
- Completeness of the submission: Quarto document (qmd, bib), your rendered HTML file, data in a compressed format (zip, gz, 7z, rar are all acceptable)
Notice that you are not graded on how closely you have reproduced the results of the original study, as it is possible that you do not have complete information based on the article you have chosen. You may have to make judgment calls and your own choices, which you should write up as part of your report.
You get zero credit if you do not submit on time. If there are no citations and there are indications of plagiarism, you also automatically get zero credit and you will get a 0.0 for the entire course. Violations of the restrictions laid out in the next section will also lead to automatic zero credit and an automatic 0.0 for the entire course.
Starting from a maximum total integer score of 18 (representing 18 percentage points of your grade), every element that is:
- lacking will lead to a deduction of 1
- moderately lacking will lead to a deduction of 2
- extremely lacking will lead to a deduction of 3
For example, if you got an integer score of 12, then that means you got 12 percentage points out of the 18 percentage points. The latter represents the full credit for this particular course requirement.
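The arithmetic can be sketched in R as follows. The deductions below are hypothetical, and the element labels are only shorthand for the graded items listed above; they are chosen so that the total matches the example score of 12.

```r
# Maximum integer score for this course requirement.
max_score <- 18

# Hypothetical deductions for each graded element (0 means nothing is lacking).
deductions <- c(
  dataset_documentation = 1,  # lacking
  code_compatibility    = 0,
  replication_process   = 2,  # moderately lacking
  cohesiveness          = 0,
  completeness          = 3   # extremely lacking
)

# Final score: start from the maximum and subtract all deductions.
score <- max_score - sum(deductions)
print(score)  # 12, i.e. 12 of the 18 percentage points
```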
Restrictions
You are not to use any AI assistance (for example, but not limited to, ChatGPT or its variants) for your project. The project tasks are sufficiently narrowed down and have enough background. The expectations about the project are also calibrated enough. These aspects are designed so that it would be a personal and authentic learning experience for you and no one else. Therefore, you are free to make not-so-serious mistakes along the way.
But you will be made to face the consequences of violating the spirit of a personal and authentic learning experience. Examples include, but are not limited to, attempting to reuse past projects, contacting the original authors of the study for their code and using it as your own code, using someone else’s work and claiming it as your own, falsifying data, and other dishonest acts covered in the academic rules of the university.
You are free to discuss with your classmates in ECOMETR V28, but not with anyone outside of the class list. But make sure that your discussion with your classmates is really a discussion. For example, discussion with your classmates does not mean copying or exchanging R code. If you decide to discuss with your classmates, you have to acknowledge what you and your classmates have contributed in the discussions which ultimately led to the report. At the end of the day, you are not writing a joint paper.
The projects available
The titles of the subsections below are the titles of the seven studies available for project tasks. The studies are available in pdf format at Animospace at the course home page.
A maximum of 3 people are allowed for each project task. First come, first served.
On-the-job search and wage dispersion: New evidence from time use data
- Project 2A: Reproduce Tables 1 and 2, except for the columns which involve a Tobit regression, using the details found in the paper.
- Project 2B: Update the analysis of Tables 1 and 2 using newer data.
Shopping time
- Project 3A: Reproduce Tables 1 and 2 using the details found in the paper.
- Project 3B: Reproduce Tables 3 and 4 using the details found in the paper.
- Project 3C: Reproduce Tables 5 and 6 using the details found in the paper.
- Project 3D: Update the analysis of Tables 1 and 2 using newer data.
- Project 3E: Update the analysis of Tables 3 and 4 using newer data.
- Project 3F: Update the analysis of Tables 5 and 6 using newer data.
Labor market differentials estimated with researcher-inferred and self-identified sexual orientation
- Project 5A: Reproduce Tables 1, 2, and 3 using the details found in the paper. For Tables 1 and 2, focus only on the men.
- Project 5B: Reproduce Tables 1, 2, and 4 using the details found in the paper. For Tables 1 and 2, focus only on the women.
- Project 5C: Update the analysis of Tables 1, 2, and 3 using newer data.
- Project 5D: Update the analysis of Tables 1, 2, and 4 using newer data.
The Employment of Low-Skilled Immigrant Men in the United States
- Project 6A: Reproduce Tables 1 and 2 using the details found in the paper.
- Project 6B: Update the analysis of Tables 1 and 2 using newer data.
Does the US Labor Market Reward International Experience?
- Project 7A: Reproduce Tables 1, 2, and 3 using the details found in the paper.
- Project 7B: Update the analysis of Tables 1, 2, and 3 using newer data.
The data sources
You will be creating the dataset from scratch based on the descriptions given by the authors of the article. You are not allowed to contact the authors of the article. The major reason is so that you would have the chance to really dig into data cleaning, learn about the dataset you are using, and perhaps learn some other tools depending on your circumstance.
It is desirable to achieve a perfect reproduction of the results and your attempts should be in line with that goal. But because I want you to think of the individual paper as more of a learning experience, the grading does not depend on whether you have achieved a perfect reproduction of the results. This also does not mean that you can haphazardly do the reproduction, as it will become apparent from your report and code.
Some of the papers have code and data available online in some repository. You may consult these materials, but all processing, cleaning, and analysis have to be done in R. This means that your Quarto document should start from loading the rawest data (meaning that I should be able to go to IPUMS and download your rawest data following your documentation) and should have explicit commands which trace the processing of the rawest data to the data used for the reproduction and extension. Furthermore, you still have to construct the data from scratch based on the descriptions in the article.
All the articles feature data obtainable from IPUMS. You can create an account at IPUMS USA now or at the end (just like when you are doing online shopping). Provided that you have your list of variables, you can now:
- Go to the website of IPUMS. Give yourself some time to get acquainted with IPUMS itself. Take note of the other datasets covered by IPUMS. You may have to create an account early enough because some datasets require an approval process. Pay attention to this!
- Next, visit the relevant sub-site of IPUMS. This will depend on the data used in the paper. Give yourself some time to get acquainted with that sub-site (for example, IPUMS USA).
- Click on the button “Get Data”. You will be taken to a “shopping” interface which will allow you to browse and check out data extracts.
- Select samples first. Uncheck the box labeled “Default sample from each year”. Take some time to get acquainted with the IPUMS samples. Check the relevant boxes. Once you are done, click on “SUBMIT SAMPLE SELECTIONS”.
- Next, you will be exploring the variables available. Based on the information in the article, along with your judgment, select variables (use the harmonized variables) which you think would allow you to eventually reproduce the findings of the authors. Ensure that the variables are indeed present in the data. You can search for the variables directly or use alphabetical listings or classifications to find them.