The remainder of your grade will be determined through a semester-long group project. The goal of this project is to expose you to how we create and work with culture as data. As stated above, this project will be completed in groups, and you will be assessed both on your individual contributions and the group’s final submission. The final project is modeled on the Responsible Datasets in Context Project (RDC) https://www.responsible-datasets-in-context.com/, which was created to help students “work with data responsibly.” While we will be using these datasets to practice and learn how to programmatically work with data, they also provide an example of how best to curate and share data about complex cultural phenomena and objects.

As the authors of the project write in the mission statement:

“Data cannot be analyzed responsibly without deep knowledge of its social and historical context, provenance, and limitations. Anyone who works with data—from academic researchers to industry professionals—will know this claim to be true.

But despite its significance, social and historical knowledge and methodologies are one of the most neglected areas in undergraduate computing education. In classes, it is very common for students to use datasets that they find on websites like Kaggle, datasets that are poorly documented and that students thus don’t fully understand. This is a recipe for irresponsible data work and a bad habit that can become a dangerous habit as the stakes get higher.”

While I do not expect you to create an output as polished or extensive as the datasets available on the RDC Project, you will be working collaboratively to create a first draft of what could eventually be part of this project.

Project Requirements

As the creators of the RDC correctly note, there is no scarcity of data in the world, especially when it comes to cultural objects and practices. This abundance presents a challenge for our course: How do we ensure that you are engaging in responsible data creation rather than merely reusing existing datasets?

Responsible Data Creation & Curation

To address this, I encourage you to use existing data only under specific conditions. If you choose to work with pre-existing datasets, you must locate the original objects or sources from which the data was derived and document the processes that transformed those objects into data. It is essential that you critically assess how much the cultural object’s historical reality has been altered in the process of datafication.

For example, many movie datasets available on platforms like Kaggle include ratings and reviews that have been scraped from various online sources. If your group decides to use such a dataset, you will need to thoroughly document how the data was scraped and provide an assessment, to the extent possible, of how comprehensive and accurate this process was. This might involve finding original versions of the reviews and comparing them with the dataset to identify any discrepancies or normalizations that occurred during the scraping process. This reflective process, known as creating a “data biography,” is crucial to understanding and documenting the dataset’s provenance and limitations.
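
As one illustration of the comparison described above, here is a minimal sketch in Python, assuming you have located the original text of a review and the version stored in a scraped dataset. The function name `word_changes` and the sample review are invented for this example; they are not taken from any real dataset.

```python
import difflib


def word_changes(original: str, scraped: str) -> dict:
    """List the words that were removed from, or added to, the original text."""
    # Compare the two texts word by word rather than character by character.
    diff = list(difflib.ndiff(original.split(), scraped.split()))
    return {
        "removed": [line[2:] for line in diff if line.startswith("- ")],
        "added": [line[2:] for line in diff if line.startswith("+ ")],
    }


# Hypothetical example: a review as posted, and the same review as scraped.
original_review = "I LOVED this film <3 !!"
scraped_review = "i loved this film"

changes = word_changes(original_review, scraped_review)
print(changes)
```

Here the scraped version has been lowercased and stripped of the emoticon and punctuation, which is the kind of normalization worth recording in a data biography.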

Additionally, if you choose to use existing data, you must augment it in some meaningful way. This could involve combining multiple datasets to create a richer, more nuanced resource or manually adding new data elements that were not part of the original dataset.
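
As a rough sketch of what combining datasets can look like, the Python below joins two small lists of records on a shared field and records whether each row found a match. The function `merge_on_key`, the field names, and the sample rows are all hypothetical; a real project would also need to decide how to handle near-matches such as alternate titles.

```python
def merge_on_key(primary, extra, key):
    """Add fields from `extra` to matching rows in `primary`, keyed on `key`."""
    extra_by_key = {row[key]: row for row in extra}
    merged = []
    for row in primary:
        combined = dict(row)  # copy so the primary data is left unchanged
        match = extra_by_key.get(row[key])
        if match is not None:
            for field, value in match.items():
                # Keep the primary value if both sources share a field.
                combined.setdefault(field, value)
        # Record provenance: did this row gain anything from the second source?
        combined["found_in_extra"] = match is not None
        merged.append(combined)
    return merged


# Invented sample data for illustration only.
films = [
    {"title": "Film A", "year": "1999"},
    {"title": "Film B", "year": "2004"},
]
awards = [{"title": "Film A", "award": "Example Award"}]

merged = merge_on_key(films, awards, "title")
for row in merged:
    print(row)
```

Keeping a flag such as `found_in_extra` makes gaps between the sources visible, which is useful material for your documentation.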

Alternatively, you may decide to create a dataset from scratch. This could be done through manual annotation or by programmatically generating data, skills that you will develop over the course of the semester. Creating data from scratch allows you to engage deeply with the cultural objects or practices you are studying and ensures that the dataset reflects your group’s specific research question.

Responsible Collaboration

Part of creating responsible datasets is also working responsibly within your group. Collaborative work is not only about dividing tasks but also about fostering a respectful and inclusive environment where all members feel valued and supported.

To ensure that the workload is equitably distributed, groups should consider assigning tasks based on each member’s strengths and experience. For example, those with more programming or research experience might take the lead on certain technical tasks, but it is crucial that every member contributes to the data creation and documentation process.

Respect and inclusivity are key to successful collaboration. Be open to each other’s ideas, and if your group cannot agree on a single focus, consider having each member create their own dataset, with the group collaborating primarily on documentation and testing. This approach allows for individual creativity while maintaining a cohesive group effort.

Documentation is an integral part of responsible collaboration. While the Project Manager will play a key role in coordinating efforts and maintaining records, it’s important that all group members contribute to documenting the processes and decisions that shape your project. This includes being mindful of how you work together and supporting one another throughout the semester.

If a group member is not meeting agreed-upon milestones, consider why they might be struggling and how the group can support them. Conversely, if a member completes their tasks quickly, rather than taking on another’s work, they should offer assistance or suggest additional milestones to advance the project.

In both scenarios, the focus is not on completing the project to perfection but on ensuring a healthy and productive work environment within the group. This requires ongoing communication, respect, and a commitment to the shared goals of the project. While you may encounter frictions and frustrations, remember that these are part of the collaborative process. The goal is to bring together your differing perspectives and skills to find common ground and build workflows that support one another.

We will be discussing these requirements throughout the semester. To help you undertake this work, the project is structured around several key milestones. These milestones are intended to help you stay on track and to provide opportunities for feedback and revision throughout the semester. Part of your weekly in-class group work will involve completing tasks related to the project, but you will also be expected to work together outside of class to complete it successfully.

Initial Project & Semester Planning Proposal 5%

DUE SEPTEMBER 12, 2024 (Optional Extension to September 24, 2024)

After being sorted into groups, your first task will be to collaboratively draft an Initial Project & Semester Planning Proposal. This proposal will outline your group’s research focus, the cultural practice(s) or object(s) you plan to study, and the initial ideas for the dataset(s) you will create. The proposal should include the following components:

Topics

What is your group’s primary topical focus within your broad area of cultural interest? Will you focus on a particular research question, such as how a specific cultural phenomenon is represented across different media or how public discourse around a cultural event has evolved over time? Alternatively, will your group explore a more open-ended question, such as discovering what materials are available on an emerging cultural trend or topic? Clearly define the scope of your project and how you plan to approach it.

Materials

What materials will you need to complete your project? Will you work with physical objects that you will digitize and turn into data, such as printed materials like books, historical documents, or artifacts? Or will you seek out existing digital materials and focus on studying and augmenting them, for example by scraping data from a social media platform or using publicly available databases? Be specific about the sources and types of data you will be using, and consider any challenges you might face in accessing or processing these materials.

Division of Labor

How will your group divide the labor to complete this project? Clearly outline the roles and responsibilities of each group member, taking into account individual strengths and interests. How will you manage tasks that are dependent on the completion of other tasks? For example, if data collection must be completed before analysis can begin, how will you ensure that these stages are properly coordinated? Consider how the Project Manager role will rotate and how you will maintain communication and accountability within the group.

Timeline

What is your timeline for completing tasks, given the remaining milestones and the focus of your project? Break down your project into manageable phases with specific deadlines, ensuring that you allocate enough time for data collection, analysis, and revisions. Consider any potential bottlenecks or challenges that might affect your timeline and how you will address them. This timeline should be realistic but also flexible enough to accommodate unforeseen issues.

This initial proposal must be submitted as part of your group’s GitHub repository, which will be thoroughly documented (a topic we will discuss in detail in class). You should aim for 750-1000 words and include relevant links to materials or scholarly sources. Feel free to incorporate tables, graphs, bullet points, or any other formats that will help you clearly outline the goals and plan for this project.

Ultimately, this proposal should lay the groundwork for your project and serve as your blueprint for the semester. It will guide your group’s work and ensure that you stay aligned with your research objectives.

In our first few weeks, we will be discussing what makes a good topic, and you are also welcome to schedule a meeting or contact the Instructor to brainstorm possible topics.

⚡️ This proposal should be submitted in your group's GitHub repository prior to class on September 24, 2024. You can ping the instructors on Discord if you submit early, but we will begin reviewing your materials from that time onwards.

Mid-Semester Dataset and Documentation Update 15%

DUE OCTOBER 22, 2024 (Optional Extension October 29, 2024)

After completing your initial project proposal, your next task is to create and submit the first version of your dataset along with detailed documentation. This submission marks your first attempt at gathering and organizing your data, and it is crucial to approach it as a draft that you will refine and revise based on feedback and further analysis. Here are the overall guidelines, but you are welcome to adapt them to your project’s specific needs and go beyond them:

  1. The Group’s Initial Dataset

The core part of this assignment is sharing the first version of your dataset. This dataset should be a substantial representation of your work so far and should reflect your group’s engagement with the requirements of the semester project, as well as the feedback from the instructors. There are no firm guidelines, but please consider the following:

  • First Attempt: This submission should represent your group’s initial efforts at creating a dataset related to your chosen cultural practice or object. While there is no required number of rows or entries for this initial submission, the dataset should be substantial enough to reflect meaningful engagement with your group’s focus and to allow for preliminary analysis.
  • Flexibility in Scope: The exact size and scope of the dataset will vary depending on your project’s focus. The instructors will provide feedback on your initial project proposal to help you determine what is both feasible and sufficient. Your goal at this stage is to create a dataset that is large enough to be useful, but not so large that it becomes unmanageable.
  2. Initial Dataset Documentation

Alongside your dataset, you must submit documentation that details the process of data collection and the choices your group made along the way. This should be included in your group’s GitHub repository and should cover the following aspects:

  • Process and Choices: Explain how you gathered the data, any challenges you encountered, and the criteria you used to include or exclude certain data points. You are encouraged to cite readings and materials from class to support your rationale.
  • Content Description: Provide a clear description of what the dataset contains. This includes an overview of the data fields, the type of information represented, and any relevant context. If your dataset is compiled from multiple sources, explain how these sources were combined and reconciled.
  • Responsibility and Contributions: Clearly outline who was responsible for each part of the dataset creation and documentation process. This ensures transparency and helps the group reflect on the division of labor and make adjustments for future work.
  3. Submission and Feedback

The initial dataset and its documentation must be submitted to your group’s GitHub repository by the deadline. This submission will form the basis for your group’s ongoing work, and you will have the opportunity to revise and improve the dataset and documentation in later stages of the project.

After this submission, the instructors will provide detailed feedback on both the dataset and the documentation. This feedback will help guide your revisions and ensure that your project stays on track. It will address the scope of your dataset, the effectiveness of your data collection methods, and the clarity of your documentation. Finally, you will be sent a group assessment survey once you submit your work, which will allow you to reflect on your contributions and the group dynamics up to this point in the semester. As a reminder, grades are based on the quality of your submission, individual contributions, and the group’s overall collaboration and processes. This is a chance to showcase your progress, reflect on your processes, and receive constructive feedback to enhance your project.

Experimenting With Datasets Update 5%

DUE NOVEMBER 19, 2024 (Optional Extension December 3, 2024)

After creating your dataset, the next crucial step is to experiment with it using various computational methods. This process not only helps you identify any data issues that need correction but also provides valuable insights into how the dataset might be used by others in future research. The goal of this phase is to refine your dataset and inform the documentation that will guide others in understanding and applying your work. This update will also be submitted via GitHub and should include details on the division of labor and any issues you faced. We will discuss this milestone in more depth in class, but here are some general goals and guidelines:

  1. Purpose of Experimentation
  • Exploring Data Utility: Experimenting with your dataset allows you to assess its utility in answering your research question and to discover any limitations or gaps. By applying different computational methods, you can see how the data behaves in practice and whether it yields meaningful results.
  • Informing Future Use: The experimentation phase also helps you think about how others might use your dataset. By documenting the outcomes of your experiments, you can provide future users with insights into the dataset’s strengths, limitations, and potential applications.
  2. Method Selection and Application
  • Choosing Computational Methods: Your group will select one or more computational methods that are appropriate for your dataset. These might include text analysis, network analysis, topic modeling, or other techniques relevant to your research focus. The methods you choose should align with your research question and be capable of revealing important patterns or trends within your data. You are encouraged to consult with the Instructor about potential methods.
  • Applying the Methods: Once you’ve chosen your methods, you will attempt to apply them to your dataset. You are again encouraged to consult the Instructor for assistance with applying these methods, and you may also use AI chatbots for help. You will be assessed primarily on your efforts to apply the methods, not on the success of your results. This process allows you to test the data’s validity and robustness. You may find that certain aspects of the data need revision, such as missing variables, inconsistent formatting, or incomplete entries. The results of these experiments will guide your next steps in refining the dataset.
  3. Documentation and Reflection
  • Documenting the Process: As with previous submissions, you must document the process of experimentation, including the methods you used, the results you obtained, and any issues you encountered. This documentation should be detailed and transparent, allowing others to understand your process and replicate your work. It should also include some reflection on how you divided the labor and collaborated on this phase of the project.
  • Reflecting & Submitting: You are welcome to submit the update in any format you choose, though it should be in your group’s GitHub repository, and should include all code and data used to experiment with the dataset. Finally, you are strongly encouraged to reflect on your experiment and speculate on what you or others might do next with the dataset.
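
To make the kinds of data issues mentioned above more concrete, here is a minimal sketch in Python of a check for incomplete entries and inconsistent formatting. It assumes each row is a dictionary and that years should be four-digit strings; the function `audit_rows`, the field names, and the sample rows are invented for illustration, and your own dataset will likely need different checks.

```python
import re
from collections import Counter


def audit_rows(rows, required_fields, year_field="year"):
    """Count empty required fields and flag years that are not four digits."""
    missing = Counter()
    bad_years = []
    for index, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            # Treat absent, None, and whitespace-only values as missing.
            if value is None or not str(value).strip():
                missing[field] += 1
        year = row.get(year_field)
        if year is not None and str(year).strip():
            if not re.fullmatch(r"\d{4}", str(year).strip()):
                bad_years.append((index, str(year).strip()))
    return {"missing": dict(missing), "bad_years": bad_years}


# Invented sample rows for illustration only.
sample_rows = [
    {"title": "Item A", "year": "1999"},
    {"title": "", "year": "c. 1920"},
    {"title": "Item C", "year": None},
]

report = audit_rows(sample_rows, ["title", "year"])
print(report)
```

A flagged value such as "c. 1920" is not necessarily an error; it may reflect real uncertainty in the source, which is worth explaining in your documentation rather than silently cleaning.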

Demo Data Day 5%

DUE DECEMBER 10, 2024 (Hard Deadline)

Our final class meeting will be devoted to presenting your projects, with each group having 10-12 minutes to present their project, followed by a 3-5 minute Q&A session. This will be a chance to showcase your work, reflect on your process, and receive feedback from your peers and the instructors. The presentation should include the following components:

  1. Introduction: Briefly introduce your group and your project, including the cultural practice or object you are studying and the research question you are exploring. Provide an overview of your dataset and its creation process.
  2. Dataset Demonstration: Demonstrate your dataset and its structure, highlighting key data fields and any unique features. Discuss how the dataset was created and any challenges you encountered along the way.
  3. Experimentation Results: Share the results of your computational experiments, including any insights you gained from applying different methods to your dataset. Discuss how these experiments informed your understanding of the data and its potential uses.
  4. Reflection and Future Directions: Reflect on your project as a whole, considering what you learned from the process and how you might continue to develop the dataset in the future. You are also welcome to connect your reflections to readings from the course or beyond. You should discuss any limitations or challenges you faced and how you addressed them, whether that is barriers to accessing the data or difficulties in collaborating with your group.

You are welcome to use slides or other visual aids to support your presentation, but you can also simply present your GitHub repository or other materials directly. The goal is to communicate your work clearly and effectively, highlighting the key aspects of your project and how it engages with the broader themes of the course.

Final Project Submission 25%

DUE DECEMBER 16, 2024 (Hard Deadline)

The final part of this semester-long project is your final project submission, due at the end of the semester (please note the hard deadline).

The final project submission represents the culmination of your group’s work and should include several key components that together provide a comprehensive account of your dataset, its creation, and its potential uses. This submission should be thorough and detailed, allowing others to understand, evaluate, and potentially reuse your dataset in their own research.

Final Submission Components

  1. Data Essay
  • Content: Your data essay should detail the entire process of creating your dataset, including its historical context, the complexities you encountered, and the methodologies you used. This narrative should cover the decisions made at each stage, from the initial collection of data to the revisions following your experimentation phase.
  • Labor Division: The essay should also include a clear account of how the work was divided among group members, highlighting each person’s contributions. This transparency ensures that the collaborative nature of the project is well-documented.
  • Purpose: The goal of the data essay is to provide a reflective and comprehensive account of your project, making it accessible to others who may wish to learn from or build upon your work.
  2. Dataset with Documentation
  • Dataset Submission: Include the final version of your dataset, fully revised and ready for use. This dataset should be clean, well-organized, and accompanied by detailed documentation.
  • Documentation: The documentation should describe the dataset’s structure, the meaning of each data field, and any preprocessing or cleaning steps you took. This documentation is crucial for ensuring that others can understand and use your data effectively.
  3. Guidelines for Data Use
  • Potential Uses: Provide guidelines for how your dataset might be used by other scholars. This section should include potential scholarly applications, with citations to relevant work that informed your project or could benefit from your dataset.
  • Limitations and Considerations: Discuss any limitations of the dataset, such as privacy concerns, data quality issues, or barriers to reuse. Be transparent about the dataset’s strengths and weaknesses so that future users can approach it with a clear understanding of its potential and its constraints.

Submission Requirements

  • No Word Limit: There is no strict word limit for the data essay or documentation, but the submission should be sufficiently detailed to ensure that others can understand and reuse your dataset. Quality and clarity are more important than length; aim for thoroughness in covering all aspects of your project.
  • GitHub Repository: The final project, including the data essay, dataset, documentation, and usage guidelines, must be submitted via your group’s GitHub repository by the deadline. This ensures that your work is accessible and organized in a manner consistent with professional research practices.

This final submission is your chance to showcase the full extent of your work and to contribute a valuable resource to the broader academic community. By documenting your process and providing clear guidelines for future use, you help ensure that your efforts have a lasting impact and that you contribute to a more ethical and responsible approach to representing culture as data.
