Data Science
Course Information
- CMPS 4790/6790 Data Science
- Fall 2024 Term, Tulane University
- Lecture Times: Wednesday, 1800 - 1915
- Room: Online Only.
- Online: All weekly meetings will be in InSpace, see Tulane Canvas for details
- Prerequisite Courses: Officially none, however, we strongly encourage students to have taken one or more computing courses and be familiar with Python, e.g., CMPS 1500/1600 Introduction to Computer Science I/II, CMPS 3160 Introduction to Data Science, or CMPS 6100 Introduction to Computer Science.
Website Information
- Webpage: https://nmattei.github.io/cmps6790/
- GitHub: https://github.com/nmattei/cmps6790
- Canvas Page: Tulane Canvas
Instructor Information
- Instructor: Dr. Nicholas Mattei, nsmattei@tulane.edu
- Office: Stanley Thomas Hall (Building 10), Room 305B
- Office Hours: Mondays 1730 - 1830, Tuesdays 1530 - 1630, and by appointment; see the link in Tulane Canvas
Teaching Assistants
- N/A, no assistants this academic year.
Course Communication Policy
There are a variety of methods you can use to get in touch with us, and we expect to be able to get in touch with you. A few general policies.
- When emailing, please email all TAs and the professor of your section. We will respond within 24 hours. Turn around may be faster, but do not rely on it.
- We expect the same from you: that you will check your email/Canvas every 24 hours. All major announcements will be distributed via the Announcements function of Canvas.
- We are all available for drop-in office hours and by appointment. Please reach out to us directly to set up extra time if you need more support during the semester.
- If you have a general question, please check or post on the discussion board on Canvas! We check it regularly to answer common questions on projects and homeworks. The solution to your question might already be there!
Catalog / Course Description
This course is designed for both graduate students and advanced undergraduate students interested in understanding both the fundamental and advanced concepts, techniques, and technologies required for collecting, processing, and deriving insight into data. Data Science is an interdisciplinary set of topics that includes everything you need to create data driven answers and solutions to specific business, scientific, or sociological questions. Topics typically covered include an introduction to one or more data collection and management systems, e.g., SQL, web scraping, and various data repositories; exploratory and statistical data analysis, e.g., bootstrapping, measures of central tendency, hypothesis testing and machine learning techniques including linear regression and clustering; data and information visualization, e.g., plotting and interactive charts using various technologies; and presentation and communication of the results of these analyses. Students should be comfortable programming in Python and familiar with the fundamentals of algorithmic analysis and computer systems.
Prerequisite: Officially None, however, we strongly encourage students to have taken one or more computing courses and be familiar with Python, e.g., CMPS 1500/1600 Introduction to Computer Science I/II, CMPS 3160 Introduction to Data Science, or CMPS 6100 Introduction to Computer Science.
Note: This is a programming and mathematics intensive class; you will be programming every day. While this course only assumes introductory programming knowledge, the assignments will require extensive programming in Python as well as mathematical maturity. Helpful (but not required) background includes the ability to navigate a Linux/Unix command prompt, a working understanding of graph theory and probability theory, and an understanding of basic algorithms.
Course Goals, Objectives, and Overview
Data Science is an interdisciplinary set of topics that includes everything you need to create data driven answers and solutions to specific business, scientific, or sociological questions. The goal of data science is to improve decision making based on insights from data. As a field, Data Science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from datasets.
This course will cover:
- Data management systems, e.g., SQL, web scraping, and various data repositories;
- Exploratory and statistical data analysis, e.g., bootstrapping, measures of central tendency, hypothesis testing and machine learning techniques including linear regression and clustering;
- Data and information visualization, e.g., plotting and interactive charts using various technologies; and
- Presentation and communication of results.
The course will use Python and be largely project and case study driven with students expected to analyze real datasets and post an analysis/tutorial publicly on GitHub at the end of the course. Note that this is a programming intensive course and familiarity with Python is expected.
Students should be comfortable with programming in at least one language (preferably Python) and have had a reasonable amount of math background including one college level math course. There are many helpful tutorials and background materials on the Links & Resources Page; please make use of them.
We’ll be using Docker and a number of packages including NumPy, SciPy, SciKit, and Pandas.
Course Learning Outcomes
At the conclusion of this course students will be able to:
- Open, load, and manipulate data from various sources using industry standard tools.
- Use one or more data management and storage systems to load, explore, and clean data.
- Perform basic statistical analysis of the data including generating summary tables of various statistics and visualization.
- Express trends and implications present in data through the use of statistical hypothesis testing including p-values and confidence intervals.
- Use one or more machine learning algorithms for classification and regression.
- Present the results of a complete data analysis in written, visual, and presentation form.
Program-Level Outcomes
This course is currently new and aims to serve a range of constituents. The course should serve as:
- An advanced offering to undergraduate Tulane students who have completed CMPS 3160 and are looking for more depth.
- An advanced course in data science for Tulane MS and BS students who are not in CMPS.
- A core course for students in the CMPS-MATH MS in Data Science program.
- An introductory course for Tulane MS students who are interested in Data Science.
- A course for CMPS PhD students to gain advanced knowledge.
This course fulfills the requirement of one of the CMPS 3000-level or above courses required for the coordinate major in computer science. Students need to complete three such courses in order to complete the requirements for a coordinate major. For more information on the coordinate major please see the requirements at the Registrar’s Website
Required and Suggested Student Resources
This course requires the students to purchase the zyBook Data Science Foundations with Python. This will be linked through the official Tulane Canvas.
In addition to the required zyBook, there are two texts that are very strongly suggested for purchase. We will make extensive use of online textbooks and articles for the required reading that you will be quizzed on. You will also need access to a computer to complete the assignments. If you do not have access to a computer please see the instructor ASAP.
Books Highly Encouraged
- Data Science from Scratch: First Principles with Python, Second Edition, Joel Grus. O’Reilly Media, 2019. Code but not text available on GitHub.
- Practical Statistics for Data Scientists: 50 Essential Concepts, Peter Bruce and Andrew Bruce. O’Reilly Media, 2017. Text of 2nd Printing available online. Code available on GitHub.
Other Good Online Books
- Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas. O’Reilly Media Inc., 2016. Available online for free at: https://github.com/jakevdp/PythonDataScienceHandbook
- This textbook also has the entire book as a notebook, with examples on this GitHub page.
- Computational and Inferential Thinking: The Foundations of Data Science, Ani Adhikari and John DeNero. A free online textbook that includes interactive Jupyter notebooks and public data sets for all examples at: https://www.inferentialthinking.com/chapters/intro
Evaluation Procedures and Grading Criteria
This course will consist of seven distinct grading areas. Note that all point values described below for individual assignments are subject to change. Every point counts equally toward the final grade; relative percentages are given for reference. More information about all the assignments can be found on the Assignments Page and the Final Tutorial Page. All assignments and due dates are posted in Tulane Canvas.
Category | Points | Percentage | Group | Group Percentage |
---|---|---|---|---|
In Class / Attendance | 30 | 3.53% | Class, Readings, Labs | 20.00% |
5 Labs | 50 | 5.88% | Class, Readings, Labs | |
13 Zybooks Assignments | 65 | 7.65% | Class, Readings, Labs | |
5 Discussions | 25 | 2.94% | Class, Readings, Labs | |
Project 0 | 25 | 2.94% | Projects | 29.41% |
Project 1 | 75 | 8.82% | Projects | |
Project 2 | 75 | 8.82% | Projects | |
Project 3 | 75 | 8.82% | Projects | |
Test 1 | 100 | 11.76% | Tests | 23.53% |
Test 2 | 100 | 11.76% | Tests | |
Milestone 1 | 40 | 4.71% | Final Tutorial | 27.06% |
Milestone 2 | 50 | 5.88% | Final Tutorial | |
Final Presentation | 40 | 4.71% | Final Tutorial | |
Final Notebook | 100 | 11.76% | Final Tutorial | |
Total | 850 | 100.00% | | 100.00% |
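The group percentages in the table follow directly from the point values. As a minimal sketch, the arithmetic can be reproduced in Python as shown below. The point values are copied from the table, while the dictionary layout and the names `POINTS`, `TOTAL_POINTS`, and `group_percentages` are illustrative assumptions and not part of the course materials.

```python
# Point values copied from the grading table above. The structure and the
# names used here are hypothetical, chosen only for this illustration.
POINTS = {
    "Class, Readings, Labs": {
        "In Class / Attendance": 30,
        "5 Labs": 50,
        "13 Zybooks Assignments": 65,
        "5 Discussions": 25,
    },
    "Projects": {
        "Project 0": 25,
        "Project 1": 75,
        "Project 2": 75,
        "Project 3": 75,
    },
    "Tests": {"Test 1": 100, "Test 2": 100},
    "Final Tutorial": {
        "Milestone 1": 40,
        "Milestone 2": 50,
        "Final Presentation": 40,
        "Final Notebook": 100,
    },
}

# Sum of every point available in the course.
TOTAL_POINTS = sum(sum(items.values()) for items in POINTS.values())


def group_percentages(points):
    """Return each group's share of the total points, rounded to 2 decimals."""
    total = sum(sum(items.values()) for items in points.values())
    return {
        group: round(100 * sum(items.values()) / total, 2)
        for group, items in points.items()
    }


if __name__ == "__main__":
    print(f"Total points: {TOTAL_POINTS}")
    for group, pct in group_percentages(POINTS).items():
        print(f"{group}: {pct:.2f}%")
```

Because point values are subject to change, recomputing the shares this way is a quick check that the percentages still sum to 100%.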
An important aspect of this course is becoming a better coder. Hence all coding assignments will include a Professionalism component. One handy resource for this is Arie’s Coding Guide. Note that this was written for Intro. to AI CMPS 3140 but many of the same issues apply to this course.
In Class Activities, Participation, and Attendance
Attendance will be monitored through an in-class survey (nearly) every day of class. You are required to fill out this survey every day, synchronously in class unless you contact the instructor ahead of time to make alternative arrangements.
Throughout the semester we will regularly complete short in-class exercises such as brainstorming activities, think-pair-share, pre/post questions, and short answer writing. As we are online, this may sometimes include posting things on the discussion board or answering a short quiz in Canvas on additional readings. This may also include presenting/explaining answers to labs in class.
Required Readings and Labs
The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry. You will complete a selection of readings from the zyBook which have questions as well as Labs which will ask you to read and write code in a Jupyter Notebook. Posting solutions publicly online without the staff’s express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted unless you have a late token to use.
Projects
There will be 3 “mini-projects” assigned over the course of the semester (plus one simple setup assignment that will walk you through using git, Docker, and Jupyter). The best way to learn is by doing, so these will largely be applied assignments that provide hands-on experience with the basic skills a data scientist needs in industry. Posting solutions publicly online without the staff’s express consent is a direct violation of our academic integrity policy. Late assignments will not be accepted unless a late token is used, see below.
Note: Undergraduate students taking this course will be allowed to complete projects (with the exception of Project 0) in groups of exactly two.
Test 1 and Test 2
Each test will be a written, closed-book, live exam. You are allowed one hand-written study sheet, front and back, 8.5x11 inch paper. You will be required to turn in this sheet with the actual exam and it will be graded for completeness and neatness. Test 2 is cumulative. This semester both exams will take place synchronously; you must be in the session on the day of the test. You must earn at least a 60% average between the two exams to pass the course.
Final Project
In the interest of building students’ public portfolios, and in the spirit of “learning by doing”, students will create a self-contained online tutorial to be posted publicly and a ~7-minute presentation in class as well as two “pitch day” project updates. This tutorial can be created individually or in a small group (max 2 people). This assignment will be a publicly-accessible website that provides an end-to-end walk-through of identifying and scraping a specific data source, performing some exploratory analysis, and providing some sort of managerial or operational insight from that data. We will have several milestones associated with the final project:
- Identifying a dataset and establishing a GitHub.io site; Extract, Transform, and Load (ETL). This is accompanied by an in-class project pitch.
- Exploratory Data Analysis (EDA): Your notebook from Part 1 but expanded to include graphs, visualizations, and stats that show you can manipulate your data and understand the dataset you are working with. This is accompanied by an in-class project pitch.
- A final, in-class presentation.
- A final tutorial and website which must include, in addition to ETL and EDA, a statistical or predictive model and testing.
A complete version of the assignment as well as all the past assignments can be found at the Final Tutorial Page.
Additional Work for Graduate Students
Graduate students who enroll in this course will be required to complete their projects and their final project/tutorial as an individual.
Policies Related to Turning in Work
- All work will be turned in on Canvas. All work will be distributed either via Canvas or, in the case of Labs and examples, via the CMPS 6790 GitHub.
- All work will be due at class time on the day assigned. This means turning things in during or after class is considered late. This will be consistent throughout the semester.
Late Work Policy
All work must be turned in on time unless explicit consent for outstanding circumstances is given beforehand (or in the case of illness, with a documented absence after). Any late work without prior authorization, or a late token in the case of projects, will not be accepted and count as a 0.
Late Tokens
At the start of this semester you each are holding 3 late tokens which can be redeemed at any time for projects, labs, or readings only; they may not be used for milestones. Each token is worth one additional day, to use at your discretion. These tokens have no cash value but are worth 0.5% of your total grade at the end of the semester. Note that you cannot use a late token if lab solutions have already been presented in class.
Final Grade Policy
The weighted average will determine your letter grade roughly as follows; +/- grades will be given for borderline cases.
- A >= 90%
- B >= 80%
- C >= 70%
- D >= 60%
- F < 60%
All grades will be posted on Canvas throughout the semester.
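As a rough illustration of how the cutoffs listed above combine with the 60% exam-average requirement from the Test 1 and Test 2 section, here is a minimal Python sketch. The function name `letter_grade` and its parameters are hypothetical. The sketch ignores the +/- adjustments for borderline cases, and it assumes that not meeting the exam requirement means an F; it is not the official grade calculation.

```python
def letter_grade(weighted_percent, exam_average_percent):
    """Map a final weighted percentage to a base letter grade (no +/-).

    Both arguments are percentages on a 0-100 scale.
    """
    # Assumption: an exam average below 60% means not passing, treated as F.
    if exam_average_percent < 60:
        return "F"
    # Cutoffs taken from the Final Grade Policy list, highest first.
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if weighted_percent >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    print(letter_grade(91.0, 75.0))
    print(letter_grade(85.0, 55.0))
```

In this sketch a student with an 85% weighted average but a 55% exam average does not pass, which shows why the exam requirement is checked before the cutoffs.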
Schedule and Workload
See the Schedule Page for the schedule and assignments.
This is an upper division / graduate computer science course. It is hard, and there will be a lot of work. You will sometimes have multiple assignments at a time and be responsible for managing the deadlines. Expect to spend 4-6 hours per week outside of class on this course (Tulane policy is 1-2 hours per hour in class).
If you need help please check the discussion board on Canvas! We check it regularly to answer common questions on projects and homeworks. The solution to your question might already be there!
Students are reminded to make use of office hours. Please reach out to any of the course staff whenever you need and we can make appointments to meet if you require it.
Attendance
Students are required to attend all classes and labs (either in person or virtually) unless they are ill or prevented from attending by exceptional circumstances and with a valid excuse note. Students are responsible for notifying instructors about absences that result from serious illnesses, injuries, or critical personal problems. Students with frequent absences will be reported and/or removed from the course according to university policy.
If a student cannot attend class for any reason, the student is responsible for communicating with their instructors to make up any work they may miss. Faculty will provide online options for class participation, outlined in this document, and unless a student is seriously ill, they are expected to use this option. The University Health Center will provide documentation verifying a student is ill, as well as verification that a student may return to class. With the approval of the Newcomb - Tulane College dean, an instructor may have a student who has excessive absences involuntarily withdrawn from a course with a WF grade after written warning at any time during the semester.
Other Course and Tulane Policies
Use of Electronic Devices
Please silence your cellphones during class. If you want to use a laptop or other device with a large screen for note taking, please sit in the back rows of the classroom, because it is distracting to other students; see https://www.scientificamerican.com/article/students-are-better-off-without-a-laptop-in-the-classroom/
Note: On Lab days you will need to bring your laptop to class to work on the labs and engage with the work for the day.
Student Support Services
As we move to remote/hybrid teaching Tulane has moved a number of student success resources online. Please visit the Virtual Learning Student Support Pages for more information.
Please come talk to us if you feel you are behind or overwhelmed in this class. We can work with you and Tulane provides a suite of services to help you succeed in this course including the following. For more information please visit the student support services webpage.
- Academic Advising - Advising maximizes student potential by sharing information, tools, and resources that empower them to make informed decisions about creating appropriate plans to achieve their academic goals.
- Academic Learning & Tutoring - The ALTC supports students through supplemental instruction, peer tutoring, writing and presentation consultations, pop-up review sessions, study space, and online learning resources.
- Case Management and Counseling - Students can leverage support services such as CAPS for Counseling Services, Case Management and Victim Support Services, and Goldman Center for Student Accessibility.
- Success Coaching - Coaches help students create actionable steps to meet goals on topics such as college transition, time-management, motivation, testing anxiety, stress management, and decision-making.
ADA / Accessibility Statement
Tulane University strives to make all learning experiences as accessible as possible. If you anticipate or experience academic barriers based on your disability, please let me know immediately so that we can privately discuss options. I will never ask for medical documentation from you to support potential accommodation needs. Instead, to establish reasonable accommodations, I may request that you register with the Goldman Center for Student Accessibility. After registration, make arrangements with me as soon as possible to discuss your accommodations so that they may be implemented in a timely fashion. Goldman Center contact information: goldman@tulane.edu; (504) 862-8433; http://accessibility.tulane.edu.
Recording of Class Sessions
Recording class sessions: Classes will be recorded and the recordings will be posted to Canvas. Students may not post a class recording elsewhere, either wholly or in part. Instructors may use a class recording in another course or in a subsequent semester.
Code of Academic Conduct and Academic Integrity
This course will follow Tulane’s Code of Academic Conduct. Cheating will be reported to the Associate Dean of Newcomb-Tulane College. Discussion is encouraged. However, what you turn in must be your own. You may not read another classmate’s solutions or copy a solution from the web. I will be running checks on the code turned in for plagiarism. If plagiarism is detected the minimum penalty is a 0 on the assignment and being reported, however, you may automatically fail this course at the discretion of the instructor or Honor Board.
For more information about the honor board process and the code of academic conduct please see the NTC Academic Integrity Website.
To be more clear (text from Hal Daumé III): Any assignment or exam that is handed in must be your own work (unless otherwise stated). However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else’s solution, you are cheating. If you let someone else copy your solution, you are cheating (this includes posting solutions online in a public place). If someone dictates a solution to you, you are cheating.
Everything you hand in must be in your own words, and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and general issues relevant to the assignments. We also encourage the use of online resources to understand and clarify things, but not taking results verbatim. When taking an exam, you must work independently. Any collaboration during an exam will be considered cheating. Any student who is caught cheating will be given an F in the course and referred to the University Office of Student Conduct. Please don’t take that chance – if you’re having trouble understanding the material, please let me know and I will be more than happy to help.
The Code of Academic Conduct applies to all undergraduate students, full-time and part-time, at Tulane University. Tulane University expects and requires behavior compatible with its high standards of scholarship. By accepting admission to the university, a student accepts its regulations (i.e., Code of Academic Conduct and the Code of Student Conduct) and acknowledges the right of the university to take disciplinary action, including suspension or expulsion, for conduct judged unsatisfactory or disruptive.
Equity, Diversity, and Inclusion Statement (EDI)
Equity, diversity, and inclusion (EDI) are important Tulane values that are key drivers of academic excellence in our learning environments. In our drive for academic excellence, we seek to ensure that students, faculty, and staff across diverse social identities, cultural backgrounds, and lived experiences can thrive, especially those from underrepresented and underserved communities (e.g., race/ethnicity, gender identity and expression, sexual orientation, disability, social class, international, veterans, religious minorities, age, and any other classification protected by applicable law; see Tulane’s Nondiscrimination Policy). In order to build a supportive culture and climate for every member of our community, we recognize that we each have unique EDI strengths to share with others and that we also have areas for EDI growth, learning, and change. This EDI commitment and cultural humility helps us collectively build a university community and culture where everyone experiences belonging.
Religious Accommodation Policy
Per Tulane’s religious accommodation policy as stated at the bottom of Tulane’s academic calendar, we will make every reasonable effort to ensure that students are able to observe religious holidays without jeopardizing their ability to fulfill their academic obligations. Excused absences do not relieve the student from the responsibility for any course work required during the period of absence. Students should notify the instructor within the first two weeks of the semester about their intent to observe any holidays that fall on a class day or on the day of the final exam.
Title IX
Tulane University recognizes the inherent dignity of all individuals and promotes respect for all people. As such, Tulane is committed to providing an environment free of all forms of discrimination including sexual and gender-based discrimination, harassment, and violence like sexual assault, intimate partner violence, and stalking. If you (or someone you know) has experienced or is experiencing these types of behaviors, know that you are not alone. Resources and support are available: you can learn more at http://allin.tulane.edu. Any and all of your communications on these matters will be treated as either “Confidential” or “Private” as explained in the chart below. Please know that if you choose to confide in me I am mandated by the university to report to the Title IX Coordinator, as Tulane and I want to be sure you are connected with all the support the university can offer. You do not need to respond to outreach from the university if you do not want. You can also make a report yourself, including an anonymous report, through the form http://tulane.edu/concerns.
Confidential | Private |
---|---|
Except in extreme circumstances, involving imminent danger to one’s self or others, nothing will be shared without your explicit permission. | Conversations are kept as confidential as possible, but information is shared with key staff members so the University can offer resources and accommodations and take action if necessary for safety reasons. |
Counseling and Psychological Services (CAPS): (504) 314-2277 or The Line (24/7): (504) 264-6074 | Case Management and Victim Support Services: (504) 314-2160 or srss@tulane.edu |
Student Health Center: (504) 865-5255 | Tulane University Police (TUPD): Uptown - (504) 865-5911. Downtown – (504) 988-5531 |
Sexual Aggression Peer Hotline and Education (SAPHE): (504) 654-9543 | Title IX Coordinator: (504) 314-2160 or msmith76@tulane.edu |