TELL FROM THE DATA

Data Mining and Visualization of the StudentLife Dataset.

A large percentage of college students experience high stress levels. At CMU, more than 80% of students felt overwhelmed during the spring 2017 semester. High stress levels can undermine students' mental and physical health. What factors affect the stress level of college students? How can we monitor and predict student stress levels and help students cope proactively? To answer these questions, our team analyzed and visualized the StudentLife dataset and built a machine learning prediction model based on it.

- Data Cleaning
- Data Mining and Visualization

- Matlab
- Wrangler | Weka
- Data Visualization with D3
- StoryTelling

1

The Problem and Our Goal

A large percentage of college students experience high stress levels, which can undermine their mental and physical health. Students may be aware that they are stressed without knowing what is actually causing it. They may also know their own experience of stress without being exposed to the complexities of their peers' experiences.

"There are just too many factors that results in stress. It's hard to tell where to start"

How can we help students and the people who care about them understand their stress status and potential strategies for coping with stress at an individual level?

We believe that understanding and visualizing student behavior patterns related to stress can better inform students and associated stakeholders. We chose the StudentLife Dataset, which contains data collected from 48 undergraduate and graduate students at Dartmouth, to find common student patterns associated with high levels of stress, build a stress model, and test that model. For example, do students tend to sleep less and work out less when they report high levels of stress? Do students who describe themselves as very anxious report high levels of stress more often? What factors are likely to cause a student to be stressed?

Identifying and predicting these factors would help students become aware of the impact of their personality and behavior patterns on their stress levels, so that they can proactively adjust their behavior and seek help promptly. This effort can also help university staff and health advocates understand students' high-stress behaviors and offer support. Finally, it can foster empathy: students can better identify and understand the stress behaviors of their classmates.

2

Data Mining and Visualization

The StudentLife Dataset contains data collected from 48 undergraduate and graduate students at Dartmouth over the 10-week spring term. The dataset combines automatically sensed data, self-reports, and survey data to provide a holistic view of the day-to-day student experience, including sleep, activity, meals, academic performance, location, and stress.

In the dataset, students’ stress levels were measured in two ways. First, the Perceived Stress Scale (PSS) was used before and after the term to evaluate students’ overall stress level. Students answered ten questions in each of the pre- and post-term surveys. Second, a daily self-report of stress on a 5-point scale was used to measure and track students’ stress level throughout the term. Students completed this self-report on their smartphones.

First, we wanted to learn about the distribution of students’ stress levels before and after the term to get a general understanding of how much stress students at Dartmouth College experience.

We grouped students into low, medium, and high stress according to their PSS scores. According to Cohen’s description of the scale (1988), scores around 13 are considered average and scores of 20 or higher are considered high stress. Because college students might be more stressed than the general population, we adjusted our criteria according to Cohen’s norm for people aged 18-29 (N = 645, mean = 14.2, SD = 6.2). Students who scored below 20.4 (mean + 1 SD) were considered low stress, scores between 20.4 and 26.6 (mean + 2 SD) were considered medium stress, and scores above 26.6 were considered high stress. General results showed that:

Most students in the dataset were highly stressed compared with the norm (general population). The general stress level did not change significantly over the term.
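The grouping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_pss` is hypothetical, and the handling of scores that fall exactly on a cutoff is our assumption, since the text does not specify it.

```python
# Norm quoted in the text for people aged 18-29 (mean = 14.2, SD = 6.2).
NORM_MEAN = 14.2
NORM_SD = 6.2
LOW_CUTOFF = NORM_MEAN + 1 * NORM_SD   # about 20.4
HIGH_CUTOFF = NORM_MEAN + 2 * NORM_SD  # about 26.6


def classify_pss(score):
    """Map a PSS total score to 'low', 'medium', or 'high' stress.

    Assumption: a score exactly on a cutoff counts as medium; the text
    does not say how boundary values were handled.
    """
    if score < LOW_CUTOFF:
        return "low"
    if score <= HIGH_CUTOFF:
        return "medium"
    return "high"
```

Because PSS totals are whole numbers, the boundary assumption does not affect any real score: 20 is low, 21 through 26 are medium, and 27 or more is high.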

This heat map shows individual students' stress patterns over time. In this analysis, we wanted to monitor students’ daily stress levels and compare their daily stress scores with their PSS scores. We picked a few representative students based on their PSS scores as examples. Students 32, 22, 49, and 57 are among the least stressed students. Students 16, 17, 33, and 52 are among the most stressed students. **Click on the ID label to view details.**

The heat map shows that students’ stress levels are associated with both their PSS scores and their context (day of week and week of term). For example, Student 17 is severely stressed every day and also scored high on the PSS. Students 22 and 32 are seldom stressed and also scored low on the PSS. For Student 33, who scored high on the PSS, the second half of the term is more stressful than the first half.

Counter-intuitively, Student 57 scored rather low on the stress questionnaire (pre- and post-test) but was still frequently stressed in the middle of the term. This indicates that, besides personality, context may also influence students' perceived stress level. Our observations indicated that:

Changes in stress level varied greatly between individuals and might be event-related.

Our further analysis showed that lifestyle, coursework, and personality contributed significantly to stress.

After we dug into students' phone sensor data and daily survey results, we found some interesting behavior patterns. An early exploration showed that students became less active as the term progressed.

This graph shows how students’ activity changed during the term. Students’ activities were collected by smartphone sensors and classified into four categories: Stationary, Walking, Running, and Unknown. In this graph, we divided the term into three phases by month. The first month was the beginning of the term, the second month included the midterm weeks, and the third month included the final weeks. Students’ walking and running time decreased in the second and third months compared with the first month, which indicates that they were less active as the term progressed. However, there was no evidence to suggest that these activity changes were significantly associated with stress.
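As a rough illustration of the phase comparison described above, the sketch below computes the share of sensor readings labelled Walking or Running in each phase. The input format (pairs of date and label), the function names, and the use of 30-day blocks to approximate months are all assumptions made for this example; the text does not describe how the original aggregation was implemented.

```python
from collections import defaultdict
from datetime import date

ACTIVE_LABELS = {"Walking", "Running"}


def phase_of(day, term_start):
    """Return 1, 2, or 3 for the first, second, or third month of the term.

    Assumption: a month is approximated as a 30-day block counted from
    the first day of term, and `day` is never earlier than `term_start`.
    """
    offset = (day - term_start).days
    return min(offset // 30, 2) + 1


def active_share_by_phase(samples, term_start):
    """Return {phase: fraction of readings labelled Walking or Running}.

    `samples` is an iterable of (date, label) pairs, one per reading.
    """
    totals = defaultdict(int)
    active = defaultdict(int)
    for day, label in samples:
        phase = phase_of(day, term_start)
        totals[phase] += 1
        if label in ACTIVE_LABELS:
            active[phase] += 1
    return {phase: active[phase] / totals[phase] for phase in sorted(totals)}
```

A falling share from phase 1 to phase 3 would correspond to the decline in activity visible in the graph.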

Our exploration of students’ sleep time and coursework was more interesting.

The number of deadlines and sleep hours were correlated, and both were associated with stress level: more deadlines went with higher stress, and more sleep went with lower stress.

**Hover over the graph below to view details.**

This graph shows the average stress level, number of deadlines, and sleep hours for the whole group of students across the term. According to the graph, students report less stress when they sleep longer. Most deadlines fall on Mondays and Tuesdays. There is a weekly stress peak around Sundays and Mondays, which suggests that the day of the week influences when deadlines fall, which in turn influences stress.
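One way to quantify associations like those described above is a Pearson correlation over the daily group averages. The text does not state which measure was actually used, so the helper below is only a sketch, and the name `pearson_r` is ours.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Passing the daily average sleep hours and the daily average stress level would give a negative value if longer sleep goes with lower stress, and passing the number of deadlines and stress would give a positive value if deadlines go with higher stress.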

Students with different personalities may differ in their sensitivity to stress. The two personality graphs compare the Big Five results of students who scored low on the PSS (pre- and post-test) with those of students who scored high. According to the results, low-stress students scored lower in neuroticism and higher in conscientiousness and agreeableness.

From our observation, neuroticism is associated with perceived stress. This is reasonable, since individuals who score high on neuroticism are more likely than average to be moody and to experience feelings such as anxiety, worry, fear, anger, frustration, envy, jealousy, guilt, depressed mood, and loneliness. We use students’ scores on this trait in our Bayesian network model.

3

Predicting Student Stress

To help students monitor their stress level, we created a prediction model based on our previous findings.

From our previous observations, we found that students’ stress level is affected by the following factors:

**Personality**: Personality acts as an internal factor. We found that students who scored high on Neuroticism on the Big Five scale were more inclined to feel stressed.

**Lifestyle**: Students’ lifestyle, including sleep hours and daily activity, may also affect their perceived stress level. We observed from Figure 7 that students’ sleep hours were negatively associated with stress level. From their phone sensing data, we also found that students spent less time walking and running during the second half of the term, suggesting that they exercise less as the term gets busier.

**Coursework**: In addition to personality and lifestyle, coursework and school context are other significant factors. Based on our observation from Figure 7, students’ stress level is positively associated with the number of deadlines.

We considered all of these factors when building our machine learning model.

The model was tested on a set of withheld data from the original dataset, consisting of stress reports from five students throughout the term. Many of these daily stress reports had missing data for some of the attributes. Where at least half of the attributes were present, the model was run with maximum likelihood estimates for the remaining attributes.
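The missing-data rule described above could look roughly like the sketch below. It rests on a simplifying assumption that is ours, not the source's: each attribute is treated as categorical and filled with its most frequent training value, which is the most probable value under a maximum likelihood fit of that attribute on its own. The function names and the dictionary-based report format are hypothetical.

```python
from collections import Counter


def most_frequent_values(training_rows, attributes):
    """For each attribute, return its most common non-missing training value."""
    modes = {}
    for attr in attributes:
        counts = Counter(row[attr] for row in training_rows
                         if row.get(attr) is not None)
        if counts:
            modes[attr] = counts.most_common(1)[0][0]
    return modes


def impute_report(report, attributes, modes):
    """Fill missing attributes of one stress report, or return None to skip it.

    A report is kept only when at least half of the attributes are present.
    """
    present = [a for a in attributes if report.get(a) is not None]
    if len(present) * 2 < len(attributes):
        return None
    return {a: report[a] if report.get(a) is not None else modes.get(a)
            for a in attributes}
```

Reports with fewer than half of their attributes present are dropped rather than filled, which matches the threshold stated in the text.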

The distribution of output stress values was compared to the distribution of stress values from the data. Assuming a Poisson distribution, the maximum likelihood estimate for the mean stress value was determined for each distribution to determine the bias of the model.

Assuming low stress = 1, medium stress = 2, and high stress = 3, the testing dataset had an expected mean of 2.04, while the model output had an expected mean of 2.16. The model therefore tends to predict slightly higher stress than students actually reported. This may be due to bias in the training dataset or bias introduced through the imputation process.
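For a Poisson distribution, the maximum likelihood estimate of the mean is simply the sample mean, so the bias calculation above reduces to a difference of two averages. The sketch below shows this with the level coding from the text; the function names are hypothetical.

```python
# Level coding stated in the text.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def poisson_mle_mean(values):
    """Maximum likelihood estimate of a Poisson mean: the sample mean."""
    if not values:
        raise ValueError("need at least one value")
    return sum(values) / len(values)


def mean_stress(labels):
    """Estimated mean stress for a list of 'low'/'medium'/'high' labels."""
    return poisson_mle_mean([LEVELS[label] for label in labels])


def model_bias(predicted_labels, observed_labels):
    """Positive when the model predicts higher stress than was reported."""
    return mean_stress(predicted_labels) - mean_stress(observed_labels)
```

With the means reported above, the bias is 2.16 - 2.04 = 0.12 on the 1 to 3 scale.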

A large-sample z-test per Mathews (2010) was used to compare these distributions. There was no statistically significant difference between the model output and the testing dataset, but only small datasets were available for this comparison. More data and testing will be required to determine how well the model's distribution matches that of the data.
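The text does not reproduce the test statistic, so the sketch below uses one standard large-sample form for comparing two Poisson means, which may differ in detail from the procedure in Mathews (2010). It estimates the variance of each sample mean as mean / n, using the Poisson property that the variance equals the mean. The function name and the sample sizes in the usage note are assumptions.

```python
import math


def poisson_z_test(mean_a, n_a, mean_b, n_b):
    """Two-sided large-sample z-test for equality of two Poisson means.

    Returns (z, p_value). Each sample mean's variance is estimated as
    mean / n, since a Poisson variable's variance equals its mean.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    standard_error = math.sqrt(mean_a / n_a + mean_b / n_b)
    if standard_error == 0:
        raise ValueError("both means are zero; the test is undefined")
    z = (mean_a - mean_b) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `poisson_z_test(2.16, 100, 2.04, 100)` compares the two reported means under a purely hypothetical sample size of 100 reports each; the actual sample sizes are not given in the text, so the resulting p-value is illustrative only.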

What is your stress level? Try it out!

Based on our model, we can help you identify your stress level and the main contributors to your stress. Please enter your data as instructed.