1. Introduction
Academic procrastination can be defined as postponing planned studies despite being aware of the negative consequences (Moon and Illingworth, 2005). This behavior is common among students, particularly in online education settings, in which students have to self-regulate their learning and studying (Lee and Choi, 2011). Although there is no formal quantitative definition of procrastination, traces of this behavior can be observed in students’ study behavior, such as cramming as deadlines approach (Perrin et al., 2011). However, despite the negative side effects of procrastination on students, such as on their academic performance and psychological well-being (Steel, 2007), dynamic data-driven approaches that can model these indicator behaviors in students are scarce.
Past research has mainly described student procrastination by summarizing student activities into static features (Perrin et al., 2011; Cerezo et al., 2017), which cannot fully represent the dynamics of students’ behavior through time. More recently, sequential models of student behaviors have been used in the study of procrastination behaviors (Park et al., 2018; Yao et al., 2020). However, these models fail to capture an important aspect of cramming behavior: its relation to triggers such as course deadlines and the availability of assignments, as given by the class schedule. Additionally, these models are not personalized and do not take into account factors related to individual students, such as their studying habits. Finally, they cannot deal with missing activity data and fail to estimate students’ next study times or predict their behavior in relation to various tasks and assignments. An ideal student activity model should be able to capture students’ responses to the major events in the course, be personalized to learn student-specific behaviors, and be able to predict students’ future activity intensities as a way of detecting procrastination early, even if student sequence data is not completely observed.
Meanwhile, Hawkes processes (Hawkes, 1971), a family of point processes, have shown great potential in dealing with complicated sequential data in many real-world applications, including in the education domain (Yao et al., 2020). However, the state-of-the-art Hawkes process models used in the education domain suffer from the above limitations, for two main reasons. First, external stimuli and their triggering effects are conventionally parameterized as a constant, which ignores factors such as the class schedule and personalized student habits. For example, an assignment deadline, as a form of external stimulus, may only start to show its triggering effect when it is approaching. Students’ personal habits (e.g. login time and frequency), which reflect their time management skills, could also evolve over time. Second, the majority of Hawkes process models treat different sequences independently. As a result, only future activities of sequences with historical observations can be predicted, whereas the future of unobserved sequences cannot be inferred.
To address the above-mentioned limitations, we propose the stimuli-sensitive Hawkes process (SSHP), which models the external course stimuli in addition to the internal activity stimulus, is personalized, and can predict students’ next activity times for each assignment. In SSHP, we represent the activities in each student-assignment pair as a Hawkes process. To tackle the first limitation, our model is designed to capture three types of external stimuli as parameterized functions of time: the effect of assignment availability, the assignment deadline, and each student’s personal study time and frequency habits. To deal with the second limitation, SSHP jointly models all student-assignment pairs, imposing a low-rank structure between student and assignment parameters in the model. As a result, it can learn a personalized parameterization, even for unobserved sequences, based on the similarities shared between the students as well as the assignments. Our extensive experiments on two synthetic datasets and two real-world datasets show a significant performance improvement in future activity predictions compared with the state-of-the-art models, both when a sequence’s data is partially missing and when it is completely missing. We perform ablation studies on SSHP and show that all aspects of our model, including the external and internal parts, contribute to its superior performance. Finally, we show the meaningful procrastination patterns that are captured by the SSHP parameters, using clustering analysis and studying their associations with student performance in the course.
2. Related Work
Procrastination Modeling in Education Domain. As there is no quantitative definition of procrastination behavior, most of the recent educational data mining literature summarizes procrastination-related behavior by curating time-related features from student interactions in the course. These studies aim to evaluate the relationships between these time-related features and student performance and do not model temporal aspects of procrastination (Baker et al., 2016; Cerezo et al., 2017; Kazerouni et al., 2017; Agnihotri et al., 2020). For example, Asarta et al. examined students’ log data from an online course using measures such as anti-cramming, pacing, completeness, etc. (Asarta and Schmidt, 2013). However, such methods are static and cannot describe students’ varying behaviors over time. As another example, Park et al. classify students into procrastinators and non-procrastinators by formulating a measure via a mixture model of per-day student activity counts during each week of the course (Park et al., 2018). However, this approach is not able to model non-homogeneously spaced deadlines in a course. As none of these models consider the timing of students’ activities, they are not able to predict when future activities will happen. Sequential data modeling via point processes could potentially deal with this limitation; however, it has not been applied to procrastination modeling until recently. To the best of our knowledge, the most related attempt that is comparable to ours has been made in (Yao et al., 2020), where Yao et al. modeled each student’s activity sequence as a Hawkes process and related procrastination to the mutual excitation among activity types. That work does not predict students’ unseen activities; rather, it proposes a procrastination measure based on the learned parameters, which was shown to be better correlated with students’ grades than conventional delay measures.
Hawkes process and modeling scenarios. Hawkes processes, as a popular family of point processes, model two types of activities: activities that are triggered by external stimuli, and activities that are self-excited by historical activities. The intensities of these two types of activities are usually parameterized by a base rate function and an excitation function, respectively. To describe the complicated dynamics of real-world activity sequences, different state-of-the-art parameterizations have been proposed. For example, Rizoiu et al. modeled the watching history of a YouTube video as a Hawkes process, and proposed to use the number of shares of a video on YouTube scaled by a constant to represent the base rate (Rizoiu et al., 2017). In another example, Bao et al. proposed to use a sinusoidal function to capture the periodic rise-and-fall patterns of user activities on social media (Bao, 2016). More recently, neural Hawkes models have been proposed to allow higher model capacity for learning arbitrary and nonlinear distributions of the history (Du et al., 2016; Mei and Eisner, 2017; Xiao et al., 2017). For example, Du et al. proposed to use an RNN to model the arrival times of a given sequence and characterized the intensity as a function of the embedded hidden cell representations (Du et al., 2016). Even though such neural Hawkes models allow for less bias and more flexibility than the traditional parametric models, they do not provide meaningful interpretations of the activity arrival patterns, which can be important in scenarios such as procrastination analysis in educational settings.
In terms of the applications of sequential data modeling via Hawkes models, the majority of the state-of-the-art Hawkes models treat each individual sequence as an independent input; in other words, no relationship among the sequences is assumed. As a result, sequences without any observed history are usually excluded from the study. To tackle this problem, a few state-of-the-art Hawkes process approaches model all sequences jointly by assuming underlying similarity among the sequences. For example, Du et al. modeled each user-product pair, i.e. the collection of interactions of a user with a product, as a Hawkes process. By assuming similarity between users and between products, their model learns a low-rank representation of all Hawkes processes, including those that do not have any purchasing history (Du et al., 2015). Other similar approaches have also been proposed to measure sequence similarity by using auxiliary features (He et al., 2015; Li et al., 2018; Shang and Sun, 2018). However, auxiliary information in education domains is usually excluded from the data due to privacy concerns.
3. Stimuli-Sensitive Hawkes Process (SSHP)
3.1. Problem Formulation
Consider the case where there are students and assignments in a course. We assume that the time when a student interacts with an assignment depends on two things: (1) the effects of external stimuli (e.g., the deadline of the assignment is approaching, therefore the student starts to review the lectures and practice on the quizzes), and (2) the self-exciting nature of the events; in other words, past events can trigger future ones (e.g., the student decides to work on the assignment because they just watched the lecture video that is related to it). To capture these triggering effects, which can be important in explaining students’ behaviors in the course, we propose to model the collection of activity timestamps of a student’s interactions with an assignment, i.e. a student-assignment pair, as a point process (Sec. 3.2), characterized by a function that captures both the effects of external stimuli and the effects of self-excitement (Sec. 3.3). All the important notations used in the following section are summarized in Tbl. 1.
Function  : intensity  Vector  
: density  
: base rate  
: selfexcitement  
: loss  Matrix  
: proximal operator  
: projection function  
Scalar  : selfexciting coef. 

: deadline  
: decay coef.  : student habit  
: deadline effects  : assignment opening  
: coef. in  : search point  
: shape parameter  Set  : event sequence  
: peak in  : observed sequences  
: base of  : matrix parameter set  
: offset in  : vector parameter set 
3.2. Modeling Student-Assignment Activity Timestamps
Formally, we describe a student-assignment pair by the timestamps of all of the student’s interactions with the assignment. (For simplicity, and without causing confusion, we omit the individual student and assignment subscripts in the rest of this section.) Given a time, the history denotes the observations in the sequence up to, but not including, that time, ending with the last event that took place before it. If the conditional p.d.f. (probability density function) of the next event’s time is defined given this history, the joint p.d.f. for a realization follows: (1)
The above conditional p.d.f. is one way to characterize a particular Hawkes process; however, it can be difficult to use for model design and interpretability (Daley and Vere-Jones, 2007). Alternatively, in this work, we adopt a more commonly used function for the characterization of Hawkes processes, i.e. the conditional intensity function, which can be shown to be a function of the conditional p.d.f. and its corresponding cumulative distribution function: (2)
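For reference, the two relationships described above have a standard textbook form for point processes. The sketch below uses assumed notation (events $t_1 < \dots < t_n$, history $\mathcal{H}_t$, conditional density $f^*$, conditional distribution function $F^*$, conditional intensity $\lambda^*$); these symbols may differ from the ones used in Eqs. 1 and 2.

```latex
% Assumed notation: f^*(t) = f(t \mid \mathcal{H}_t), and F^*(t) is its cumulative distribution function.
\begin{align}
  f(t_1, \dots, t_n) &= \prod_{i=1}^{n} f^*(t_i), \\
  \lambda^*(t) &= \frac{f^*(t)}{1 - F^*(t)} .
\end{align}
```

Intuitively, $\lambda^*(t)\,dt$ is the probability that the next event falls in $[t, t + dt)$ given that it has not happened before $t$.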
3.3. Parameterization of External Stimuli and Self-excitement
As mentioned above, we assume that there are two types of activities in the sequence of a given student-assignment pair, i.e., activities that are excited by the external stimuli, and those that are self-excited by previous activities. The intensities of both types of activities are respectively parameterized by a base rate function and an excitation function, defined as follows:
Modeling external stimuli. We parameterize the following types of external stimuli that can trigger a student’s interactions with the assignment. Firstly, the effect of student habit: we assume that each student interacts with the course based on their own periodic studying schedule. For example, some students habitually log in to the course at noon every day, while others prefer to study after midnight. Secondly, the decaying effect of the assignment availability (opening): we assume that students’ activities can be triggered once the assignment is posted, but that this effect decays over time. For example, once an assignment is posted, students may log in and check the assignment requirements or deadlines, or revisit it later for the detailed descriptions. Over time, however, this effect dies out and is dominated by other stimuli. Finally, the deadline of an assignment: we assume that student activities can be triggered by the deadline, and that this effect gets stronger as the deadline approaches and eventually wears off.
Formally, we define the base rate intensity for a student at each time as a combination of the above stimuli, as in Equation 3.
(3)  
(4)  
(5)  
(6) 
Specifically, Eq. 4 models the activity intensity triggered by student habits as a sinusoidal function. In other words, it captures periodicity of a given length that peaks at a given time. Its base term can be interpreted as the minimum number of activities triggered by the student’s habits. Eq. 5 models the opening effect of the assignment as an exponential function that decays at a given speed over scaled time. This formulation results in exponentially fewer activities attributable to the assignment posting as time passes. Eq. 6 models the effect of the deadline via a reversed log-normal function. Here, the time of the assignment deadline is known, and a second time marks the point at which the deadline’s triggering effect on student activities is over. The offset parameter represents the difference between the end of the deadline’s effect and the deadline. If the effect of the deadline is over after the actual time of the deadline (e.g. late submission), the offset is negative; otherwise, it is non-negative. The non-negative shape parameter controls how intense the activities are close to the deadline and how fast this effect decays after the peak. This formulation represents that student activity intensities will peak around their last assignment-related activity, which is close to the deadline, either before or after it. Finally, three weight coefficients describe the importance of the habit, opening, and deadline effects, respectively.
Modeling internal stimuli. To model the effect of past activities, we adopt the following conventional selfexcitation function used in point processes:
(7) 
The above excitation function characterizes the effect of each historical event on the current time as a decaying function of the time difference between the two, with a given decay speed. Therefore, the more recent a historical event is, the more effect it has in terms of self-excitation. The coefficient of this function can be shown to be the branching ratio under this definition, i.e. the expected number of activities that are triggered by a given activity. It is therefore called the self-exciting coefficient.
Intensity function. Finally, our intensity function for one student-assignment pair can be defined as follows:
(8)  
As we can see, the intensity is the combination of the base rate function, which models external stimuli, and the excitation function, which models self-excitement. The proposed intensity function falls into a popular family of point processes, i.e. Hawkes processes, which conventionally model the effect of all external stimuli as a constant. As our proposed model parameterizes the effects of different external stimuli in educational settings as functions of time, we call our model the Stimuli-Sensitive Hawkes process model (SSHP).
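The structure of the intensity (three time-varying external stimuli plus exponential self-excitation) can be illustrated with a minimal Python sketch. The exact functional forms of Eqs. 3 to 8 are not recoverable here, so every form and every name below (`habit_rate`, `opening_rate`, `deadline_rate`, the dictionary keys, the `alpha * beta` kernel scaling) is an assumption chosen only to match the verbal description: a sinusoid with a base term, an exponential decay after opening, and a log-normal density evaluated on the time remaining until the deadline effect ends.

```python
import math

def habit_rate(t, period, peak, base):
    """Sinusoidal habit effect: repeats every `period` and is largest at `peak`."""
    # base >= 1 keeps the value non-negative because cos(.) >= -1
    return math.cos(2.0 * math.pi * (t - peak) / period) + base

def opening_rate(t, t_open, decay):
    """Exponentially decaying effect of the assignment being posted at t_open."""
    if t < t_open:
        return 0.0
    return math.exp(-decay * (t - t_open))

def deadline_rate(t, deadline, offset, shape):
    """Reversed log-normal: a log-normal density evaluated on the time left
    until the deadline effect ends (deadline - offset); zero afterwards."""
    remaining = (deadline - offset) - t
    if remaining <= 0.0:
        return 0.0
    z = math.log(remaining) / shape
    return math.exp(-0.5 * z * z) / (remaining * shape * math.sqrt(2.0 * math.pi))

def self_excitation(t, history, alpha, beta):
    """Exponential kernel; with this scaling alpha is the branching ratio."""
    return alpha * beta * sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)

def intensity(t, history, p):
    """Base rate (weighted external stimuli) plus self-excitation."""
    base = (p["w_habit"] * habit_rate(t, p["period"], p["peak"], p["base"])
            + p["w_open"] * opening_rate(t, p["t_open"], p["decay"])
            + p["w_deadline"] * deadline_rate(t, p["deadline"], p["offset"], p["shape"]))
    return base + self_excitation(t, history, p["alpha"], p["beta"])
```

In this sketch a negative `offset` moves the end of the deadline effect past the deadline itself, which corresponds to the late-submission case described in the text.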
Matrix representation for all student-assignment pairs. Equation 8 above represents the intensity function for the activities of one student on one assignment. To model all student activities on all assignments, one could model them as separate sequences and learn the parameters for each sequence independently. However, this kind of model has two limitations. Firstly, no parameters can be learned for student-assignment sequences that are completely unobserved, and thus student activities in such sequences cannot be predicted. For example, consider a student who has not started working on a future assignment by the end of the observation window, or a student who skips an assignment for now and plans to come back to it later. Excluding these sequences from the study largely limits the capacity of the model in our application. Secondly, the parameters of the model that are not assignment-related, such as student habit parameters, would be learned independently for each sequence. As a result, they would lose their meaning. A common approach to deal with these limitations is to extend the data collection window, which could be costly and inefficient. Another solution could be to take the parameters learned from the observed sequences and apply them to the sequences that do not have observations. However, such an approach cannot provide personalized inferences and is therefore not ideal.
To deal with these problems while learning personalized parameters for students, we assume similarity between the learned parameters for all student-assignment pairs. In particular, we represent the relationship between students and assignments as a student-assignment matrix, where each row is a student and each column represents an assignment from the course. We represent the student-assignment-related parameters of the model in such a matrix format, model the student-related parameters of the model in a vector format (so that they are shared between all assignments for a student), and share some generic parameters of the model between all students. As a result, for example, the intensity function of a student-assignment pair is defined by the parameters in the corresponding cell (the student’s row and the assignment’s column) of the parameter matrices. More specifically, the parameters are set to follow the following three structures:
(1) scalars: following the convention of Hawkes processes, we set global decay coefficient to be shared among all sequences. We also set to be a global scalar, so that time is scaled to the same unit across all studentassignment pairs.
(2) vector sets: We let the student habit parameters be vectors, assuming that a student’s habit is unchanged across the assignments. Similarly, each student’s sensitivity to the effect of assignment openings, as well as how fast their activities become intense once the deadline starts affecting them, is set to be shared among assignments.
(3) low-rank matrices: For each of the remaining parameters, we consider a matrix format and assume similarity among student-assignment pairs, i.e. a low-rank structure on the matrix format.
3.4. Objective Function
Maximum likelihood estimation on one sequence. Given the historical activities of a student-assignment pair over the observation period and a parameter set, the likelihood is the joint probability of observing all historical events up to the end of that period, which has the following form (Daley and Vere-Jones, 2007): (9)
where the two functions involved are, respectively, the p.d.f. defined in Eq. 1 and the intensity function in Eq. 8. Directly taking the log of the above equation to obtain the log-likelihood entails complexity that is quadratic in the number of events, due to the double summations, i.e. the summation in Eq. 8 combined with the summation term introduced by the log of the product from Eq. 9. To achieve a more feasible, linear complexity, we use the recursive function defined as follows:
(10) 
As a result, the final explicit form of loglikelihood can be shown as below:
(11)  
The cumulative intensity terms, introduced by the integral in Eq. 9, can be obtained as below:
(12)  
(13)  
(14) 
where erf(·) is the Gauss error function.
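The recursion that removes the double summation can be sketched as follows for an exponential kernel. This is a minimal illustration, not the paper's Eq. 10 or Eq. 11: the kernel scaling `alpha * beta * exp(-beta * dt)`, the function names, and the `base_integral` argument (standing in for the cumulative base rate over the observation window) are all assumptions.

```python
import math

def excitation_sums(times, beta):
    """A_i = sum_{j < i} exp(-beta * (t_i - t_j)), computed in one pass via
    A_i = exp(-beta * (t_i - t_{i-1})) * (1 + A_{i-1}), with A_1 = 0."""
    sums, acc, prev = [], 0.0, None
    for t in times:
        if prev is not None:
            acc = math.exp(-beta * (t - prev)) * (1.0 + acc)
        sums.append(acc)
        prev = t
    return sums

def log_likelihood(times, horizon, base_rate, base_integral, alpha, beta):
    """Log-likelihood of one sequence observed on [0, horizon].

    base_rate:     callable giving the external-stimuli rate at time t
    base_integral: integral of base_rate over [0, horizon] (assumed given)
    """
    sums = excitation_sums(times, beta)
    # sum of log-intensities at the observed events
    event_term = sum(math.log(base_rate(t) + alpha * beta * a)
                     for t, a in zip(times, sums))
    # integral of the self-excitation part over [0, horizon]
    excite_integral = alpha * sum(1.0 - math.exp(-beta * (horizon - t)) for t in times)
    return event_term - base_integral - excite_integral
```

Each event is visited once, so the cost grows linearly with the number of events instead of quadratically.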
Matrix representation of all sequences. Thus far, one could model a single student-assignment pair via SSHP based on its historical observations by maximizing the log-likelihood function defined in Eq. 11. However, as mentioned in the previous section, we represent some of the parameters in a matrix format for all student-assignment pairs and assume similarity among them, i.e. a low-rank structure on the matrix. Specifically, we distinguish the set of vector parameters from the set of matrix parameters, and impose a low-rank structure on every matrix parameter in our objective function. Using the trace norm as a surrogate for low-rank structure, we constrain the trace norm of each matrix parameter to be small.
Loss for all sequences. Finally, we can formulate the objective function as follows, based on the collection of observed sequences :
(15)  
s.t.  
The main objective is the negative log-likelihood of observing all sequences with events, while the non-negativity constraint on the self-exciting coefficient is introduced to fit the definition of Hawkes processes, in which sequences are self-exciting. All coefficients of the three types of external stimuli are also set to be non-negative. The base of the habit function is bounded from below to ensure that the effect of student habit stays non-negative under the sinusoidal function, and each shape parameter of the reversed log-normal function needs to be positive. Each cell of the opening-effect matrix is constrained to an interval that ensures the effect of assignment opening is decaying rather than increasing or unchanged. We also constrain the trace norm of each matrix-format parameter to be small, which serves as a surrogate for bounding its rank.
3.5. Parameter Inference
We adopt the Accelerated Gradient Method (AGM) (Nesterov, 2013) framework for parameter inference. We choose it for its faster convergence rate, especially when the constraints include both a non-smooth trace norm and non-negativity. The key subroutines of AGM in our model can be summarized as follows. For a matrix-format parameter, the objective is to compute the proximal operator:
(16)  
Here, the update involves the step size, the current search point of the parameter, and the gradient of the loss with respect to that parameter. A projection function makes sure the parameter value at each step is properly constrained. More specifically, for all matrix parameters the projection is a composition of two steps, where the inner one is a trace projection (Cai et al., 2010) and the outer one projects negative values to zero. Similarly, the key subroutine for the inference of the vector parameters is shown as follows:
(17)  
The projection used here is also a function that makes sure the constraint on the parameter is met. When a value falls outside the constrained interval, it is projected to the closest value within the interval.
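The two projection subroutines described above can be sketched with NumPy. This is a hedged illustration rather than the paper's Algorithm 1: it shows a single projected gradient step (without the acceleration/momentum bookkeeping of AGM), the trace projection is implemented as singular value soft-thresholding in the spirit of Cai et al. (2010), and the names `trace_norm_prox`, `matrix_step`, `vector_step`, and the `penalty` argument are assumptions.

```python
import numpy as np

def trace_norm_prox(M, threshold):
    """Singular value soft-thresholding: shrink every singular value by
    `threshold` and drop those that fall below zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - threshold, 0.0)
    return (U * s) @ Vt

def matrix_step(search_point, grad, step, penalty):
    """One update for a matrix parameter: gradient step, trace projection
    (inner), then projection of negative entries to zero (outer)."""
    moved = search_point - step * grad
    return np.maximum(trace_norm_prox(moved, step * penalty), 0.0)

def vector_step(search_point, grad, step, lower, upper):
    """One update for a vector parameter: gradient step, then projection of
    out-of-range values to the closest point of [lower, upper]."""
    return np.clip(search_point - step * grad, lower, upper)
```

Applying the non-negativity projection after the thresholding mirrors the inner/outer ordering described in the text.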
We also present Algorithm 1 to effectively solve the objective according to the subroutines mentioned above.
4. Experiment Setup and Baselines
In this section, we first introduce the state-of-the-art approaches that we used as baselines in Sec. 4.1. An introduction of both synthetic and real-world datasets is given in Sec. 4.2. Finally, the experiment setup, including train-test splitting and hyperparameter tuning, is presented in Sec. 4.3.
4.1. Baseline Approaches
In this work, we compare the proposed SSHP to the following baselines, which differ in model parameterization, modeling strategy (whether they can generate personalized predictions for unobserved sequences), and application scenario.
Poisson (Kingman, 2005): We use the Poisson process model as the simplest baseline, where the intensity function is characterized by the event arrival rate.
HRPF (Hosseini et al., 2018): The state-of-the-art Poisson factorization model proposed by Hosseini et al. (among the models proposed in that paper, this is the version that does not require a user network as auxiliary features). All sequences are modeled jointly; therefore, unobserved sequences can be predicted as well.
RMTPP (Du et al., 2016): A state-of-the-art neural Hawkes model that uses an RNN to model the dependencies between past and future events in a sequence. The intensity function of this Hawkes model is defined based on the hidden states. All sequences are assumed to be independent.
ERPP (Xiao et al., 2017): Another state-of-the-art neural Hawkes model, which models auxiliary features as time series. These time series and the event sequences are modeled by two separate LSTMs. Similar to RMTPP, all sequences are modeled independently.
DHPR (Hosseini et al., 2018): A variation of HRPF, in which an excitation parameter is used to capture the self-excitement in the sequences. However, the excitation is represented as a hyperparameter that is shared among all sequences.
HPLR (Du et al., 2015): The state-of-the-art user-item recommendation model using Hawkes processes. This model can be seen as an improvement of DHPR, in which the excitation parameter can be learned for all sequences.
EdMPH (Yao et al., 2020): The most recent approach that studies student procrastination using Hawkes processes. All activities of a student during the course are modeled in a sequence, independent from other sequences.
A summary of the baselines is presented in Table 2.
Model  Selfexciting 




Poisson  ✗  ✗  ✗  ✗  
HRPF  ✗  ✗  ✗  
RMTPP  ✗  ✗  
ERPP  ✗  ✗  
DHPR  ✗  ✗  
HPLR  ✗  ✗  
EdMPH  ✗  ✗  
SSHP 
4.2. Datasets
Synthetic Data. Presuming students and assignments, we created simulated student-assignment pairs, and sampled events for each pair using the Ogata thinning algorithm (Ogata, 1988), which is the most commonly used sampling method in the related literature. Specifically, we used the intensity function defined in Eq. 8 and sampled each of its parameters from normal distributions. We empirically set these distributions to approximate the intensity patterns observed in real data. For visualization, Fig. 1 shows a sequence generated with the open-source library tick (Bacry et al., 2017), in which all the parameters are set to their means. The solid blue line shows the sequence intensity, each blue dot represents a sampled activity, the dashed orange line is the base rate, and the synthetic deadline is shown as the vertical red line.
To simulate real-world scenarios, in which only some of the data sequences can be observed, we created two datasets by randomly masking 10% (named Syn10) and 90% (named Syn90) of the sequences as unobserved. In other words, 10% of the sequences in the Syn10 dataset and 90% of the sequences in the Syn90 dataset are unobserved.
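The thinning-based sampling used for the synthetic data can be sketched as below. This is a minimal stand-alone illustration, not the `tick` implementation used for Fig. 1: it assumes an exponential kernel scaled as `alpha * beta * exp(-beta * dt)` and a base rate bounded above by a known `base_max`; the function name and arguments are hypothetical.

```python
import math
import random

def sample_hawkes_thinning(base_rate, base_max, alpha, beta, horizon, rng):
    """Sample event times on [0, horizon) with Ogata-style thinning.

    base_rate: callable, external-stimuli rate at time t (must be <= base_max)
    base_max:  positive upper bound on base_rate
    """
    events = []
    t = 0.0
    excite = 0.0  # current value of the self-excitation term
    while True:
        # The excitation only decays until the next event, so this bound
        # dominates the intensity from time t onwards.
        upper = base_max + excite
        wait = rng.expovariate(upper)
        t += wait
        if t >= horizon:
            return events
        excite *= math.exp(-beta * wait)  # decay excitation to the candidate time
        if rng.random() * upper <= base_rate(t) + excite:
            events.append(t)              # accept the candidate as an event
            excite += alpha * beta        # each event adds a jump to the intensity

if __name__ == "__main__":
    rng = random.Random(0)
    seq = sample_hawkes_thinning(lambda t: 2.0 + math.sin(t), 3.0, 0.5, 1.0, 20.0, rng)
    print(len(seq), "events sampled")
```

Masking a sequence as unobserved then amounts to withholding its sampled events from training.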
Computer Science Course on Canvas Network (CANVAS). This real-world dataset is from the Canvas Network online platform (CanvasNetwork, 2016), which hosts various open courses in different academic disciplines. The computer science course we use happens during weeks. In each week, a graded assignment-style quiz is published in the course, resulting in graded course assignments. From this dataset, we obtain K timestamps of student-assignment pairs. Activities include submission activities, module learning (reading, watching videos, etc.) activities, and discussions.
Big Data in Education on Coursera (MORF). Our second real-world dataset is collected from an 8-week “Big Data in Education” course on the Coursera platform. The dataset is available through the MOOC Replication Framework (MORF) (Andres et al., 2016). In total, we extract K activities from student-assignment pairs, which contain quiz and assignment activities, watching lecture videos, and discussion-related activities.
4.3. Experiment Setup
We test our method in two scenarios according to our application: 1) when historical observations are available, we want to predict what will happen in the future based on the history, and 2) when the whole sequence of activities for a student-assignment pair is completely missing, we want to infer its future without observing its history. To test the model’s performance in predicting the future in these two scenarios, we split our data into the following sets: a training set that contains the initial historical observations and is used to train the model for parameter inference; a partially missing test set that contains the rest of the historical observations and is used for testing the first scenario; and a completely missing test set that contains the entire observations of the held-out sequences and is used to examine the models’ ability to generate personalized and accurate predictions for unobserved sequences, i.e. scenario 2.
For Syn10, we naturally set the masked sequences to be the completely missing test set. In the remaining 90% unmasked sequences, we use the first portion of the activities (i.e. synthetic past observations) for training and the later portion (i.e. synthetic future activities to be predicted) for partially missing testing. We perform a similar procedure on Syn90, with the 90% masked sequences as the completely missing test set, and the same kind of split of the remaining sequences into training and partially missing testing. For both real-world datasets, we randomly hold out a portion of the sequences to be completely missing, and for the rest of the sequences, we use the same split to generate the training and partially missing test sets.
For the baseline models that are not able to generate personalized predictions of future times without historical observations (i.e. Poisson, RMTPP, ERPP, and EdMPH), we report the root mean squared error (RMSE) of the time prediction on the partially missing test set only, and for the other models, we report the RMSE on both partially and completely missing test sets.
The hyperparameters of the proposed SSHP across all datasets are tuned via grid search on the following values: global decay ; initial step size ; update speed ; and trace norm penalty in trace norm projection . For the synthetic datasets, the best hyperparameters are set as follows: we have decay , the step size , update speed is , trace norm penalty is . In CANVAS, decay , the step size , , trace norm penalty is . In MORF, we have decay , , , and . Similarly, the hyperparameters for the baseline approaches are tuned via grid search according to the ranges provided in the original papers.
5. Fit and Arrival Time Prediction
In the following set of experiments, we study SSHP’s ability to recover the correct parameters for the underlying processes, investigate its performance in predicting the next activity time compared to the state-of-the-art baselines, and analyze the contribution of different parts of the model to its performance.
5.1. Model Fit on Synthetic Data
As a way to evaluate SSHP’s performance in capturing the sequence dynamics, we investigate its ability to find the true parameters of the underlying processes. Since these parameters are available for the synthetic datasets, we calculate the root mean squared error (RMSE) between the parameter values estimated by SSHP and the actual parameter values that were used to generate the synthetic datasets. The results are shown in Tbl. 3. Generally, SSHP performs better on the partially missing test set than on the completely missing test set. That is because the task of learning completely unobserved sequences without histories is more challenging than learning sequences with partially observed histories. Additionally, the results show that the RMSEs on the Syn90 dataset are only marginally higher than on Syn10 in both the partially and completely missing test sets. This suggests the model’s robustness and its potential to recover the parameters even when the ratio of unobserved sequences in the dataset is high.
Datasets  Test set
Syn10     part. miss.   1.33  0.1   1.33  0.09  0.05  2.64  1.65  1.08  0.16
          compl. miss.  1.23  0.12  1.39  0.16  0.13  2.60  2     1.54  0.13
Syn90     part. miss.   1.34  0.10  1.33  0.09  0.06  2.39  1.80  1.14  0.18
          compl. miss.  1.31  0.12  1.38  0.16  0.12  2.61  1.97  1.51  0.17
To provide a visual representation of these results, Fig. 2 shows the sampled intensity of a real sequence in the Syn dataset and the predicted intensity that is sampled based on the predicted parameters. This figure demonstrates the model’s ability to accurately capture the dynamics of the sequence.
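The parameter-recovery error used in this subsection can be sketched as follows. This is a minimal illustration, assuming the estimated and true values of one parameter are given as two equally long lists (for example, one entry per sequence); the function name rmse and the example numbers are hypothetical.

```python
import math


def rmse(estimated, actual):
    """Root mean squared error between estimated and true parameter values."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values for one parameter across three sequences.
true_values = [1.0, 2.0, 5.0]
estimated_values = [1.0, 2.0, 3.0]
print(rmse(estimated_values, true_values))
```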
5.2. Predicting Future Event Arrival Times
Predicting the arrival times of future events for a given sequence is the most commonly used evaluation method in the related literature. More formally, for a student-assignment pair, the arrival time of future th event after observation window, denoted as , can be computed as the expectation of the sequence intensity w.r.t. time . However, since time is continuous and the intensity functions of Hawkes processes are usually complicated, the analytic form of this expectation is hard to obtain.
Alternatively, in this work, we adopt another popular approach to predict future event arrival times. We first use Ogata's thinning algorithm to sample interarrival times , which is the time difference between th and th events. Then, we compute the predicted time of th event as , where is the trial number for the sampling and is the sampled interarrival time at the th trial. The intuition is that interarrival times are sampled times, then the sample mean of all trials is used as the approximation of the actual interarrival time. In this way, we can recursively sample the arrival times for future events from the last historical observation and the learned intensity function.
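The sampling procedure described above can be sketched as follows. This is only an illustration under a simplifying assumption: it uses a plain univariate Hawkes intensity with a constant base rate mu and an exponential kernel (alpha, beta), not the full SSHP intensity with its external stimuli. The function names and parameter values are hypothetical.

```python
import math
import random


def intensity(t, history, mu, alpha, beta):
    """Exponential-kernel Hawkes intensity at time t given past event times."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in history if s <= t)


def sample_next_event(history, mu, alpha, beta, rng):
    """Draw the next event time after the last event with Ogata's thinning."""
    t = history[-1] if history else 0.0
    while True:
        # Between events this intensity only decays, so its current value
        # is a valid upper bound for all later times until the next event.
        upper = intensity(t, history, mu, alpha, beta)
        t += rng.expovariate(upper)
        # Accept the candidate with probability intensity(t) / upper.
        if rng.random() * upper <= intensity(t, history, mu, alpha, beta):
            return t


def predict_future_times(history, n_future, n_trials, mu, alpha, beta, seed=0):
    """Predict the next n_future event times by averaging sampled gaps."""
    rng = random.Random(seed)
    events = list(history)
    predictions = []
    for _ in range(n_future):
        last = events[-1] if events else 0.0
        gaps = [
            sample_next_event(events, mu, alpha, beta, rng) - last
            for _ in range(n_trials)
        ]
        next_time = last + sum(gaps) / n_trials  # sample mean of interarrival times
        predictions.append(next_time)
        events.append(next_time)  # recurse from the predicted event
    return predictions


# Hypothetical history and parameters, for illustration only.
print(predict_future_times([0.5, 1.2, 1.3], n_future=3, n_trials=200,
                           mu=0.5, alpha=0.8, beta=1.5, seed=7))
```

Each predicted event is appended to the history before sampling the next one, which mirrors the recursive use of the last observation and the learned intensity function.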
In this work, we evaluate the models' performance in predicting the next future activities after the observation window has ended, using the RMSE between the actual and predicted times as our measure. As the number of future activities grows, the task of predicting their arrival times becomes more challenging.
[Figure caption fragment: "... set with 95% confidence interval on Syn and Syn."]
Performance on synthetic datasets. In this section, we present the experiment results for SSHP and baseline approaches on synthetic datasets. Fig. 4 shows the model performances in partially missing test set in both Syn and Syn, while Fig. 4 shows the performances in the completely missing test set. The x-axis represents the future events’ indices. For example, represents the second event in the future after the end of the observation period . The y-axis is the RMSE of time predictions in log scale, for a clearer separation between the models in the figures. Some baselines, such as ERPP and RMTPP, are missing from the lower plots since they cannot predict unobserved sequences (student-assignment sequences in the completely missing test set). We can see that SSHP clearly achieves the smallest RMSE of time predictions compared to the baseline approaches in all settings. Even though the neural models ERPP and RMTPP start to show better performance in later event predictions, they are not able to predict unobserved sequences (i.e., the completely missing test set). As expected, since recovering completely unobserved sequences is more challenging, SSHP’s performance on the partially missing test set is better than its performance on the completely missing test set.
Performance on real-world datasets. Next, we evaluate each model’s performance using the two real-world datasets. It is worth mentioning that the observed history in MORF is the shortest among all datasets, having an average of less than observations per sequence, and observations for training. For this reason, the prediction window is set to be in MORF instead of to achieve meaningful evaluation. The evaluation in partially missing test set and completely missing test set is respectively presented in Fig. 6 and Fig. 6. As shown in the figures, the proposed SSHP model outperforms the baseline approaches, by an especially large margin in Canvas’s completely missing test set. This is consistent with the synthetic dataset results. In contrast, the performances of the neural models ERPP and RMTPP are not as promising as they are in the synthetic datasets, especially in MORF. One possible explanation is that the short training sequences in MORF restrict the ability of neural-based models.
Another observation is that for higher indexed events in MORF’s completely missing set and for lower indexed events in Canvas’s partially missing test set, the confidence intervals of HPLR and SSHP overlap, suggesting a less significant difference between the two models’ performances. However, as the large confidence interval of HPLR suggests, its results are not robust and vary too much across the experiments. A potential explanation for the good predictions of HPLR is that for some student-assignment pairs the activity dynamics are rather invariant, and a constant base rate, as in HPLR, is sufficient to capture them.
In conclusion, SSHP has been shown to have superior time prediction performance on both synthetic and real-world datasets compared with the baseline approaches, especially on the challenging task of predicting the future for the completely missing test set.
5.3. Ablation Study
To verify each component’s importance in the intensity function, we compare SSHP to its variations SSHP, SSHP, SSHP and SSHP, which respectively represent the models obtained by taking out the following components: self-excitement , effect of assignment opening , effect of student habit and effect of deadline .
Fig. 8 and Fig. 8 show the performance of these models in comparison with each other and with SSHP in the partially and completely missing test sets, respectively.
In general, SSHP achieves lower time prediction errors in both real-world datasets, indicating the importance of each individual component. Furthermore, in the partially missing test set as shown in Fig. 8, while the improvement from modeling self-excitement is only marginal compared with SSHP in CANVAS (left figure), self-excitement is shown to be a major factor in MORF (right figure), as SSHP has higher prediction errors in MORF than other variations.
Additionally, as shown in Fig. 8, we can see that the differences between SSHP and its variations are much more distinct in the completely missing test set (i.e., when the history is unobserved). More specifically, in the CANVAS dataset (left figure), we see that SSHP’s error is the highest among all models. This is strong evidence of the deadlines’ effect on student activities in CANVAS, which also suggests the importance of modeling . On the other hand, in MORF, when comparing SSHP and SSHP (right figure), we can see that the effect of student habit is not present at the beginning of the sequence, as the error is lower when this stimulus is not included. However, the importance of including student habits in the model is significant after the second event. Another interesting observation is the wider confidence interval of SSHP. One explanation is that some students are more sensitive to assignment opening than others; therefore, excluding from the equation can cause a higher error for some sequences but not for others. The difference observed in the components’ importance between the two datasets can come from the different nature of the two educational systems and the presented courses. For example, one expects the effect of the deadline to be more prevalent in courses with a high late-submission penalty, compared to the ones with a more flexible scheme.
To conclude, despite the different characteristics that have been unveiled in the two datasets, we can see that all three external stimuli and the self-excitement component are important in modeling student activities.
6. Procrastination Pattern Discovery
In Section 3, we described the intuition behind SSHP’s parameterization. In this section, we analyze these parameters to demonstrate their interpretation and their association with student performance patterns.
6.1. Cluster Analysis
First, we investigate whether the learned parameters can describe students’ behaviors in assignments in a meaningful way that shows their cramming and procrastination behaviors. To do so, we cluster all student-assignment pairs via the K-Means clustering algorithm, representing each student-assignment pair by its learned parameters.
To find the optimal number of clusters, we use the elbow method on the clustering loss. In both CANVAS and MORF datasets, the achieved optimal cluster number is , which means that student-assignment interaction patterns are uncovered in both datasets. Figures 9 and 10 show the parameter values for cluster centers in CANVAS and MORF datasets, respectively. For a clearer presentation, is scaled down by (time unit changes from hours to days) and and are scaled up by in the figures respectively. Error bars show the confidence interval within each cluster.
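The clustering step can be sketched as follows. This is a minimal, dependency-free illustration and not the implementation used in our experiments: it runs Lloyd's K-Means with a deterministic farthest-point initialization (an assumption made here for reproducibility) and picks the elbow as the number of clusters with the largest second difference of the clustering loss, which is one common heuristic for reading an elbow plot. The toy two-dimensional points stand in for the learned parameter vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic init: start at points[0], then add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_inertia(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the final clustering loss (inertia)."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as its cluster mean; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow_k(inertias):
    """Pick k with the largest second difference; inertias[i] is for k = i + 1."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Toy data: three well-separated groups standing in for parameter vectors.
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy)
          for cx, cy in blob_centers
          for dx in (-0.5, 0.5)
          for dy in (-0.5, 0.5)]
inertias = [kmeans_inertia(points, k) for k in range(1, 6)]
print(elbow_k(inertias))
```

With real parameter vectors, features on very different scales would typically be standardized before clustering, since K-Means relies on Euclidean distance.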
Specifically, by comparing CANVAS (cluster in CANVAS dataset) with clusters and in CANVAS, we can see that the interactions between students and assignments in CANVAS are less sensitive to the deadline until much later, when it is too close to the deadline (smaller and larger ). Also, negative in CANVAS indicates late submissions or other assignment-related activities after the deadline. Moreover, the burstiness of the events in this cluster is higher than in other clusters (larger ). One possible explanation is that the students in this cluster procrastinated on the assignments in it and only started to work on them much later than they should have, which explains the bursty and intense activities close to the deadline. Furthermore, we can see that the effect of assignment opening or availability wears off much faster in CANVAS (smaller ), meaning that the period of time during which this cluster is affected by assignment opening is shorter. This suggests that overall, this cluster is less sensitive to the assignment opening. When it comes to student habit, we see that the peak of periodicity shows up at a later time (large ), indicating that the students in CANVAS usually interact with the course later during the day, compared with CANVAS and . On the other hand, even though the differences are smaller when comparing CANVAS and CANVAS clusters, many of them are significant. Particularly, the results clearly show that CANVAS is more sensitive to the deadline in the sense that assignment-related activities are finished much earlier (larger positive and ). Their base activities triggered by student habits are also more intense (higher ) even though their peak time is usually later during the day (larger ). To conclude, the learning pattern in CANVAS suggests procrastination-like behaviors, with less sensitivity to the deadline and the assignment opening, as well as more bursty and intense behaviors.
On the other hand, the learning patterns in CANVAS suggest an “early birds” type of learning behavior, in which assignment-related activities are finished earlier by around days. Also, they tend to be more sensitive to the opening of the assignment, with less bursty behaviors, which can be interpreted as the opposite of procrastination.
Similarly, in MORF (Figure 10), different characteristics are uncovered by the discovered clusters. We can see that the effect of the deadline starts late and ends late (smaller and smaller negative ) for MORF and . Also, student activities on assignments in MORF and are more bursty (larger ). On the other hand, MORF is more sensitive to the effect of assignment opening for a longer period of time (larger ), and student habits also seem to have a stronger effect on MORF , as suggested by larger and . Overall, we can conclude that MORF activity patterns represent the “early birds” type and MORF activity patterns show the most procrastination-like behaviors among the 3 clusters.
By comparing the clusters in which procrastination-like behaviors are suggested across the two datasets, we can see that the parameters reveal different strategies. Specifically, in MORF , less bursty and more delayed submissions are observed (smaller negative and smaller ) than in CANVAS , which can be an indication of procrastination. Another potential explanation for this difference is the different nature of the courses, as mentioned in the ablation study, where the penalty for late submissions can be stronger in CANVAS than in MORF.
6.2. Association with Grades
To show the association between student activity patterns on assignments and their performance in them, we check the student grades on assignments in each cluster. The results are presented as box plots in Fig. 11. As we can see, median grades in CANVAS and MORF are visibly lower than in other clusters in their datasets. Moreover, the grade distributions differ between every two clusters. To see if the differences in grade distributions between clusters are significant, for each of the datasets, we conduct a Kruskal-Wallis test on the grades between any two clusters discovered by SSHP. We find that all the p-values are significantly smaller than , suggesting significant differences in the grade distribution between all clusters. Combining these observations with the conclusions from Figures 9 and 10, we see that clusters that show procrastination behaviors with less sensitivity to the deadlines and assignment openings (CANVAS and MORF ) also have significantly lower grades. We can conclude that clusters with more procrastination-like behaviors are associated with lower grades in both datasets. This demonstrates that SSHP can capture underlying student activity patterns with meaningful parameters that can be used as good indicators of procrastination behaviors and student performance.
7. Conclusion
In this work, we proposed a novel stimuli-sensitive Hawkes process model (SSHP) to represent students’ cramming and procrastination behaviors in online courses, according to their activities. Our model captures three types of external stimuli in addition to the internal stimuli between activities, i.e., the effect of assignment deadline, assignment availability, and students’ personal habits. SSHP models all student-assignment pairs jointly, which enables the model to generate personalized predictions for both partially missing and completely missing activity sequences. Our experiments on both synthetic and real-world datasets demonstrated SSHP’s superior performance compared to the state-of-the-art baseline approaches, especially in the more challenging task of future time prediction for time sequences where the history is completely missing. Our ablation studies on SSHP showed that each component of our model is necessary for achieving its superior performance. Finally, we demonstrated that not only does SSHP excel at future time predictions, but its model parameterization also provides meaningful interpretations and insights into the association between students’ procrastination patterns and their grades. Particularly, we discovered three clusters of behaviors on assignments: one with stronger procrastinating behaviors, with less sensitivity to the deadline and the assignment opening, as well as more bursty and intense behaviors; another one with an “early birds” type of learning behavior, with more sensitivity to deadlines and less bursty behaviors; and a third one in between the two. We showed that grade distributions in these clusters have meaningful differences, with the lowest grades associated with procrastination-like behaviors.
References
 Agnihotri et al. (2020) Lalitha Agnihotri, Ryan S Baker, and Steve Stalzer. 2020. A Procrastination Index for Online Learning Based on Assignment Start Time. In The 13th International Conference on Educational Data Mining.
 Andres et al. (2016) Juan Miguel L Andres, Ryan S Baker, George Siemens, Dragan Gašević, and Catherine A Spann. 2016. Replicating 21 findings on student success in online learning. Technology, Instruction, Cognition, and Learning (2016), 313–333.
 Asarta and Schmidt (2013) Carlos J Asarta and James R Schmidt. 2013. Access patterns of online materials in a blended course. Decision Sciences Journal of Innovative Education 11, 1 (2013), 107–123.
 Bacry et al. (2017) E. Bacry, M. Bompaire, S. Gaïffas, and S. Poulsen. 2017. tick: a Python library for statistical learning, with a particular emphasis on time-dependent modeling. ArXiv e-prints (July 2017). arXiv:1707.03003
 Baker et al. (2016) Rachel Baker, Brent Evans, and Thomas Dee. 2016. A Randomized Experiment Testing the Efficacy of a Scheduling Nudge in a Massive Open Online Course (MOOC). AERA Open 2, 4 (2016).
 Bao (2016) Peng Bao. 2016. Modeling and predicting popularity dynamics via an influence-based self-excited Hawkes process. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 1897–1900.

 Cai et al. (2010) Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization 20, 4 (2010), 1956–1982.
 CanvasNetwork (2016) CanvasNetwork. 2016. Canvas Network Courses, Activities, and Users (4/2014 - 9/2015) Restricted Dataset. https://doi.org/10.7910/DVN/XB2TLU
 Cerezo et al. (2017) Rebeca Cerezo, María Esteban, Miguel Sánchez-Santillán, and José C. Núñez. 2017. Procrastinating Behavior in Computer-Based Learning Environments to Predict Performance: A Case Study in Moodle. Frontiers in Psychology 8 (Aug. 2017).
 Daley and Vere-Jones (2007) Daryl J Daley and David Vere-Jones. 2007. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media.
 Du et al. (2016) Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. 2016. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1555–1564.
 Du et al. (2015) Nan Du, Yichen Wang, Niao He, and Le Song. 2015. Time-sensitive recommendation from recurrent user activities. Advances in Neural Information Processing Systems 2015Janua (2015), 3492–3500.
 Hawkes (1971) Alan G Hawkes. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika 58, 1 (1971), 83–90.

 He et al. (2015) Xinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor, and Yan Liu. 2015. HawkesTopic: A joint model for network inference and topic modeling from text-based cascades. In International conference on machine learning. 871–880.
 Hosseini et al. (2018) Seyed Abbas Hosseini, Ali Khodadadi, Keivan Alizadeh, Ali Arabzadeh, Mehrdad Farajtabar, Hongyuan Zha, and Hamid R Rabiee. 2018. Recurrent poisson factorization for temporal recommendation. IEEE Transactions on Knowledge and Data Engineering 32, 1 (2018), 121–134.
 Kazerouni et al. (2017) Ayaan M. Kazerouni, Stephen H. Edwards, T. Simin Hall, and Clifford A. Shaffer. 2017. DevEventTracker: Tracking Development Events to Assess Incremental Development and Procrastination. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education - ITiCSE ’17. ACM Press, Bologna, Italy, 104–109.
 Kingman (2005) John Frank Charles Kingman. 2005. Poisson processes. Encyclopedia of biostatistics 6 (2005).
 Lee and Choi (2011) Youngju Lee and Jaeho Choi. 2011. A review of online course dropout research: Implications for practice and future research. Educational Technology Research and Development 59, 5 (2011), 593–618.
 Li et al. (2018) Tianbo Li, Pengfei Wei, and Yiping Ke. 2018. Transfer hawkes processes with content information. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 1116–1121.
 Mei and Eisner (2017) Hongyuan Mei and Jason M Eisner. 2017. The neural hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems. 6754–6764.
 Moon and Illingworth (2005) Simon M Moon and Alfred J Illingworth. 2005. Exploring the dynamic nature of procrastination: A latent growth curve analysis of academic procrastination. Personality and Individual Differences 38, 2 (2005), 297–309.
 Nesterov (2013) Yu Nesterov. 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140, 1 (2013), 125–161.
 Ogata (1988) Yosihiko Ogata. 1988. Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical association 83, 401 (1988), 9–27.
 Park et al. (2018) Jihyun Park, Renzhe Yu, Fernando Rodriguez, Rachel Baker, Padhraic Smyth, and Mark Warschauer. 2018. Understanding Student Procrastination via Mixture Models. International Educational Data Mining Society (2018).
 Perrin et al. (2011) Christopher J Perrin, Neal Miller, Alayna T Haberlin, Jonathan W Ivy, James N Meindl, and Nancy A Neef. 2011. Measuring and Reducing College Students’ Procrastination. Journal of applied behavior analysis 44, 3 (2011), 463–474.
 Rizoiu et al. (2017) Marian-Andrei Rizoiu, Lexing Xie, Scott Sanner, Manuel Cebrian, Honglin Yu, and Pascal Van Hentenryck. 2017. Expecting to be HIP: Hawkes intensity processes for social media popularity. In Proceedings of the 26th International Conference on World Wide Web. 735–744.
 Shang and Sun (2018) Jin Shang and Mingxuan Sun. 2018. Local low-rank Hawkes processes for modeling temporal user–item interactions. Knowledge and Information Systems (2018), 1–24.
 Steel (2007) Piers Steel. 2007. The nature of procrastination: A meta-analytic and theoretical review of quintessential self-regulatory failure. Psychological bulletin 133, 1 (2007), 65.

 Xiao et al. (2017) Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and Stephen M Chu. 2017. Modeling the intensity function of point process via recurrent neural networks. In Thirty-first AAAI conference on artificial intelligence.
 Yao et al. (2020) Mengfan Yao, Shaghayegh Sahebi, and Reza Feyzi-Behnagh. 2020. Analyzing Student Procrastination in MOOCs: A Multivariate Hawkes Approach. In The 13th International Conference on Educational Data Mining.