Machine Learning Based Physical Activity Extraction for Unannotated Acceleration Data

UNIVERSITY OF TURKU
Department of Computing
Master of Science in Technology Thesis
Artificial Intelligence
May 2021
Tanja Vähämäki

Supervisors:
Antti Airola
Iman Azimi

The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin OriginalityCheck service.

UNIVERSITY OF TURKU
Department of Computing

TANJA VÄHÄMÄKI: Machine Learning Based Physical Activity Extraction for Unannotated Acceleration Data

Master of Science in Technology Thesis, 87 p.
Artificial Intelligence
May 2021

Sensor-based human activity recognition (HAR) is an emerging and challenging research area. The physical activity of people has been associated with many health benefits and even with reducing the risk of different diseases. It is possible to collect sensor data related to physical activities of people with wearable devices and embedded sensors, for example in smartphones and smart environments. HAR has been successful in recognizing physical activities with machine learning methods. However, it is a critical challenge to annotate sensor data in HAR. Most existing approaches use supervised machine learning methods, which means that true labels need to be given to the data when training a machine learning model. Supervised deep learning methods have outperformed traditional machine learning methods in HAR, but they require an even more extensive amount of data and true labels. In this thesis, machine learning methods are used to develop a solution that can recognize physical activity (e.g., walking and sedentary time) from unannotated acceleration data collected using a wearable accelerometer device. It is shown to be beneficial to collect and annotate data from the physical activity of only one person.
Supervised classifiers can be trained with a small amount of labeled acceleration data, and more training data can be obtained in a semi-supervised setting by leveraging knowledge from available unannotated data. The semi-supervised En-Co-Training method is used with the traditional supervised machine learning methods K-nearest Neighbor and Random Forest. Also, intensities of activities are produced by the cut point analysis of the OMGUI software as reference information and used to increase confidence in correctly selecting the pseudo-labels that are added to the training data. A new metric is suggested to help evaluate reliability when no true labels are available. It calculates the fraction of predictions that have a correct intensity, according to the cut point analysis of the OMGUI software, out of all the predictions. The supervised KNN and RF classifiers reach 88 % accuracy and a C-index value of 0.93, while the accuracy of the K-means clustering remains at 72 % when testing the models on labeled acceleration data. The initial supervised classifiers and the classifiers retrained in a semi-supervised setting are tested on unlabeled data collected from 12 people and measured with the new metric. The overall results improve from 96-98 % to 98-99 %. For activities that are more challenging for the initial classifiers, the results improve from 55-81 % to 67-81 % for taking a walk and from 0-95 % to 95-98 % for jogging. It is shown that the results of the KNN and RF classifiers consistently increase in the semi-supervised setting when tested on unannotated, real-life data of 12 people.

Keywords: human activity recognition, wearable sensors, acceleration data, machine learning, semi-supervised learning, unlabeled data

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Research questions
  1.3 Contributions
  1.4 Thesis structure
2 Background
  2.1 Challenges in HAR
  2.2 Methods used in HAR
    2.2.1 Filtering
    2.2.2 Segmentation
    2.2.3 Feature extraction
    2.2.4 Machine learning algorithms
    2.2.5 Evaluation metrics
3 Related work
  3.1 Machine learning approaches in HAR
    3.1.1 Supervised machine learning in HAR
    3.1.2 Unsupervised machine learning in HAR
    3.1.3 Semi-supervised machine learning in HAR
  3.2 Summary of related work
  3.3 Open questions in HAR
4 Extracting activities from Axivity accelerometer device
  4.1 Data
  4.2 New annotated data
  4.3 Methodology
    4.3.1 Pre-processing of the acceleration data
    4.3.2 Feature extraction
    4.3.3 Applying machine learning algorithms
    4.3.4 Cut point analysis of OMGUI software
5 Experiments
  5.1 Recording and annotating new data
  5.2 Analyzing the new annotated data
    5.2.1 Visualisation of the segments
    5.2.2 Positioning of the Axivity device
  5.3 Finding clusters
    5.3.1 K-means clustering
    5.3.2 Visualisation of clusters
    5.3.3 Performance of K-means clustering
  5.4 Training supervised classifiers
    5.4.1 KNN classification
    5.4.2 Random Forest classification
    5.4.3 The importance of the features
    5.4.4 Reliability of supervised classifiers
  5.5 Improving classifiers in semi-supervised setting
    5.5.1 Reliability of classifiers in semi-supervised setting
6 Discussion
7 Conclusion
References

Abbreviations and Acronyms

AE       Autoencoder
ARI      Adjusted Rand index
C-index  Concordance index
CNN      Convolutional neural network
DBSCAN   Density-based spatial clustering of applications with noise
DNN      Deep neural network
GMM      Gaussian mixture model
GRU      Gated recurrent unit
HAR      Human activity recognition
HIER     Hierarchical agglomerative clustering
KNN      K-nearest neighbor algorithm
LSTM     Long short-term memory
MET      Metabolic equivalent of task
NMI      Normalized mutual information
NN       Neural network
PCA      Principal Component Analysis
RF       Random forest algorithm
RNN      Recurrent neural network
SVM      Support Vector Machine
TCN      Temporal convolutional network

1 Introduction

Sensor-based human activity recognition (HAR) is an emerging and challenging research area. The goal in HAR is to recognize physical activities of people by monitoring their daily lives. It is important to ensure the quality and quantity of physical activity, which has been associated with many health benefits such as maintaining physical fitness and even reducing the risk of different diseases [5]. The embedded sensors in smartphones, wearable devices and smart environments have made the sensor data stream more accessible, and HAR is used in many real-life applications in areas like health management, smart assistive technologies, and human-computer interaction [1].

HAR applications can use the data of wearable devices, such as accelerometers and gyroscopes. The data can be processed by machine learning methods to recognize and analyze physical activities like sitting, walking, and jogging or, for example, activities of daily living such as sleeping and doing domestic tasks [2].

The light, non-invasive and low-cost wearable accelerometer devices, such as the Axivity accelerometer [11], play a significant role in remote health monitoring [16]. They can continuously and remotely monitor physical activities of the users.
The devices collect acceleration values of body movements in three dimensions over time and save the values for the X, Y and Z axes at a defined frequency. For example, the accelerometer can be placed on a thigh and the frequency can be set to 100 Hz. In that case, the device collects acceleration values caused by gravity (9.81 m/s²) and the movements of the thigh a hundred times per second.

Machine learning methods allow extracting information from data. A model is trained based on data using a machine learning algorithm. Machine learning algorithms can learn from the data (and the corresponding true labels of the data) by minimizing the error and maximizing the likelihood of the predictions being true [6]. A good HAR model learns to predict labels, and thus to recognize activities, from new sensor data of the wearable device. The model learns to find patterns in the data related to physical activities performed by the people using the wearable device.

A person's physical activity level can be interpreted from the physical, often regular, activities that people perform in their daily lives. For example, the activities can be grouped into sedentary time and light, moderate and vigorous activity. If a machine learning model predicts activities like sleeping and sitting, it can be assumed that these activities correspond to sedentary time or light activity. If jogging is predicted, the activity level of the person has likely been vigorous [10].

1.1 Motivation

Various studies in the literature have proposed machine learning methods for HAR applications [1,2,15]. However, it is still an attractive and challenging research topic. The existing approaches mostly use supervised machine learning methods that require annotation, which means that true labels need to be given to the data when training a machine learning model.
However, the majority of the sensor data has no labels, and acquiring annotated sensor data of wearable devices is especially challenging in HAR [1]. It is even more challenging to annotate sensor data for long-term HAR applications.

The objective of this thesis is to build a machine learning solution to recognize physical activity (e.g., walking and sedentary time) from unannotated acceleration data collected using an Axivity accelerometer positioned on a thigh. The solutions are tested on real-life acceleration data collected from 12 people who were asked to wear an Axivity accelerometer on a thigh for one week. Although no true labels and no ground truth are available, the performance and the reliability of the new model should be evaluated.

The existing HAR solutions are studied to define the current state of the research related to the task of recognizing physical activities from unannotated sensor data. The characteristics and challenges of HAR and the approaches used to recognize physical activities with different machine learning methods are examined. Approaches that use supervised machine learning with annotated data and unsupervised machine learning with no true labels are studied. Also, semi-supervised learning, which can use both unannotated data and a smaller annotated dataset, is investigated.

1.2 Research questions

This thesis aims to answer the following research questions:

RQ1: Can different activity levels be reliably extracted from an accelerometer device with machine learning using only unlabeled acceleration data?

RQ2: Can machine learning models that are trained with new labeled acceleration data from a single person be used to annotate unlabeled acceleration data reliably?

RQ3: How can both unlabeled and new labeled acceleration data be used together when extracting activities from unlabeled acceleration data?

RQ4: How can information about the performance of the solution be obtained without true labels and the ground truth?
1.3 Contributions

In this thesis, the following contributions are made.

• The current state-of-the-art HAR studies are reviewed and discussed.
• Several types of machine learning solutions are developed based on unsupervised, supervised, and semi-supervised approaches to recognize physical activities in unannotated sensor data of the Axivity accelerometer device.
• New acceleration data of one person is gathered and annotated for one week to acquire annotated data for evaluating the performance of the solutions and to study how to benefit from the new annotated data when developing them.
• The solutions are tested on unlabeled, real-life data collected from 12 participants of the study.

1.4 Thesis structure

The rest of the thesis is organized as follows: Chapter 2 describes challenges and commonly used methods and metrics in HAR. Chapter 3 introduces related work in HAR. Chapter 4 describes the solutions that are developed to extract activities from the Axivity accelerometer device. Chapter 5 explains experiments with the new solutions. In Chapter 6, the results are discussed, and in Chapter 7 conclusions of the study are made.

2 Background

2.1 Challenges in HAR

Machine learning methods have been successfully used in HAR in areas like healthcare and wellness [5]. However, the existing approaches in HAR mostly use supervised machine learning methods that require annotated data, while the majority of the sensor data has no labels. The annotation of the ground truth is a critical challenge for HAR and may not always be feasible [2]. The number of sensor data records is usually huge. If the sampling rate is for example 100 Hz, the number of records is 360 000 for an hour. It is time consuming to label the records and difficult to remember the activities performed at a specific time. It is especially challenging to assign a correct label for short periods or at the boundary of consecutive activities [8].
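The record count quoted above follows directly from the sampling rate. The short calculation below is only an illustration: the constant names are chosen here, and the one-week figure assumes continuous recording at the same 100 Hz example rate.

```python
SAMPLING_RATE_HZ = 100        # example sampling rate used in the text
SECONDS_PER_HOUR = 60 * 60

# One record (an X, Y, Z triple) is stored per sampling instant.
records_per_hour = SAMPLING_RATE_HZ * SECONDS_PER_HOUR

# Assumption for illustration: one week of continuous recording at the same rate.
records_per_week = records_per_hour * 24 * 7

print(records_per_hour)  # 360000
print(records_per_week)  # 60480000
```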
An alternative is to use camera-based methods to monitor individuals’ physical activities. However, these methods are privacy-invasive and thus not suitable [13].

Other challenges in HAR are intraclass variability, interclass similarity and class imbalance. The data captured for the same activity from different users of the device may differ, for example because of gender or age, and the data related to different activities may be similar, for example for jogging and running. The durations of various activities may differ and cause class imbalance. There are also heterogeneities across the sensing devices and device positioning [2].

In addition, segmenting a continuous data stream and preserving complete activities is difficult. It is challenging to find the precise start and end time of the activities that are not clearly separated by a predefined posture or pause [1].

2.2 Methods used in HAR

2.2.1 Filtering

The sensor data that has been collected with wearable devices is usually preprocessed with filtering methods because the raw sensor data is scattered and noisy. In signal processing, a filter is a device or process that removes unwanted parts of the signal such as random noise or components lying within a certain frequency range [20]. The useful signal for HAR usually lies in low frequencies, while noise and random dithering usually lie in high frequencies [23]. For example, Butterworth low-pass filtering is used to keep the frequencies that are important for recognizing human physical activities and to discard higher frequencies [20].

2.2.2 Segmentation

To associate a sensor data stream of wearable devices with physical activities, the sensor data needs to be divided into smaller segments of the signals. Each segment can then be labeled and recognized as one physical activity. The sliding window approach is the most widely used segmentation method in HAR because of its simplicity and lack of preprocessing.
In this approach, a window with a fixed size and a fixed shift slides over the signal data with no inter-window gaps. There may be an overlap between adjacent windows to handle transitions of activities more accurately [19].

Window lengths from 0.08 seconds to 30 seconds are commonly used in HAR [16]. The size of the window is often considered to be a tradeoff between recognition speed and accuracy, where small windows allow faster activity recognition and large windows are beneficial for recognizing complex activities. However, very small window lengths may be effective in recognizing activities and should be considered especially in cases when speed is prioritized over the best possible accuracy [19].

There are also activity-defined and event-defined window approaches used in HAR, but they require pre-processing of the sensor data and often laboratory settings. For example, in the activity-defined approach, activity changes in the sensor data are detected with methods like analyzing variations of the features or asking for feedback from users. In the event-defined approach, specific events are located and used, for example with gait analysis detecting heel strikes and toe-offs or with external mechanisms like human supervision [19].

2.2.3 Feature extraction

In traditional machine learning in HAR, the features are manually extracted from the segments of the sensor data. They may include statistical features, such as mean, variance and entropy. The features may be extracted in the time domain, where the data is represented with respect to time, or in the frequency domain, where the data has been transformed into values corresponding to frequencies using for example the fast Fourier transform [21], the discrete cosine transform [22] or the wavelet transform [23]. The advantage of these features is that they can be derived from the signal easily and have been effective in HAR systems [1].
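The sliding window segmentation and the time-domain statistical features described above can be sketched as follows. This is a minimal illustration and not the implementation used in this thesis: the function names, the single-axis toy signal, the window size of four samples and the 50 % overlap are all assumptions chosen for the example.

```python
from statistics import mean, pvariance


def sliding_windows(samples, window_size, step):
    """Yield fixed-size windows; a step smaller than window_size gives overlap."""
    for start in range(0, len(samples) - window_size + 1, step):
        yield samples[start:start + window_size]


def time_domain_features(window):
    """Return (mean, variance) of one window of single-axis acceleration values."""
    return mean(window), pvariance(window)


# Hypothetical single-axis signal: a still period, a movement period, then still again.
signal = [0.0, 0.1, 0.0, 0.1, 1.0, 1.2, 0.9, 1.1, 0.0, 0.1]

# Windows of 4 samples shifted by 2 samples, i.e. 50 % overlap and no gaps.
windows = list(sliding_windows(signal, window_size=4, step=2))
features = [time_domain_features(w) for w in windows]
```

Each feature tuple would then serve as one data instance for a traditional classifier or clustering method. Frequency-domain features would require an additional transform step that is not shown here.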
However, manual feature extraction depends on human knowledge of the domain, which restricts extending the models to other domains [2].

The development of deep neural network (DNN) architectures has allowed learning the features directly from the segments of the raw sensor data without the need to extract the features manually [3]. A DNN has an input layer, many hidden layers, and an output layer. The input layer receives the input data, the hidden layers extract patterns within the data, and the output layer produces the results. The layers of a DNN can progressively extract higher-level features from the raw input data. However, training DNN models requires large volumes of labeled data to get reliable results on new data and to avoid overfitting on the training data. They also need high computational capacity, because they are complex compared to traditional shallow machine learning methods [6].

2.2.4 Machine learning algorithms

Machine learning algorithms that have successfully been used in sensor-based HAR are introduced in this section. They can be defined as supervised, unsupervised, or semi-supervised methods. In supervised methods, true labels are needed. Unsupervised methods can be applied to unlabeled data. In semi-supervised methods, both unsupervised machine learning with unannotated data and supervised machine learning with a smaller annotated dataset are used [12]. Semi-supervised methods aim to reduce the need to annotate sensor data and still train models that can make predictions more accurately than unsupervised learning.

Deep learning methods are also machine learning methods and can be used in unsupervised, supervised, and semi-supervised machine learning. Deep learning methods work well on unstructured data and have achieved higher accuracy than traditional machine learning methods. However, most deep learning methods used in HAR are supervised methods.
They need an extensive amount of data to avoid overfitting, and acquiring a large volume of labeled data is a challenge in HAR [6].

2.2.4.1 Supervised machine learning algorithms

In supervised machine learning, true labels of the training data set are available. A supervised machine learning algorithm is applied to the training data to make predictions by minimizing the error between the predicted and true labels. The model learns to find patterns in the training data related to the given labels and in this way learns to predict labels for new data [12].

2.2.4.1.1 K-nearest neighbor

The K-nearest neighbor algorithm (KNN) is an instance-based learning algorithm that predicts labels directly from the data instances in the training data, where the labels of the training data instances are known. The idea is that similar data instances should have similar labels, and similarity can be determined with a distance between the instances [9].

In KNN the data instances are represented in a multi-dimensional space where each feature extracted from the data corresponds to one dimension. The parameter k (the number of neighbors) is chosen. When the model predicts a label for a new data instance, KNN searches for the k training data instances that are nearest to the new one. The predicted label is based on majority voting between the labels of the found instances. The parameter k tunes the complexity of the model, and the distance can be determined by using any distance metric such as the Euclidean distance [9].

KNN is a simple algorithm to implement, and it can learn complex nonlinear functions. KNN has reached good accuracy in many domains. However, it is computationally and memory intensive, and irrelevant features may decrease the accuracy of the model because all features contribute equally to the distance [9].

2.2.4.1.2 Random forest

The random forest classifier (RF) is an ensemble of decision tree classifiers, illustrated in Figure 1. A decision tree is a hierarchical flow chart algorithm.
It uses branches of a tree to describe every possible decision based on the attribute values in the training data. The tree consists of decision nodes that represent the attributes, branches that represent decisions based on the value of the attribute, and leaf nodes that represent the labels. Every branch of the tree ends in a leaf node, and the leaf node of the selected branch is the predicted label [7].

In RF, multiple decision tree classifiers are trained simultaneously, and each of them independently predicts labels for the data instances. The idea is that combining independent decision trees increases the stability of the model by reducing the variance of the results. The model is less likely to predict a label incorrectly than a single decision tree. An ensemble of weak classifiers results in a strong classifier [7].

The most commonly used parameters for an RF classifier are the number of trees and the maximum depth of the trees. The training data is first randomly divided into subsamples. Features are also randomly selected for the selected number of trees. A decision tree is then formulated from each subsample. The prediction of the label for a new data instance is based on majority voting between the decision trees. The idea behind randomly selecting subsamples and features is to reduce the correlation between the decision tree classifiers in the ensemble, helping them to predict labels more independently from each other [7].

Figure 1 Random Forest classifier

RF works well with nonlinear data and has a low risk of overfitting. It has also achieved good accuracy. RF is quite slow to train, but it is fast when making predictions [7].

2.2.4.1.3 DNN architectures

In a fully connected DNN, the network consists of fully connected layers: an input layer, many hidden layers, and an output layer. Each successive layer takes the output of the previous layer and feeds the result to the next layer.
The result is calculated as a dot product of the input values of the neurons of the layer and the weights assigned to the neurons [12]. Each layer extracts features from the previous layer, gradually increasing the abstraction level of the features. The network optimizes the result by iteratively calculating the error of the predictions and recalculating the weights of the neurons with an error backpropagation algorithm [14]. A fully connected DNN is illustrated in Figure 2.

Figure 2 A fully connected neural network with three hidden layers

A convolutional neural network (CNN) processes a volume of activations rather than vectors and produces feature maps. The activations of the neurons use convolution operations that extract features to the next layer. In a convolution operation, a convolution unit is shifted step by step across the input values using a weight vector (or a filter), resulting in inputs to the units of the next layer [14]. The CNN also has subsampling layers (or max-pooling layers) that reduce the size of the feature maps. CNNs can model temporal dependencies in the data when gradually extracting more high-level features from the previous layers to the next ones [12].

A temporal convolutional network (TCN) is a CNN developed for sequential data. TCNs use dilated convolutions that, like the convolutions in CNNs, can only use present and past inputs, but can take a sequence of any length in the previous layer and map it to an output sequence of the same length. In this way, an output can represent a wider range of inputs and TCNs can have long effective history sizes [44].

Recurrent neural networks (RNN) can include cycles, unlike DNNs and CNNs, which are feedforward networks. In RNNs, the output depends on both present and past inputs. They can create and process memories of the temporal sequences of the data and mix both sequential and parallel information [14].
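The dependence of an RNN output on both present and past inputs can be sketched with a single recurrent unit. This is a minimal illustration under stated assumptions: the scalar weights, the tanh activation, the zero initial state and the function name are chosen here for clarity and do not correspond to any specific architecture cited in this thesis.

```python
import math


def rnn_outputs(inputs, w_in, w_rec, bias=0.0):
    """Run one scalar recurrent unit over a sequence and return all hidden states."""
    hidden = 0.0  # assumed initial state
    outputs = []
    for x in inputs:
        # The new state mixes the present input with the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        outputs.append(hidden)
    return outputs


sequence = [1.0, 0.0, 1.0]

# With no recurrent weight the unit behaves like a feedforward unit:
# the same input always gives the same output.
feedforward = rnn_outputs(sequence, w_in=1.0, w_rec=0.0)

# With a recurrent weight the same input value yields different outputs,
# because the state carries information about earlier inputs.
recurrent = rnn_outputs(sequence, w_in=1.0, w_rec=0.5)
```

In the recurrent case, the first and third outputs differ although the inputs at those steps are identical, which is the memory effect described above.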
The RNN architectures with long short-term memory units (LSTM) or gated recurrent units (GRU) can keep track of internal states that represent the memory of the network. They improve the learning of long time-scale temporal dependencies of the sequences and help the system to model more complex patterns [1].

Bi-directional RNNs can be used when both past and future content of the sequences of the data are known in advance. The bi-directional RNN processes the sequences from start to end and from end to start and makes predictions from their combined outputs. RNNs can also be stacked to create deep RNNs [14].

Attention models have been developed to alleviate the difficulties RNNs have in learning from long input sequences. They can selectively access the most important parts of the input sequences based on the current contexts instead of accessing the input sequences through fixed-size vectors [55].

DNN architectures can learn complex nonlinear functions and have outperformed traditional machine learning methods in accuracy. However, they have significant computational complexity and require large volumes of data to avoid overfitting when training the models.

2.2.4.2 Unsupervised machine learning algorithms

In unsupervised machine learning, there are no true labels associated with the training data. The aim is to draw inferences from the data and to model the underlying structure and the distribution [12]. It is assumed that certain patterns occur more often than others related to the output values to be predicted [13]. When hidden patterns are found in the groups of the training data, groups of similar physical activities may have been identified [32].

2.2.4.2.1 K-means clustering

Centroid-based K-means clustering aims to identify clusters of similar data instances. The number of clusters must be defined with a parameter k.
The centers of the clusters are first randomly initialized, and each data instance is assigned to the cluster whose center is closest to it. Then new centers of the clusters are computed as the mean vector of the assigned data instances. These two steps are repeated until the centers of the clusters do not change anymore. As with KNN, different distance measures can be used, most commonly the Euclidean distance [9].

K-means clustering is fast, and it has achieved good accuracy in many domains. However, K-means clustering is sensitive to the initial positions of the centers of the clusters, and it may fail if they are badly initialized. Also, the number of clusters has to be pre-specified, which may be challenging [9].

2.2.4.2.2 DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm. The idea is that high data density corresponds to clusters. The given parameters are the initial size of the neighborhood area and the number of data instances that should be in the area. DBSCAN starts by finding an area according to the parameters. The neighborhood area is expanded as long as the density criterion is satisfied. The area forms a cluster that is removed from the data set. These two steps are repeated until suitable areas cannot be found any more [9].

In DBSCAN the number of clusters does not need to be specified. DBSCAN is efficient, and it has also shown good accuracy. It can find clusters of arbitrary shape, but it is not effective when clusters have varying densities [9].

2.2.4.2.3 Hierarchical agglomerative clustering

Hierarchical agglomerative clustering (HIER) builds clusters hierarchically, first considering each data instance as a separate cluster. The two clusters that are closest to each other are joined together. This step is repeated until a suitable number of clusters, given with a parameter k, are formed.
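The agglomerative procedure can be sketched in a few lines, here using the Euclidean distance between cluster centroids as the similarity measure. The function names, the toy two-dimensional points and the choice of k = 2 are assumptions made for illustration only, and the brute-force search over all cluster pairs is chosen for readability rather than efficiency.

```python
import math


def centroid(cluster):
    """Mean vector of the points in a cluster, returned as a tuple."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # each data instance starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical feature vectors: three points near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
clusters = agglomerative(points, k=2)
```

Recording the order of the merges instead of stopping at k clusters would give the information needed to draw a dendrogram.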
Similarity of clusters can be calculated, for example, with the Euclidean distance between the centroids or mean value vectors of the clusters [9].

In HIER the number of clusters does not have to be fixed in advance. The output of HIER is a dendrogram where the hierarchical relationship of the clusters can be visualized. It is also possible to choose suitable clusters by merging subclusters [9].

2.2.4.2.4 Gaussian mixture model

A Gaussian mixture model (GMM) is a probabilistic clustering algorithm. GMM optimizes the fit between the data and a parametric distribution, such as a Gaussian or Poisson distribution, for each cluster. The data is modeled by a mixture of the distributions. The optimal values of the parameters (a mean, a variance, and a prior probability) are calculated for each distribution by maximizing the likelihood of the data with respect to the model parameters. GMM is a soft clustering method where data instances are not associated with only one cluster, but probabilities of belonging to different clusters are calculated for each data instance [38].

2.2.4.2.5 Principal Component Analysis

In Principal Component Analysis (PCA) the dimensionality of data is reduced while trying to retain most of the variation in the data. PCA identifies orthogonal directions, called principal components, along which the variation of the data is maximized. It projects the features of the data instances onto these principal components, forming new features that are linear combinations of the original ones. The original features of the data are compressed to fewer features while preserving as much variance as possible [9].

PCA is a linear method that is suitable for reducing the number of features and for visualizing data in two or three dimensions [9].

2.2.4.2.6 Deep learning autoencoders

Autoencoders (AE) are an unsupervised neural network (NN) technique that can learn compressed knowledge representations of input data. They are a nonlinear generalization of PCA.
The task of the AEs is to reconstruct the input data by minimizing the reconstruction error, and in this way to find structure in the data. First, an encoder encodes the input data into a latent state representation, and then a decoder reconstructs the input data from the representation. The aim is to learn a generalizable way to encode and decode data, not just to memorize the input values [31].

In a bottleneck AE architecture, hidden layers have fewer nodes than the input layer, forming a bottleneck that forces the network to learn compressed latent state representations of the data [31]. The AE with a bottleneck architecture is illustrated in Figure 3.

Figure 3 An autoencoder with a bottleneck architecture

In a sparse AE architecture, the number of nodes in the hidden layers is not reduced, but only a small number of nodes are activated to learn compressed latent state representations. This is done with a loss function that penalizes activations within hidden layers. Because the activations depend on the input data, different input values activate different nodes through the network [31]. The sparse AE architecture is illustrated in Figure 4.

Figure 4 A sparse autoencoder with a restricted number of nodes activated

Denoising and contractive AEs aim to learn representations that are robust against noise. In the denoising AE, the input data is slightly corrupted, and the target output is maintained as the original input data. In the contractive AE, a loss function penalizes large derivatives of hidden layer activations with respect to the input data. In this way small changes in the input data maintain similar encoded values and contract a neighborhood of the input values into a smaller neighborhood in the output values [31].

Variational AEs use a probabilistic way to describe values in latent state representations.
Instead of giving single values to the attributes in the representation vector, the variational encoder describes a probability distribution for each latent attribute. The encoder builds two output vectors, one describing the mean and the other the variance of the latent state distributions. A vector for the decoder is generated by randomly sampling from each latent state distribution. A loss function simultaneously penalizes the reconstruction error and encourages the learned distributions to resemble the true distribution. The result is a smooth latent space representation where the outputs are ranges of possible values instead of single values [31].

2.2.4.3 Semi-supervised machine learning methods

In semi-supervised techniques a large amount of unannotated data is used on top of limited annotated data. The idea in semi-supervised learning is that useful information in the unannotated data can be leveraged to learn more effectively from a small set of annotated data [4].

2.2.4.3.1 Self-learning method

Self-learning iteratively uses a supervised machine learning method. A supervised classifier is first trained on a small amount of annotated data, and the classifier is then used to predict pseudo-labels for some or all of the unannotated data. Typically, pseudo-labels are given to the most confident predictions. The data with pseudo-labels can then be used together with the annotated data to retrain the classifier, and the self-learning procedure is repeated [41]. The challenge in this approach is that the initial model trained with limited annotated data needs to be good [4].

2.2.4.3.2 Co-learning method

Co-learning follows the procedure of self-learning while simultaneously augmenting the training process with an additional source of information. For example, two separately trained classifiers can teach one another by augmenting each other’s training sets with the most confident predictions. The classifiers are retrained, and the process is repeated.
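
The co-learning loop described above can be sketched in a few lines of Python. The sketch is illustrative only: the nearest-centroid classifiers, the margin-based confidence, the two toy feature views, and all names are assumptions and do not come from the cited works or from the solution of this thesis.

```python
import math


def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_with_confidence(model, x):
    """Return (label, confidence); the confidence is the distance margin
    between the two nearest class centroids (an assumed, simple choice)."""
    dists = sorted((euclid(c, x), label) for label, c in model.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def co_train(lab_a, lab_b, labels, pool_a, pool_b, rounds=3, per_round=1):
    """Two classifiers on different feature views teach one another."""
    train_a, y_a = list(lab_a), list(labels)
    train_b, y_b = list(lab_b), list(labels)
    pool = list(range(len(pool_a)))  # indices of still-unlabeled instances
    for _ in range(rounds):
        if not pool:
            break
        clf_a, clf_b = fit_centroids(train_a, y_a), fit_centroids(train_b, y_b)
        # Each classifier selects its most confident unlabeled instances ...
        pick_a = sorted(pool, key=lambda i: -predict_with_confidence(clf_a, pool_a[i])[1])[:per_round]
        pick_b = sorted(pool, key=lambda i: -predict_with_confidence(clf_b, pool_b[i])[1])[:per_round]
        # ... and adds them with pseudo-labels to the OTHER classifier's training set.
        for i in pick_a:
            train_b.append(pool_b[i])
            y_b.append(predict_with_confidence(clf_a, pool_a[i])[0])
        for i in pick_b:
            train_a.append(pool_a[i])
            y_a.append(predict_with_confidence(clf_b, pool_b[i])[0])
        pool = [i for i in pool if i not in pick_a and i not in pick_b]
    return fit_centroids(train_a, y_a), fit_centroids(train_b, y_b)


# Hypothetical data: view A and view B are features from two different sensors.
clf_a, clf_b = co_train(
    lab_a=[(0.1,), (1.0,)], lab_b=[(0.0,), (2.0,)], labels=["still", "walk"],
    pool_a=[(0.15,), (0.9,), (0.2,), (1.1,)],
    pool_b=[(0.1,), (1.9,), (0.3,), (2.2,)],
)
```

Self-learning corresponds to the special case with a single classifier that adds its own most confident pseudo-labels to its own training set.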
In this method, it is assumed that the two separate training sets are sufficient to train the classifiers to make reliable predictions. Also, one classifier’s high confidence data instances need to be independent and identically distributed for the other classifier [41].

2.2.4.3.3 En-Co-Training and democratic co-learning methods

En-Co-Training is like self-learning, but the consensus of the classifiers determines the confidence of the predictions. Confident predictions are added to a common training set, and the classifiers are retrained on it. En-Co-Training uses majority voting to make the predictions. In democratic co-learning majority voting is used to make predictions, and then, for example, the most confident labeled samples are added to the separate training sets of the classifiers that disagreed with the majority. In En-Co-Training and democratic co-learning the classifiers can be trained on the same data, unlike in co-learning. They rely on the difference between the classifiers instead of different feature sets [35].

2.2.4.3.4 Deep semi-supervised methods

Another approach in semi-supervised machine learning is to try to learn class boundaries that are smooth, for example with consistency-based methods like denoising AEs. The intuition is that the data should be in the right representation exhibiting clustering, where the classes correspond to the clusters. However, because consistency-based methods encourage smooth class boundaries, they may not promote the clustering that would be needed when very few labels are available [4].

A ladder network simultaneously trains an AE on unlabeled data and an NN on labeled data. The ladder network consists of a noisy feed forward path (an encoder), a decoder, and a clean feed forward path. The noisy feed forward path and the clean feed forward path share the same mapping function, and the decoder has cost functions on each layer minimizing the difference between the mappings of the noisy and the clean feed forward paths.
The output of the noisy feed forward path is also trained with labeled data [47].

Semi-supervised approaches that incorporate pairwise similarity information about different data instances may be used to separate classes more explicitly. For example, Siamese NNs and Triplet networks learn representations from similar/dissimilar pairs [4]. Siamese NNs include dual branches and shared weights between pairs of data instances. They process input pairs and learn pairs of representations, the distances of which can be used to describe the semantic similarity of the pairs [1].

2.2.5 Evaluation metrics

The metrics used to evaluate the performance of the solution of this thesis and the metrics often used to evaluate solutions in HAR are described in this chapter.

2.2.5.1 Metrics used in supervised machine learning

2.2.5.1.1 Accuracy

Accuracy is the fraction of correct predictions out of all the predictions of the model. It can also be defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) with equation 1 below.

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

Accuracy is an often-used metric, but it is very sensitive to class imbalance [9].

2.2.5.1.2 F1-score

F1-score is a balanced combination of precision and recall and can be calculated with equation 2.

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \text{when} \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]

Precision indicates the proportion of true predictions among the data instances that have been predicted to belong to the category. Recall, which is also called true positive rate or sensitivity, is the proportion of true predictions among the data instances that belong to the category. It shows how well the correct categories have been found. The values of F1-score vary between 0 and 1. Values close to 1 indicate particularly good precision and recall. F1-score is more robust to class imbalance than accuracy [9].

2.2.5.1.3 Sensitivity

Sensitivity is also called a true positive rate.
It shows the fraction of correct positive predictions out of the data instances that belong to the predicted category. It is also called recall and can be calculated with the equation of recall shown above.

2.2.5.1.4 Specificity

Specificity is also called a true negative rate. It is the fraction of correct negative predictions out of the data instances that do not belong to the predicted category. It can be calculated with equation 3 [36].

\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3} \]

2.2.5.1.5 Concordance index

Concordance index (C-index) calculates how many times the order of the predictions of pairs was correct out of all possible pairs. C-index is a suitable metric to measure the performance of the model on data where the labels can be interpreted as an ordinal scale of increasing activity levels. The value 0,5 represents a random prediction and the value 1 corresponds to the best model prediction. C-index can be calculated with equation 4 below [34].

\[ C = \frac{\sum_{i,j} \mathbb{1}(y_i < y_j)\, \mathbb{1}(\eta_i < \eta_j)}{\sum_{i,j} \mathbb{1}(y_i < y_j)}, \quad \text{where} \tag{4} \]

\(\eta_i\) is the risk score (the predicted value) of a unit i and \(y_i\) is the true label of the unit.

2.2.5.2 Metrics used in unsupervised machine learning

Some of the metrics that are used in unsupervised machine learning, such as the Silhouette coefficient, can be calculated without access to true labels and the ground truth of the true clusters. However, many of them require true labels, making them useless on data that has no labels. For example, to calculate clustering accuracy, the Adjusted Rand Index (ARI), or Normalized mutual information (NMI), at least some labels are needed to represent the ground truth of the true clusters.

2.2.5.2.1 Silhouette coefficient

The quality of the clustering can be measured, for example, with the Silhouette coefficient calculated with the following equation 5.

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \quad \text{where} \tag{5} \]

a(i) is an average distance between i:th data instance and instances in the same cluster and b(i) is an average distance between i:th data instance and instances in the other clusters.
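
As a worked illustration, the mean Silhouette coefficient can be computed with the short Python sketch below, using s(i) = (b(i) - a(i)) / max(a(i), b(i)). The sketch follows the common convention in which b(i) is the average distance to the instances of the nearest other cluster and a singleton cluster receives the score 0; these conventions, the function names, and the toy data are assumptions made for illustration.

```python
import math


def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels):
    """Mean silhouette coefficient over all data instances."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [euclid(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within the own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(euclid(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two compact groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

With the toy data the clustering that follows the two groups gets a clearly positive score and the clustering that cuts across them gets a negative one.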
In well-formed clusters the data instances are close to the instances in the same cluster and far from those of other clusters. The values of the Silhouette coefficient vary between -1 and 1, values close to 1 meaning particularly good clusters [9].

2.2.5.2.2 Clustering accuracy

Clustering accuracy is a classification accuracy for unsupervised learning. It uses a mapping function to find the best mapping between the clusters found by the clustering algorithm and the true clusters. This is needed because the algorithm may use different labels from the true labels to represent the same cluster. Clustering accuracy is calculated with equation 6 [39].

\[ \mathrm{ACC} = \max_{m} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{ y_i = m(c_i) \}, \quad \text{where} \tag{6} \]

m is a mapping function, \(y_i\) is the true cluster of the i:th data instance, \(c_i\) is the cluster found for it by the clustering algorithm, and n is the number of data instances.

2.2.5.2.3 Adjusted Rand Index

ARI is computed to evaluate the similarity between the clusters found by the clustering algorithm and the true clusters given in the annotation. ARI computes the similarity measure between the clusters by considering all pairs of data instances and counting the pairs that are assigned to the same or to different clusters. It can be calculated with the following equation 7.

\[ \mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{n_i}{2} \sum_{j} \binom{n_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{n_i}{2} + \sum_{j} \binom{n_j}{2} \right] - \left[ \sum_{i} \binom{n_i}{2} \sum_{j} \binom{n_j}{2} \right] / \binom{n}{2}}, \quad \text{where} \tag{7} \]

\(n_{ij}\) is the number of instances in cluster i formed by the clustering algorithm and the true cluster j, \(n_i\) is the number of instances in the cluster i formed by the clustering algorithm, \(n_j\) is the number of instances in the true cluster j, and n is the number of instances. A value close to 0 means random labeling and the value 1 means that the clusters are identical [1].

2.2.5.2.4 Normalized mutual information

NMI measures the mutual information between the cluster assignments and the true clusters, and it is normalized by the average of the entropies of the two. It can be calculated with equation 8 below.

\[ \mathrm{NMI} = \frac{2 \sum_{i,j} n_{ij} \log \frac{n \, n_{ij}}{n_i n_j}}{- \sum_{i} n_i \log \frac{n_i}{n} - \sum_{j} n_j \log \frac{n_j}{n}}, \quad \text{where} \tag{8} \]

\(n_{ij}\) is the number of data instances in cluster i formed by the clustering algorithm and the true cluster j, n is the number of instances, \(n_i\) is the number of instances in the cluster i formed by the clustering algorithm, and \(n_j\) is the number of instances in the true cluster j. The value 0 indicates that the clusters found by the clustering algorithm and the true clusters are totally different, and the value 1 that they are identical [1].

3 Related work

3.1 Machine learning approaches in HAR

Works that have successfully used machine learning methods in supervised, unsupervised, or semi-supervised approaches in sensor based HAR are introduced in the following chapters. The aim is to study the state of the research, especially related to the task of recognizing physical activities from unannotated acceleration data collected with the Axivity accelerometer positioned on a thigh. Possibilities to use a small, annotated dataset of one person are also studied. The summaries of supervised, unsupervised, and semi-supervised approaches in HAR with information about the used sensors, the types of activities to be recognized, the applied methods, and the used metrics are shown in Tables 1-3 at the end of each corresponding section.

3.1.1 Supervised machine learning in HAR

Traditional supervised machine learning methods that use manual feature extraction have been successful in recognizing human activities from sensor data of wearable devices [15]. However, because of their superior performance compared to traditional machine learning, there has been a shift towards deep machine learning methods in HAR. CNNs, RNNs, and combinations of CNNs and RNNs have been effective in modelling temporal dependencies inherent in sequences captured with sensors of wearable devices [3]. Also, an attention-based framework has been proposed for HAR recently [55].
3.1.1.1 Supervised traditional machine learning approaches

A work [50] investigated decision tables, the instance-based learning method IBL and nearest neighbor, C4.5 decision tree, and Naïve Bayes on datasets annotated by 20 persons to recognize 20 daily activities in real-life situations. The persons were wearing five wire-free bi-axial accelerometers and were asked to perform given tasks outside the laboratory setting. Mean, energy, frequency-domain entropy, and correlation features were extracted from segments of 6,7 seconds with 50 % overlap. The C4.5 decision tree showed the best accuracy (84 %) and nearest neighbor the second-best accuracy (83 %). When using only two accelerometers, on thigh and wrist or on hip and wrist, the accuracy decreased only slightly. The accelerometer placed on a thigh was the most powerful in recognizing the activities. It was shown to be possible to recognize daily activities with pre-trained classifiers in real-life situations. Some activities appeared to require user-specific data to be recognized accurately.

In a work [51] the effectiveness of decision tables, C4.5 decision tree, KNN, Support Vector Machine (SVM), and Naïve Bayes as well as the meta-level methods boosting, bagging, plurality voting, and stacking was studied on data collected from two persons with an accelerometer worn near the pelvic region to recognize standing, walking, running, climbing up the stairs, climbing down the stairs, sit-ups, vacuuming, and brushing teeth. The activities were annotated with the help of a stopwatch. Mean, standard deviation, frequency-domain energy, and correlation were extracted from segments of 5,12 seconds with 50 % overlap. Plurality voting outperformed the other classifiers with an accuracy of over 90 %.
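
The segmentation step that the works above share, fixed-length segments with 50 % overlap followed by feature extraction, can be sketched in Python as follows. The 50 Hz sampling rate, the toy single-axis signal, and the function names are assumptions made for illustration; only the segment length of 5,12 seconds, the 50 % overlap, and the mean and standard deviation features are taken from the text.

```python
import statistics


def sliding_windows(signal, window_len, overlap=0.5):
    """Yield consecutive windows of window_len samples with the given overlap."""
    step = max(1, int(window_len * (1 - overlap)))
    for start in range(0, len(signal) - window_len + 1, step):
        yield signal[start:start + window_len]


def window_features(window):
    """Two of the time-domain features named above: mean and standard deviation."""
    return statistics.mean(window), statistics.pstdev(window)


SAMPLING_HZ = 50                         # assumed sampling rate
WINDOW_LEN = round(5.12 * SAMPLING_HZ)   # 5.12 s corresponds to 256 samples at 50 Hz

# Hypothetical single-axis acceleration signal: 20 s of a repeating toy pattern.
signal = [(i % 10) / 10 for i in range(20 * SAMPLING_HZ)]
features = [window_features(w) for w in sliding_windows(signal, WINDOW_LEN)]
```

Each element of `features` is one feature vector that a classifier such as KNN or a decision tree would receive as input; samples at the end of the signal that do not fill a complete window are dropped in this sketch.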
In a study [49], C4.5 decision tree, KNN, Naïve Bayes, and Bayes Net were compared in terms of accuracy and computational complexity to build an online system to recognize sitting, standing, walking, ascending stairs, descending stairs, and running. Bi-axial acceleration data and light data were collected from six persons, who were wearing a sport watch on various body positions and performing given tasks for 45-50 minutes. The following time domain features were extracted from segments of 4 seconds: empirical mean Y axis, root mean square, standard deviation, variance, mean absolute deviation, cumulative histogram, n’th percentile, interquartile range, zero crossing rate, mean crossing rate, and squared length of X,Y. The C4.5 decision tree was chosen, because it achieved the best balance between accuracy (87 %) and computational complexity. Climbing stairs was difficult to distinguish from walking.

A work [48] compared the decision tree methods CART and ID3, the adaptive neuro-fuzzy inference system ANFIS, Nearest Neighbor, KNN, and Naïve Bayes to recognize the daily activities lying, standing, jogging, walking, climbing upstairs, and climbing downstairs on acceleration data collected from twenty-eight healthy adults for one hour. Step count, frequency z axis, frequency x axis, mean of maxima x, angle z, RMS of derivative x, energy y, entropy z, entropy x, and area z were the most frequently selected features from segments of 4 seconds. A Java application was used to annotate the data with markers, descriptions, and timestamps. KNN had the best accuracy (> 96 %) on individual datasets and the CART decision tree showed the best accuracy (> 85 %) on group datasets. The sensitivity of climbing stairs was the lowest with all the methods.

In a study [52] SVM, NN, and C4.5 decision tree as well as a model combining them with majority voting were trained on laboratory data and evaluated on data collected in free-living conditions.
52 individuals were wearing a tri-axial accelerometer at the lower back and other accelerometers on various body positions to gather reference information considered as the ground truth, both in a laboratory setting and without supervision. Activities were also annotated in diaries. Mean, standard deviation, kurtosis, skewness, range, cross-axis correlation, accelerometer angle, spectral energy, spectral entropy, peak frequencies, and cross-spectral densities were extracted from segments of 6,4 seconds. All the models showed good accuracy (> 92 %) on laboratory data but a significant decrease in accuracy (> 72 %) in free-living conditions. Majority voting had the best accuracy (95 % on laboratory data and 75 % in free-living conditions). It was concluded that daily-life data is essential when training and testing classification models in HAR.

A study [28] created the publicly available PAMAP2 dataset from sensor data collected with three inertial measurement units, each containing two tri-axial accelerometers, a gyroscope, and a magnetometer. They were placed on the chest, the dominant arm, and the dominant ankle. In addition, a heart rate monitor was positioned on the chest. Nine people performed 12 daily and six optional activities following a protocol. In addition, C4.5 decision tree, boosted C4.5 decision tree, bagging C4.5 decision tree, Naïve Bayes, and KNN were compared on the data. Time and frequency domain features were extracted from the acceleration data and mean and gradient from the heart rate data in segments of 5,12 seconds with a shift of 1 second. The boosted C4.5 decision tree and KNN reached the best accuracy and F1-score (both > 99 %).

A work [53] compared KNN, SVM, GMM, and RF to recognize daily activities with accuracy, F1-score, recall, precision, and specificity.
Sensor data was collected with accelerometers worn on the chest, right thigh, and left ankle by six persons who were asked to perform 12 daily activities that were annotated by an observer for 30 minutes. Mean, variance, median, interquartile range, skewness, kurtosis, root mean square, zero crossing, peak to peak, crest factor, range, DC component in FFT spectrum, energy spectrum, entropy spectrum, sum of the wavelet coefficients, squared sum of the wavelet coefficients and energy of the wavelet coefficients, the correlation coefficients of mean, and variance of the norm of each acceleration were extracted from segments of 1 second with 80 % overlap. A wrapper approach based on RF feature selection was used to select the features. KNN and RF reached the best performance with all the used metrics: F1-score, recall, precision, specificity, and accuracy (near 99 %).

3.1.1.2 Supervised deep learning approaches

A generic deep framework based on CNN and RNN was proposed in [43] for enhancing recognition accuracy and recognizing increasingly complex physical activities. The features were automatically extracted from raw sensor data by the CNN, and the temporal dynamics of the feature activations were modeled by the RNN. Multimodal sensor data could also be fused. The framework was evaluated with the task of recognizing standing, walking, sitting, and lying down and right-hand gestures in the OPPORTUNITY dataset [54] collected in a sensory-rich environment and 10 different hand gestures in the Skoda dataset [26] collected from assembly-line workers in a car production environment. The framework outperformed the previously published results, including CNN approaches, on the OPPORTUNITY and Skoda datasets with F1-score (between 89 % and 95 %).

A new study [55] suggested the first purely attention-based deep learning framework for HAR. In addition, a personalization framework was proposed to adapt the model to a specific user by acquiring data and labels from the user over time.
The framework was evaluated on the HHAR [33], PAMAP2 [28], and USC-HAD [56] datasets with F1-score (70 – 84 %) outperforming RF and the previously published deep learning approaches. Personalization increased the F1-scores (74 – 88 %). It was concluded that purely attention-based models are highly capable of extracting temporal dependencies in sensor based HAR.

Table 1 Related work with a supervised machine learning approach

Reference | Sensors | Activities | Methods | Metrics
Ling Bao et al., 2004 [50] | Accelerometers | 20 daily activities | C4.5 decision tree, decision tables, instance-based methods IBL and nearest neighbor, Naïve Bayes | Accuracy with C4.5 decision tree 84 %, with nearest neighbor 83 %
Nishkam Ravi et al., 2005 [51] | Accelerometer | lying, standing, jogging, walking, climbing up the stairs, climbing down the stairs | boosting, bagging, plurality voting, and stacking with decision tables, C4.5 decision tree, KNN, SVM, and Naïve Bayes | Accuracy with plurality voting > 90 %
Uwe Maurer et al., 2006 [49] | Accelerometers and light sensors of sport watches | sitting, standing, walking, ascending stairs, descending stairs, running | C4.5 decision tree, KNN, Naïve Bayes, Bayes Net | Accuracy with C4.5 decision tree 87 %
Luciana C. Jatoba et al., 2008 [48] | Accelerometers | lying, standing, jogging, walking, climbing upstairs, climbing downstairs | Decision tree methods CART and ID3, ANFIS, Nearest Neighbor, KNN, Naïve Bayes | Accuracy with CART decision tree 86 %, sensitivity
Illapha Cuba Gyllensten et al., 2011 [52] | Accelerometer | lying down, sitting / standing, dynamic / transitions, walking, running, cycling | SVM, NN, and C4.5 decision tree, majority voting | Accuracy with majority voting 95 % (lab data) / 75 % (free-living data)
Attila Reiss et al., 2012 [28] | Accelerometers, gyroscopes, magnetometers | 12 daily activities (PAMAP2) | C4.5 decision tree, boosted C4.5, bagging C4.5, Naïve Bayes, KNN | Accuracy and F1-score with boosted C4.5 decision tree and KNN > 99 %
Attal Ferhat et al., 2015 [53] | Accelerometers | 12 daily activities | KNN, SVM, GMM, RF | Accuracy with KNN and RF near 99 %, F1-score, recall, precision, specificity
Francisco Javier Ordóñez et al., 2016 [43] | Accelerometers, gyroscopes, magnetometers | 4 locomotion activities and 17 hand gestures in the OPPORTUNITY dataset, 10 hand gestures in the Skoda dataset | Combination of CNN and RNN | F1-score 89 % - 95 %
Davide Buffelli et al., 2020 [55] | Accelerometers, gyroscopes, magnetometers | Activities of the HHAR (6), PAMAP2 (12), and USC-HAD (12) datasets | Attention model | F1-score 70 – 84 %, with personalization 74 – 88 %

3.1.2 Unsupervised machine learning in HAR

Unsupervised methods do not need labeled data to train the model, but they have not been used as much as supervised machine learning methods in HAR. The performance of unsupervised methods has usually been inferior to supervised methods [2]. Research of unsupervised learning in HAR has mostly been conducted in clustering of handcrafted features, in weight initialization in pre-training, and in unsupervised feature learning prior to supervised fine tuning. Some works have been suggested to recognize human activities in an unsupervised manner [3].
DNNs have been used to create clustering-friendly representations and cluster assignments simultaneously for still image data, and impressive results have been achieved with unsupervised deep clustering frameworks for computer vision applications. However, they have not been able to exploit the sequential nature of sensor data and learn representations of human activities from raw sensor data of wearable devices [3].

3.1.2.1 Unsupervised traditional approaches

A study [8] investigated DBSCAN, HIER, GMM, and K-means clustering that were applied on means and standard deviations extracted from sensor data of accelerometers and gyroscopes of smartphones. Volunteers were asked to perform five activities common in daily living: walking, running, sitting, standing, and lying down for ten minutes. When the number of clusters was known, GMM showed 100 % accuracy. When the number of clusters was unknown, DBSCAN and HIER reached over 90 % clustering accuracy. The Calinski-Harabasz index was used to find an optimal number of clusters for the HIER algorithm.

In addition to supervised machine learning methods, the unsupervised methods K-means clustering, GMM, and Hidden Markov Model were compared in [53]. The Hidden Markov Model showed the best performance with F1-score, recall, precision, specificity, and clustering accuracy (near 84 %).

A study [36] suggested a protein interaction model MCODE to recognize human activities. MCODE, GMM, HIER, the centroid-based clustering methods K-means++ and K-medoids, and a graph-based Spectral clustering were compared. They were applied on mean, standard deviation, variance, skewness, kurtosis, correlation, and signal magnitude area features that were extracted from segments of 180 seconds with 75 % overlap of acceleration data obtained with smartphones. To evaluate the results two datasets were collected, one from basketball playing and another from race-walking activities. Video was recorded and used to manually annotate the activities.
MCODE was shown to outperform the other models with ARI, FM-index, accuracy (74 % – 88 %), recall, precision, specificity, and F1-score on the daily living activities collected by WISDM Lab [37] and on the two datasets collected in the study.

In [57], the centroid-based clustering methods K-means, K-mode, and CLARANS clustering, a hierarchical BIRCH clustering, and DBSCAN clustering were applied on sensor data from the UCI HAR [25] dataset collected with accelerometers and gyroscopes of smartphones. Features of the time and frequency domain had been extracted from segments of 2,56 seconds with 50 % overlap. K-means and DBSCAN clustering reached the highest clustering accuracy (95 %), also when the number of features was reduced.

3.1.2.2 Unsupervised deep learning approaches

A work [32] proposed a deep learning variational AE model for learning representations of human activities. Relative changes of position and orientation were calculated from sensor data of accelerometers and gyroscopes of wristbands as input to a variational AE consisting of bi-directional LSTMs. The model was evaluated on data collected and annotated in laboratory-based sessions with 10 persons and the epileptic patients’ daily activities of the public HHAR dataset [33]. The supervised classifiers, a decision tree classifier C4.5, KNN, and RF, were applied on the embedded mean vector of the variational AE. They outperformed those applied on hand-crafted features with F1-score. The unsupervised model reached a clustering accuracy higher than 87 %.

Recently, the first unsupervised, standalone, end-to-end deep clustering method Deep Sensory Clustering [3] was suggested to recognize human activities straight from raw sensor data of wearable devices. A recurrent AE with bi-directional GRUs and with reconstruction and future prediction objectives, and centroid-based Cluster assignment hardening were jointly used to learn clustering-friendly representations and to generate soft cluster assignments.
The approach was compared with K-means clustering, HIER, and end-to-end deep clustering for still images on the public datasets UCI HAR [25], Skoda [26] and MHEALTH [27]. It showed consistent improvement of performance with metrics of clustering accuracy (53 % – 75 %) and NMI.

Unsupervised Embedding Learning for HAR [1] using deep learning AE architecture was also recently suggested for unsupervised clustering in HAR. Mean, variance, standard deviation, median value, largest value, smallest value, and interquartile range features were extracted from raw sensor data as input to AE with objectives to minimize reconstruction, temporal coherence, and locality preserving losses. K-means clustering was then applied on the learned representations to find cluster assignments. The approach was compared with PCA and the traditional AE on the public datasets PAMAP2 [28], REALDISP [29] and SBHAR [30] with metrics of clustering accuracy (71 % – 92 %), ARI and NMI showing improved performance.

Table 2 Related work with an unsupervised machine learning approach

Reference | Sensors | Activities | Methods | Metrics
Yongjin Kwon et al., 2014 [8] | Accelerometers and gyroscopes of smartphones | Walking, running, sitting, standing, lying down | DBSCAN, HIER with Calinski–Harabasz index, K-means clustering, GMM | Clustering accuracy with DBSCAN and HIER > 90 %, NMI
Attal Ferhat et al., 2015 [53] | Accelerometers | 12 daily activities | Hidden Markov Model, K-means, GMM | Accuracy with Hidden Markov Model near 84 %, F1-score, recall, precision, specificity
Yonggang Lu et al., 2017 [36] | Accelerometers of smartphones | Basketball playing, race walking, daily activities of the WISDM dataset (6) | MCODE, GMM, HIER, K-means++, K-medoids, Spectral clustering | Clustering accuracy with MCODE 74 – 88 %, ARI, FM-index, recall, precision, specificity, F1-score
Jue Wang et al., 2018 [57] | Accelerometers and gyroscopes of smartphones | 6 daily activities of the UCI HAR dataset | K-means, K-mode, CLARANS, BIRCH, DBSCAN | Clustering accuracy with K-means and DBSCAN 95 %
Lu Bai et al., 2019 [32] | Accelerometers and gyroscopes of wristbands | 9 daily activities, epileptic patient daily activities (6) of the HHAR dataset | Deep learning variational AE with bi-directional LSTMs | Clustering accuracy > 87 %, F1-score
Alireza Abedin et al., 2020 [3] | Wearable devices | Activities of the UCI HAR (6), Skoda (10), and MHEALTH (12) datasets | End-to-end deep learning RNN AE with bi-directional GRUs and Cluster Assignment Hardening | Clustering accuracy 53 – 75 %, NMI
Sheng Taoran, 2020 [1] | Wearable devices | Daily and sport activities of the PAMAP2 (12), REALDISP (33), and SBHAR (6) datasets | Deep learning AE with temporal coherence and locality preserving loss and K-means clustering | Clustering accuracy 71 – 92 %, ARI, NMI

3.1.3 Semi-supervised machine learning in HAR

Relatively little work has been conducted with semi-supervised machine learning in HAR [4]. Semi-supervised approaches that use hand-crafted features have been applied to reduce the required amount of annotated training data [40]. Most research on semi-supervised learning in HAR has used sequential AEs to learn representations from unlabeled sensor data to improve supervised classification [4].

Although impressive classification performance has been achieved with semi-supervised learning in computer vision using denoising AEs with class-preserving augmentations, semi-supervised learning is challenging in HAR. The data segments in the sequential data should map to the clusters, but the boundaries of the segments are not known. In addition, class-preserving augmentations, such as rotation and mirroring with images, are difficult to define in HAR [4].

3.1.3.1 Semi-supervised traditional machine learning approaches

A study [41] explored self-learning and co-learning with a supervised method joint boosting on the sensor data in the PLCouple1 dataset [42].
The data was collected with accelerometers on the dominant wrist, the dominant hip, and the non-dominant thigh, and with 10 infra-red sensors. The daily activities of a male participant (actively watching TV or movies, dishwashing, eating, grooming, hygiene, meal preparation, reading paper/book/magazine, using computer, and using phone) had been annotated for 15 days with the help of an audio-visual recording system. The mean, variance, energy, spectral entropy, area under curve, pairwise correlation between the three axes, and the first ten FFT coefficients were extracted from the acceleration data in segments of 30 seconds with 50 % overlap. The number of activations of the infra-red sensors was also calculated as a feature. Both self-learning and co-learning improved the accuracy of the classifier. Co-learning using two types of sensors reached the best accuracy (40 % when the share of used labels was 2,5 %) compared to self-learning and supervised training.

In [35], the semi-supervised methods self-learning, En-Co-Training, and democratic co-learning were compared to find suitable methods to augment a HAR classifier with new unlabeled data after it had been deployed in a mobile device. The mean, variance, and the FFT coefficients between 1 and 10 Hz were extracted from segments of one second of the acceleration and GPS speed data of smartphones worn by 17 participants who were staying in one place, walking, and running for 90 minutes. It was shown that En-Co-Training and democratic co-learning performed well when the accuracy of the initial classifier was low, between 75 % and 80 %. When the initial accuracy was high, 90 %, the methods did not improve the accuracy of the initial classifiers but did not decrease it either. Self-learning did not significantly improve the accuracy of the initial classifier. Democratic co-learning was nearly as good as active learning, where a user is asked to label the least confident predictions.
It was able to improve the initial accuracy from 84 % to 90 %.

3.1.3.2 Semi-supervised deep learning approaches

A work [40] presented two semi-supervised CNN methods, a denoising CNN AE with a supervised CNN and a convolutional ladder network, for recognizing human activities from both labeled and unlabeled raw sensor data split into segments of 1 second with 50 % overlap. Both models outperformed a supervised CNN classifier pretrained with unlabeled data, self-learning with logistic regression, and a pseudo-label method on the public ActiTracker [46], PAMAP2 [28], and MHEALTH [27] datasets measured with F1-score (> 75 % when the share of the labels was 1 %). It was shown that adjusting low-level features based on unlabeled data in the CNN AE and the convolutional ladder network improved the high-level features.

A new semi-supervised sequence classification approach [4] based on change point detection was suggested to learn representations that incorporate pairwise similarity information about data instances in both unlabeled and labeled sensor data. The segments between two change points were treated as similar, and adjacent segments on opposite sides of a change point were treated as dissimilar. Similar and dissimilar pairs were fed to a TCN, which produced means of empirical distributions that were used as representations of the data. The learned representations were shown to outperform the representations learned by a denoising AE in a semi-supervised setting using a DNN classifier. The models were tested on simulated data and on the real datasets HCI [45] and WISDM [37] with F1-score (65 % when the share of the labels was 3 %). Also, the results were close to the results of training a supervised classifier on the learned representations.

A semi-supervised approach using an AE and a Siamese NN [1] was also recently proposed for HAR.
Unsupervised temporal and feature consistency criteria were applied through the AE, and weakly supervised label consistency criteria with pairwise constraints were applied through the Siamese NN to the mean, variance, standard deviation, median, and interquartile range extracted from raw sensor data. K-means clustering was applied to the learned clustering-friendly representations. The model outperformed the unsupervised Embedding Learning for HAR [1] and the supervised methods RNN with LSTM, CNN, DNN, SVM, C4.5 decision tree, and a boosted C4.5 using 10 % of the labeled data on the PAMAP2 dataset [28] with a clustering accuracy of 99 %. When the share of the labels was 5 %, the model reached a clustering accuracy of 97 %.

Table 3 Related work with a semi-supervised machine learning approach

Reference | Sensors | Activities | Methods | Metrics
Maja Stikic et al., 2008 [41] | Accelerometers, infra-red sensors | 9 daily activities of the PLCouple1 dataset | Self-learning and co-learning with joint boosting | Accuracy 40 % when labels 2,5 %
Brent Longstaff et al., 2010 [35] | Accelerometer and GPS speed of smartphones | Staying in one place, walking, running | Self-learning with C4.5 decision tree, En-Co-Training, and democratic co-learning with C4.5 decision tree, Naïve Bayes and SVM | Accuracy 90 % when initial accuracy 84 %
Ming Zeng et al., 2018 [40] | Accelerometers, gyroscopes, magnetometers, temperature, heart rate data, ECG data | Daily and sport activities of the ActiTracker (6), PAMAP2 (12), and MHEALTH (12) datasets | Denoising CNN AE with supervised CNN, Convolutional ladder network | F1-score > 75 % when labels 1 %
Nauman Ahad et al., 2020 [4] | Accelerometers, gyroscopes | Gesture recognition of the HCI dataset (5), daily activities of the WISDM (6) dataset | TCN with Change point detection and DNN | F1-score 65 % when labels 3 %
Sheng Taoran, 2020 [1] | Accelerometers, gyroscopes, magnetometers, temperature, heart rate data, ECG data | The PAMAP2 (12) dataset | AE with Siamese NNs with temporal, feature, and label consistency criteria and K-means clustering | Clustering accuracy 97 % when labels 5 % and 99 % when labels 10 %

3.2 Summary of related work

Although various machine learning approaches have been successfully used in sensor based HAR, most of the works have used supervised machine learning methods that require all the training data to be labeled. For example, traditional machine learning methods such as the decision trees used in [48-50], the boosted decision tree and KNN used in [28], and KNN and RF in [53] achieved good accuracy and outperformed other supervised methods such as SVM and Naïve Bayes, reaching accuracies from over 80 % up to 99 %. Supervised deep learning approaches such as the combination of CNN and RNN [43] and the recently proposed attention based NN [55] outperformed the traditional methods and were able to recognize complex activities more accurately. However, the supervised deep learning approaches need even higher volumes of labeled training data and are not feasible methods in this thesis.

Unsupervised methods can find patterns in unlabeled data, and promising results have been achieved with traditional unsupervised approaches such as K-means clustering and DBSCAN in [57], and DBSCAN and HIER in [8], with over 90 % accuracy. Also, the recent unsupervised deep learning approaches in [1,3,32] were able to successfully use deep AE frameworks on sequential sensor data of wearable devices with accuracy up to 92 %. However, the performance of unsupervised methods has been inferior to that of supervised methods. In addition, even with unsupervised machine learning, at least some labeled data is required to represent the ground truth when the performance of the model is evaluated.

Some works have proposed semi-supervised machine learning methods in HAR that combine unsupervised learning on a large amount of unlabeled data with a small, labeled data set.
For example, a work [41] improved the accuracy of the initial classifier with self-learning and co-learning, and a study [35] improved the accuracy of the initial classifier from 84 % to 90 % with En-Co-Training and democratic co-learning. Deep AEs have been used to learn representations that improve the performance of a supervised classifier. For example, a denoising CNN AE with a supervised CNN classifier and a convolutional ladder network were studied in [40], achieving an F1-score of over 75 %. The recent work [1] proposed an AE with a Siamese NN with temporal, feature, and label consistency criteria followed by K-means clustering. It achieved 97 % accuracy when the share of the labels was 5 % and 99 % accuracy when the share of the labels was 10 % of all the training data.

Unlike in most previous works, no labels are available for the data used in this thesis to represent the ground truth. Therefore, there is no direct way to use supervised or semi-supervised machine learning methods, or even to evaluate the performance of the unsupervised methods by comparing the results with the true clusters. A new dataset of one user is collected and annotated to make it possible to evaluate the performance of the unsupervised machine learning methods and also to use supervised and semi-supervised methods when recognizing activities from the original unannotated acceleration data.

Another difference is that the data used has been collected with only one sensor, a tri-axial accelerometer positioned on the thigh of the participants. Based on a work [50], where it was shown that a sensor positioned on a thigh was the most powerful for recognizing physical activities, it is assumed that it is possible to recognize basic activities like sleeping, sitting, sitting in a car, walking around, taking a walk, and jogging from the data collected with the Axivity accelerometer positioned on a thigh.
3.3 Open questions in HAR

The challenge of annotating sensor data in HAR and the large amount of continuously streaming unlabeled data have increased the interest in methods that help to reduce the need for labeled data. In semi-supervised machine learning, a small, labeled dataset is used together with a large amount of unlabeled data, but other methods have also been studied in HAR to train classifiers with less labeled training data. In active learning, a user is only asked to label the training data instances that the classifier has not been able to classify with high confidence. In transfer learning, on the other hand, a pre-trained classifier is used and only fine-tuned with a small amount of labeled data. The classifier may have been pre-trained on data collected, for example, from other persons, by other types of sensors or in a different environment [2].

Intra-class variability, both between people and within the data stream of one person, is another challenge and a current research area in HAR. The sensor data of different people typically varies within the same activities, and the sensor data of one person does not stay static over time either. Existing activities can be expected to change and new activities to emerge. How to adapt a model that has been trained on the sensor data of a group of people so that it better recognizes the activities of other persons, and also activities in the evolving data stream of the same person, is actively studied in HAR. The aim is to personalize a user-independent model to increase its accuracy when recognizing activities from an individual data stream and also to adapt it to evolving activities [58].

Using mobile devices with limited resources to recognize human activities has also become an active research area in HAR. Sensors can be embedded in mobile devices like smartphones. Either the device transmits the data to a backend server, where the HAR model is applied, and receives the results from it, or the HAR model is implemented directly on the mobile device.
The latter has become a feasible option because of the improved computational power of the devices. In mobile real-time activity recognition, both time and accuracy are key criteria for measuring the performance of a HAR model. An interesting possibility is also to aggregate recognized activities from users’ devices on a high-level platform like the cloud, where they can be used and studied together with other information, for example information related to a location. In context aware activity recognition, the aim is also to leverage information from the context of the surrounding environment to recognize higher level and more complex activities more accurately [58].

Incremental and active learning have become a new and promising research area in HAR. In this approach, an initial model is trained on a small amount of labeled data, and the model is then continuously updated with incremental and active learning, asking for labels only for informative samples in a continuous data stream [58]. In incremental learning, a model is not retrained with new data but only incrementally updated to adapt it to new instances in a data stream. Incremental learning without any user interaction has also been suggested in HAR. In this approach, only the labels predicted by the model are used when updating the model. However, this kind of totally autonomous learning can lead to concept drift and incorrect predictions [24].

4 Extracting activities from Axivity accelerometer device

First, to answer the research question RQ1: “Can different activity levels be reliably extracted from an accelerometer device with machine learning using only unlabeled acceleration data?”, the unsupervised machine learning algorithm K-means clustering is applied to the unlabeled acceleration data because it has shown good performance in HAR research [57]. The aim is to find clusters that would correspond to the physical activities to be recognized.
To be able to evaluate the reliability of this unsupervised method, new acceleration data is recorded with the Axivity device and annotated. K-means clustering is applied to data containing both unannotated and new, annotated data. The clusters assigned to the annotated data can then be compared with the true labels given in the annotation.

Next, to find an answer to the research question RQ2: “Can machine learning models that are trained with new labeled acceleration data of one person be used to annotate unlabeled acceleration data reliably?”, the supervised machine learning algorithms KNN and RF are applied to the new, annotated acceleration data. KNN and RF have shown competitive performance compared to other traditional supervised methods in HAR [28, 53]. The aim is to train two separate classifiers that can predict physical activities from unannotated acceleration data.

To answer the research question RQ3: “How can both unlabeled and new labeled acceleration data be used together when extracting activities from unlabeled acceleration data?”, the previously trained supervised classifiers are used with the En-Co-Training method in a semi-supervised setting. En-Co-Training is used as in [35], but with two classifiers that make separate predictions instead of a majority vote. In addition, the cut point analysis of the OMGUI software [18] is performed. The activity levels produced by the cut point analysis are used as reference information to increase the confidence of selecting correct pseudo-labels. The aim is to leverage knowledge from the unannotated data and to improve the classifiers so that they generalize better to the data collected from other users of the Axivity device.

Finally, the research question RQ4: “How to get information about the performance of the solution without true labels and the ground truth?” is examined. A new metric is proposed.
It calculates the fraction of all the predictions whose activity level is correct according to the cut point analysis of the OMGUI software. The new metric is used to get reference information about how reliably the classifiers predict activities from unannotated acceleration data.

The Jupyter Notebook IDE, Python version 3.6.8, the Scikit Learn library version 0.20.3 and the Scipy library version 1.2.1 are used when implementing the solution and performing experiments with the data.

4.1 Data

The data of this study has been collected with the Axivity accelerometer device from 12 people, who were asked to wear an Axivity accelerometer on a thigh for one week. The individuals were between 27 and 46 years of age. The participants were also asked about their physical activity rate during the week, age, weight, and height. Table 4 shows the background information of the participants.

Table 4 Participants’ background information

Characteristics | Values
Age (years), mean (SD) | 36,8 (5,4)
BMI, mean (SD) | 23,0 (2,5)
Physical activity during the week, n (%) |
Rarely | 3 (25)
A few times a week | 5 (42)
Almost every day | 4 (33)

The acceleration data of the Axivity device is first converted from binary files to CSV files in units of g (1 g = 9.81 m/s²) with the OMGUI software [18]. A CSV file is created for each day of a participant. The data consists of timestamps and acceleration values of the X, Y and Z axes that have been recorded at a frequency of 100 Hz. Thus, there are 360 000 recordings per hour and about 8,6 million recordings per day. The acceleration values of the X, Y and Z axes of one user for one day are shown in Figure 5.

Figure 5 Sensor data from Axivity accelerometer in X, Y and Z axes for one day

4.2 New annotated data

To obtain annotated data, more acceleration data of one person is recorded and labeled for one week. The true labels are saved in a note application of a mobile phone at a minute level and then converted to Excel files.
The aim is to label basic daily activities that could be reliably recognized also with traditional machine learning methods. The activity types should also cover all the activities performed during the week. In addition, it should be easy to compare the activity types with the activity levels later produced by the cut point analysis of the OMGUI software [18].

The activities are annotated using the following labels: 0 = sleeping, 1 = sitting, 2 = sitting in a car, 3 = walking around and doing tasks, 4 = doing workout, 5 = taking a walk, 6 = jogging, 9 = a break in the annotation, and 10 = to be annotated automatically (used with the unannotated data). The label 4 is combined with the label 3 because the results of the two labels seem to be close to each other in the analysis.

The labels can be interpreted as an ordinal scale of increasing activity levels. Sleeping, sitting, and sitting in a car correspond to sedentary time or light activity. Walking around and doing tasks can be interpreted as sedentary time, light activity, or moderate activity. Taking a walk should be light or moderate activity, and jogging should be vigorous activity [10]. The activity types and the corresponding activity levels are shown in Table 5.
Table 5 Activity types and corresponding activity levels

Label | Activity type | Activity level
0 | Sleeping | Sedentary time / Light activity
1 | Sitting | Sedentary time / Light activity
2 | Sitting in a car | Sedentary time / Light activity
3 | Walking around and doing tasks | Sedentary time / Light activity / Moderate activity
4 | Workout (will be combined with the label 3) | Sedentary time / Light activity / Moderate activity
5 | Taking a walk | Light activity / Moderate activity
6 | Jogging | Vigorous activity
9 | A break in the annotation |
10 | To be annotated (will be used with the unannotated data) |

4.3 Methodology

The process used in this thesis follows the steps commonly used in the HAR process: 1) data collection, 2) preprocessing of sensor data, 3) feature extraction, and 4) applying machine learning algorithms. The result is 5) a model that can recognize activities from new sensor data [1]. The HAR process is shown in Figure 6.

Figure 6 Process of human activity recognition

4.3.1 Pre-processing of the acceleration data

4.3.1.1 Segmentation

The acceleration data of the X, Y and Z axes that has been collected with the Axivity device is split into consecutive segments to separate different activities in the sensor data stream, so that each segment can be labeled and recognized as one physical activity. Segment lengths of 1, 5 and 10 seconds are tested, and the length is set to 10 seconds (1 000 samples at 100 Hz), which seems to be a suitable window size for recognizing the previously chosen activity types.

The timestamp of each segment is compared to the timestamps of the annotation data. If the annotation is 9 (= a break in the annotation), the segment is discarded. Otherwise, the segment is processed further.

4.3.1.2 Butterworth low-pass filtering

The acceleration data of the segments is filtered because the raw sensor data is scattered and noisy.
Butterworth low-pass filtering [20] is used to keep the low frequencies that are important for recognizing human physical activities and to discard higher frequencies. The order is set to 4 and the cutoff frequency is set to 10 Hz. The order of the Butterworth filter affects the sharpness of the cutoff: the higher the order is, the steeper the transition from the passband to the stopband is.

4.3.2 Feature extraction

4.3.2.1 Time domain features

Features are extracted from each filtered segment of the sensor data because they are more effective for separating different activities than the raw sensor data. A set of statistical features is first extracted from the segments in the time domain, where the segments are represented with respect to time as in the original sensor data stream. The features that are extracted from the filtered segments in the time domain are shown in Table 6.

The following statistical features are calculated from the filtered acceleration values of each segment for the X, Y and Z axes separately: mean, median, standard deviation, largest value, smallest value, interquartile range, skewness, kurtosis, and root mean square. Also, peak prominences, which measure how much the peaks of the signal stand out from the surrounding baseline, and peak widths, measured at the contour lines halfway up the peaks, are calculated and summed over the segments of each axis. Approximate entropy is also calculated to quantify the regularity of the fluctuations in the filtered acceleration values of the segments. The smaller the approximate entropy is, the more regular the signal is in the segment.

In addition, Pearson correlation coefficients between the axes X and Y, X and Z, and Y and Z are calculated from the segments, and signal vector magnitudes are calculated from the filtered acceleration values of the X, Y and Z axes of each segment to describe the intensity of the movements with the equation 9 below.
$$SVM = \sqrt{x^2 + y^2 + z^2} \qquad (9)$$

, where $x$, $y$ and $z$ are the filtered acceleration values of the X, Y and Z axes.

Table 6 Features extracted from the segments in the time domain

Axis | Extracted features
X axis | X Mean, X Median, X Standard deviation, X Largest, X Smallest, X Interquartile range, X Skewness, X Kurtosis, X Root Mean Square, X Peak prominences sum, X Peak widths sum, X Approximate entropy
Y axis | Y Mean, Y Median, Y Standard deviation, Y Largest, Y Smallest, Y Interquartile range, Y Skewness, Y Kurtosis, Y Root Mean Square, Y Peak prominences sum, Y Peak widths sum, Y Approximate entropy
Z axis | Z Mean, Z Median, Z Standard deviation, Z Largest, Z Smallest, Z Interquartile range, Z Skewness, Z Kurtosis, Z Root Mean Square, Z Peak prominences sum, Z Peak widths sum, Z Approximate entropy
Several axes | Pearson correlation (X, Y), Pearson correlation (X, Z), Pearson correlation (Y, Z), Signal vector magnitude

4.3.2.2 Frequency domain features

Statistical features are also extracted from the segments in the frequency domain. The filtered acceleration data is transformed from the time domain to the frequency domain to show how much of the signal lies within each frequency band over a range of frequencies. FFT is used to transform the signal data of the segments, which is represented with respect to time, to the magnitude values of the frequency content of the signal. The features that are extracted from the segments in the frequency domain are shown in Table 7.

The following statistical features are calculated from the magnitude values of the frequency content: mean, median, standard deviation, largest value, smallest value, interquartile range, skewness, kurtosis, and root mean square. Power spectral densities, which measure the signal’s power content versus frequency, are calculated for the frequencies from 0 to 10 Hz within frequency bins of 1 Hz, and the dominant power spectral densities are calculated from the segments of the axes.
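The transformation and the band-wise power features described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the thesis. The function name `frequency_features`, the plain periodogram as the power spectral density estimate, the reading of "PSD k" as the sum over the bin from k to k + 1 Hz, and the definition of the dominant band as the strongest bin are all assumptions made for illustration.

```python
import numpy as np

FS = 100  # sampling frequency in Hz, as stated in Section 4.1


def frequency_features(segment, fs=FS, max_hz=10):
    """Frequency-domain features for one filtered axis segment (hypothetical helper)."""
    segment = np.asarray(segment, dtype=float)
    n = segment.size
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)      # frequency of each FFT bin in Hz
    magnitudes = np.abs(np.fft.rfft(segment))   # magnitude of the frequency content

    # Assumption: plain periodogram as the PSD estimate (no windowing or averaging).
    psd = magnitudes ** 2 / (fs * n)

    # Assumption: "PSD k" is the PSD summed over the 1 Hz wide bin [k, k + 1) Hz.
    band_psd = np.array([
        psd[(freqs >= k) & (freqs < k + 1)].sum() for k in range(max_hz + 1)
    ])

    return {
        "magnitudes_mean": magnitudes.mean(),
        "magnitudes_median": np.median(magnitudes),
        "magnitudes_std": magnitudes.std(),
        "magnitudes_max": magnitudes.max(),
        "magnitudes_min": magnitudes.min(),
        "magnitudes_rms": np.sqrt(np.mean(magnitudes ** 2)),
        "band_psd": band_psd,
        "dominant_band": int(np.argmax(band_psd)),  # index of the strongest 1 Hz bin
    }


# Usage: a 10 second segment at 100 Hz containing a 3.5 Hz oscillation.
t = np.arange(10 * FS) / FS
features = frequency_features(np.sin(2 * np.pi * 3.5 * t))
print(features["dominant_band"])
```

With a 10 second segment the FFT resolution is 0,1 Hz, so each 1 Hz bin aggregates ten FFT coefficients. Skewness, kurtosis and the interquartile range of the magnitudes are left out to keep the sketch short.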
Normalized spectral entropy is calculated to measure the uniformity of the power spectral densities in the segments of the axes. The larger the normalized spectral entropy is, the more uniform the power spectral densities are in the segment.

Table 7 Features extracted from the segments in the frequency domain

Axis | Extracted features in the frequency domain
X axis | X Magnitudes Mean, X Magnitudes Median, X Magnitudes Standard deviation, X Magnitudes Largest, X Magnitudes Smallest, X Magnitudes Interquartile range, X Magnitudes Skewness, X Magnitudes Kurtosis, X Magnitudes Root Mean Square, X PSD (Power Spectral Density) 0, X PSD 1, X PSD 2, X PSD 3, X PSD 4, X PSD 5, X PSD 6, X PSD 7, X PSD 8, X PSD 9, X PSD 10, X Dominant PSD, X Normalized Spectral entropy
Y axis | Y Magnitudes Mean, Y Magnitudes Median, Y Magnitudes Standard deviation, Y Magnitudes Largest, Y Magnitudes Smallest, Y Magnitudes Interquartile range, Y Magnitudes Skewness, Y Magnitudes Kurtosis, Y Magnitudes Root Mean Square, Y PSD (Power Spectral Density) 0, Y PSD 1, Y PSD 2, Y PSD 3, Y PSD 4, Y PSD 5, Y PSD 6, Y PSD 7, Y PSD 8, Y PSD 9, Y PSD 10, Y Dominant PSD, Y Normalized Spectral entropy
Z axis | Z Magnitudes Mean, Z Magnitudes Median, Z Magnitudes Standard deviation, Z Magnitudes Largest, Z Magnitudes Smallest, Z Magnitudes Interquartile range, Z Magnitudes Skewness, Z Magnitudes Kurtosis, Z Magnitudes Root Mean Square, Z PSD (Power Spectral Density) 0, Z PSD 1, Z PSD 2, Z PSD 3, Z PSD 4, Z PSD 5, Z PSD 6, Z PSD 7, Z PSD 8, Z PSD 9, Z PSD 10, Z Dominant PSD, Z Normalized Spectral entropy

4.3.2.3 Standardization

All the extracted features are standardized with Z-score standardization, which changes the values of the features to a common scale so that the mean value is 0 and the standard deviation is 1, with the equation 10 below. The standardization prevents the features with a larger scale from dominating in machine learning algorithms.

$$z = \frac{x - \mu}{\sigma} \qquad (10)$$

, where $x$ is the value of the feature, $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
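As a small worked example of the Z-score standardization, the following NumPy sketch standardizes each feature column separately. The helper name `zscore_standardize` and the use of the population standard deviation are assumptions for illustration; the thesis does not state which routine was used.

```python
import numpy as np


def zscore_standardize(features, mean=None, std=None):
    """Standardize each feature column to mean 0 and standard deviation 1.

    Hypothetical helper. If `mean` and `std` are given (for example, computed
    on training data), they are reused instead of being estimated again.
    """
    features = np.asarray(features, dtype=float)
    if mean is None:
        mean = features.mean(axis=0)
    if std is None:
        std = features.std(axis=0)  # assumption: population standard deviation
    # Constant features have std 0; divide by 1 instead to avoid division by zero.
    safe_std = np.where(std == 0, 1.0, std)
    return (features - mean) / safe_std, mean, std


# Two features on very different scales end up on a common scale.
raw = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])
standardized, mu, sigma = zscore_standardize(raw)
print(standardized)
```

Returning the estimated mean and standard deviation makes it possible to apply the same scaling to data that was not used to estimate them, which is one possible design choice when labeled and unlabeled data are processed separately.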
4.3.3 Applying machine learning algorithms

The unsupervised machine learning algorithm K-means clustering is applied to the time and frequency domain features extracted from the segments of the unannotated acceleration data to study whether K-means clustering can find clusters with similar features. Similar features between the segments of the data would suggest that the activity types of the segments could also be the same.

The supervised machine learning methods KNN and RF, which are suitable for a small amount of data, are applied to the time and frequency domain features extracted from the segments of the new, labeled acceleration data. The aim is to study whether the trained KNN and RF models can be used to reliably predict labels for the original unannotated data collected from the participants of the study.

In addition, the semi-supervised method En-Co-Training is used with the KNN and RF models to leverage knowledge from the unannotated acceleration data and to improve the generalization performance of the models, which have only been trained on the labeled data of one person. The aim is to study whether activities can be predicted more reliably from the original unannotated acceleration data using both unlabeled and labeled data in a semi-supervised setting.

4.3.4 Cut point analysis of OMGUI software

The cut point analysis of the OMGUI software [18] is performed to produce activity levels from the unannotated sensor data as reference information. The activity levels of the cut point analysis can be compared to the labels that are predicted by the KNN and RF models, and the comparison can help the human evaluation of the predicted labels when the true labels and the ground truth are not known.

The cut point analysis of the OMGUI software produces the following activity levels: 0 = sedentary time, 1 = light activity, 2 = moderate activity and 3 = vigorous activity, based on the approach proposed in [17].
The analysis predicts the energy expenditure of a person in units of metabolic equivalent of task (MET) based on mean signal vector magnitude values that are extracted from segments of acceleration data. It calculates the signal vector magnitudes and subtracts gravity (1 g) with the equation 11 below, and it sets the thresholds between the activity levels to 1,5 MET, 4 MET and 7 MET as suggested in [17]. The activity levels of the cut point analysis are shown in Table 8.

$$SVM = \sqrt{x^2 + y^2 + z^2} - 1 \qquad (11)$$

Table 8 Activity levels produced by the cut point analysis of OMGUI software

Label | Activity level | Measurement
0 | Sedentary time | < 1,5 MET
1 | Light activity | >= 1,5 MET, < 4 MET
2 | Moderate activity | >= 4 MET, < 7 MET
3 | Vigorous activity | >= 7 MET

MET measures the amount of oxygen consumed per kilogram of body weight per minute. 1 MET means that a person consumes approximately 3,5 millilitres of oxygen per kilogram of body weight in a minute, which is roughly equivalent to being at rest. The energy expenditure may differ between persons based on several factors, for example age and fitness level, but thresholds can be set to approximate the difference between activity levels [10].

In the interface of the cut point analysis tool, the predictions are chosen to be made every minute. A fourth-order Butterworth band-pass filter between 0,5 and 20 Hz is chosen. The position of the device is chosen to be on a hip instead of on a wrist because it corresponds better to the true position on a thigh.

Although the cut point analysis predicts the activity levels of the segments based on a single feature, the result of the analysis is still interesting. It is assumed that the predicted activity levels can help to evaluate the reliability of the solution that is implemented in the thesis.
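The thresholds of Table 8 and the comparison of predicted activity types with the cut point levels can be sketched in plain Python. This is one possible reading, not the thesis implementation. It assumes that a prediction counts as correct when the cut point level belongs to the levels listed for that activity type in Table 5, and that each predicted label has already been aligned with the cut point level of the same minute. The names `met_to_level`, `ALLOWED_LEVELS` and `intensity_agreement` are hypothetical.

```python
def met_to_level(met):
    """Map an energy expenditure in MET to the activity levels of Table 8."""
    if met < 1.5:
        return 0  # sedentary time
    if met < 4:
        return 1  # light activity
    if met < 7:
        return 2  # moderate activity
    return 3      # vigorous activity


# Activity levels regarded as consistent with each activity type (Table 5).
# Label 4 is omitted because it is combined with label 3.
ALLOWED_LEVELS = {
    0: {0, 1},     # sleeping
    1: {0, 1},     # sitting
    2: {0, 1},     # sitting in a car
    3: {0, 1, 2},  # walking around and doing tasks
    5: {1, 2},     # taking a walk
    6: {3},        # jogging
}


def intensity_agreement(predicted_labels, cut_point_levels):
    """Fraction of predictions whose activity type matches the cut point level.

    Assumption: both sequences have the same length and are aligned in time.
    """
    pairs = list(zip(predicted_labels, cut_point_levels))
    if not pairs:
        raise ValueError("at least one prediction is required")
    hits = sum(level in ALLOWED_LEVELS[label] for label, level in pairs)
    return hits / len(pairs)


# Usage: four predictions, of which the last one (jogging at a light level) disagrees.
levels = [met_to_level(m) for m in (1.0, 5.0, 8.2, 2.0)]
print(intensity_agreement([0, 5, 6, 6], levels))
```

Because several activity levels are accepted for most activity types, a high value of such a measure indicates consistency with the cut point analysis and not necessarily a correct activity type.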
The signal vector magnitude has been shown to correlate well with the intensity of the physical activity, that is, the activity level [17], and the activity level should relate to the activity types that are predicted by the models [10].

5 Experiments

5.1 Recording and annotating new data

To evaluate the result of the unsupervised method K-means clustering, new acceleration data is recorded with the Axivity device and annotated for one week. The activities are carefully annotated at a minute level, which is the same level as will be used in the cut point analysis of the OMGUI software [18]. The same level of annotation makes it easy to compare the results later.

Making annotations at a minute level is at first quite challenging in practice, but a practical way is found with a note application of a mobile phone and a systematic way to annotate the activities. It is sometimes difficult to remember the activities performed every minute, especially for short periods, and to recognize exactly when one activity has changed to another. The most practical way is to use the label 9 (= a break in the annotation) for the time periods when the annotation has not succeeded for some reason or the device has been taken off, for example because of taking a shower.

The true labels are saved in a note application of a mobile phone. A label is given as exactly as possible every time a new activity begins. No other labels are given, to keep the number of labels as small as possible. The labels in the note application are then converted to Excel files. The annotations are quality checked by comparing them to the corresponding activity levels of the cut point analysis of the OMGUI software to find and correct clear typing errors in the annotation.

5.2 Analyzing the new annotated data

5.2.1 Visualisation of the segments

The new acceleration data that has been annotated is first pre-processed with segmenting and filtering, and the features are extracted from the segments of the X, Y and Z axes.
Filtered segments of the X, Y and Z axes of different activities are first plotted. In addition, the peaks of the filtered segments of the Y axis with the contour heights of the peaks, and the magnitude values of the frequency content of the Y axis are plotted to visually examine possible differences between the annotated activities. There seem to be clear differences between the segments annotated as different activities. The visualization of each activity type is shown in Figures 7-9.

Figure 7 Filtered segments of the X, Y and Z axes (panels: sleeping, sitting, sitting in a car, walking around and doing tasks, taking a walk, jogging)

Figure 8 Peaks and contour heights of the filtered segments of the Y axis (panels: sleeping, sitting, sitting in a car, walking around and doing tasks, taking a walk, jogging)

Figure 9 Magnitude values of segments of the Y axis in the frequency domain (panels: sleeping, sitting, sitting in a car, walking around and doing tasks, taking a walk, jogging)

5.2.2 Positioning of the Axivity device

The positioning of the Axivity device on the thigh is checked with the median of the acceleration values of each axis in each segment. The axis whose median is closest to 1 g or -1 g, i.e., gravity (9.81 m/s²), is the vertical axis, and the others are the horizontal axes in the segment. For example, in the segments that have been labeled as sitting or sitting in a car, the median of the Z axis is close to -1 and the vertical axis is Z. The median of the X axis is close to -1 or 1 in the segments labeled as sleeping. In addition, the median of the Y axis is close to 1 in the segments labeled as walking around and doing tasks, taking a walk, or jogging.

If the vertical axis differs between segments labeled as the same activity, it should be considered whether the positioning of the Axivity device has changed during recording. In that case, a conversion of the axes may be needed to keep the acceleration data comparable in the analysis.
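The check described in Section 5.2.2 can be sketched as follows. The function name and the input format (a dictionary from axis name to samples in units of g) are assumptions made for illustration only, and the sample values are invented.

```python
from statistics import median


def vertical_axis(segment):
    """Return the axis whose median is closest to +1 g or -1 g.

    segment maps an axis name ("X", "Y", "Z") to a list of acceleration
    samples in units of g (hypothetical input format).
    """
    medians = {axis: median(values) for axis, values in segment.items()}
    # Distance of the absolute median from 1 g; the smallest distance wins.
    return min(medians, key=lambda axis: abs(abs(medians[axis]) - 1.0))


# Invented example segments: a sitting-like and a walking-like orientation.
sitting = {"X": [0.05, 0.10, 0.02], "Y": [0.10, 0.20, 0.15], "Z": [-0.98, -1.01, -0.97]}
walking = {"X": [0.10, -0.20, 0.00], "Y": [0.90, 1.10, 1.00], "Z": [0.20, -0.10, 0.10]}
```

With these invented values, `vertical_axis(sitting)` returns "Z" and `vertical_axis(walking)` returns "Y". Applying the function to all segments of one activity and counting the returned axes would reveal segments in which the device orientation differs from the rest.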
5.3 Finding clusters

5.3.1 K-means clustering

The unsupervised machine learning method K-means clustering is first used to find clusters in data that includes both unannotated acceleration data collected from participants of the study and new annotated acceleration data. The clusters with similar features could correspond to similar activity types performed by the users of the Axivity accelerometer device.

One day of data from each of eight users and four days of the annotated data of one user are selected and pre-processed. The data is split into segments of 10 seconds and filtered with a fourth-order Butterworth low-pass filter with a cutoff frequency of 10 Hz. Features are extracted in the time and frequency domain from each filtered segment of the X, Y and Z axes and standardized. Then, K-means clustering is applied to find clusters with similar features.

The results with different values of the parameter k (the number of clusters) are first evaluated using the Silhouette coefficient, which measures on a scale from -1 to 1 how close the data instances are to the instances of their own cluster compared with the instances of other clusters. The best value of k is 2, with a Silhouette coefficient of 0.66. However, k = 6, the true number of labels, is selected; its Silhouette coefficient is 0.23.

5.3.2 Visualisation of clusters

Scatter plots of selected features, colored by cluster, are plotted to analyze how well the features of the segments separate the clusters found by K-means clustering. For example, the largest value of the Y axis and the mean of the magnitude values of the Y axis separate the 6 clusters assigned by K-means clustering quite well. The scatter plots of some selected features are shown in Figure 10.
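The Silhouette coefficient used in Section 5.3.1 to compare values of k can be illustrated with a small pure-Python sketch. For each instance, a is the mean distance to the other instances of its own cluster and b is the smallest mean distance to the instances of any other cluster; the instance scores (b - a) / max(a, b), and the coefficient is the mean over all instances. The function name and the toy points are assumptions for illustration and do not reproduce the implementation used in the thesis.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the own cluster
        # (the distance of the point to itself is 0, hence len(own) - 1).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight groups far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # groups match the clusters
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # clusters cut across the groups
```

In this toy case the matching assignment scores close to 0.9 and the mismatched assignment scores below 0, which shows why a higher coefficient indicates better separated clusters.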
Figure 10 Scatter plots of the selected features with the information of belonging to the clusters found by K-means clustering

Next, the clusters found by K-means clustering in all the training data and the true clusters of the annotated data are visualized with PCA using two principal components. Some similarity can be seen between the clusters found by K-means clustering and the true clusters. It can also be seen that no annotated data is assigned to cluster 3 found by K-means clustering. The results of the comparison are shown in Figure 11.

Figure 11 PCA with the clusters found by K-means clustering above and the true clusters of the annotated data below

5.3.3 Performance of K-means clustering

A confusion matrix is computed to compare the true labels of the annotated data with the labels predicted by K-means clustering. Information about how the annotated data has been assigned to the clusters found by K-means clustering helps to evaluate how reliably K-means clustering assigns all the data, both unannotated and annotated, into clusters.

K-means clustering has been able to identify the actual cluster 6 (jogging) almost perfectly, with accuracy near 100 %. The actual clusters 1 (sitting) and 3 (walking around and doing tasks) have also been recognized quite well, with 89 % and 84 % accuracies, although the latter has been split into two separate clusters. The actual cluster 0 (sleeping) has been confused with the actual clusters 1 (sitting) and 3 (walking around and doing tasks). The actual cluster 2 (sitting in a car) has been assigned to the same cluster as the actual cluster 1 (sitting), and the actual cluster 5 (taking a walk) has been assigned to the same cluster as the actual cluster 6 (jogging). In addition, no data instances of the annotated data have been assigned to one cluster found by K-means clustering.
This suggests an activity type that was not performed when collecting and annotating data for one person. The confusion matrix of K-means clustering is shown in Figure 12.

Figure 12 Confusion matrix of K-means clustering

In addition, ARI is computed to evaluate the similarity between the clusters found by K-means clustering and the true clusters given in the annotation. A value close to 0 means random labeling, and a value of 1 means that the clusters are identical. The value of ARI is 0.45, which shows that there is some similarity between the clusters found by K-means clustering and the true clusters. The clustering accuracy is 72 %.

It is shown that it is possible to recognize physical activities from unannotated acceleration data of the Axivity accelerometer device positioned on a thigh with K-means clustering with 72 % accuracy and ARI 0.45.

5.4 Training supervised classifiers

Next, the supervised machine learning methods KNN and RF are applied to the standardized features extracted from the filtered segments of the annotated data. The aim is to study whether a KNN model or an RF model trained on labeled data of one person can reliably predict labels and recognize activities from unannotated acceleration data of other persons.

5.4.1 KNN classification

The best value of the parameter k (the number of neighbors) is selected for KNN using a separate training set (three days) and test set (a new day) to avoid overfitting of the model because of possible dependencies within the same days. The best k value is 12, resulting in a C-index value of 0.95. Other KNN parameters, such as different distance and weight parameters, are also tested, but they do not improve the best result.

The final model is trained with both the training and the test set of the previous phase and evaluated with a test set of two new days. With k = 12, the C-index and accuracy are 0.93 and 88 %, respectively.
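The selection of k on a held-out day can be sketched with a minimal pure-Python nearest-neighbor classifier. This is an illustration under stated assumptions and not the implementation used in the thesis: the one-dimensional toy features, the labels and all function names are invented, and plain accuracy is used as the selection criterion in place of the C-index.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training instances nearest to the query."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of held-out instances whose predicted label equals the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def select_k(train_x, train_y, val_x, val_y, candidates):
    """Return the candidate k with the highest accuracy on the held-out day."""
    return max(candidates, key=lambda k: accuracy(train_x, train_y, val_x, val_y, k))


# Invented toy data: the training days contain one "walk" instance close to "sit".
train_x = [(0.0,), (0.2,), (0.4,), (1.0,), (3.0,), (3.2,), (3.4,)]
train_y = ["sit", "sit", "sit", "walk", "walk", "walk", "walk"]
# The held-out day is recorded separately from the training days.
val_x = [(0.9,), (3.1,)]
val_y = ["sit", "walk"]

best_k = select_k(train_x, train_y, val_x, val_y, candidates=[1, 3])
```

In this toy case k = 1 follows the single outlying instance and misclassifies the first held-out instance, whereas k = 3 classifies both correctly, so `best_k` is 3. Keeping the held-out instances on a separate day mirrors the motivation given above, namely avoiding dependencies between segments of the same day.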
A confusion matrix is computed to compare the actual labels with the labels predicted by KNN. KNN has been able to identify the actual labels 1 (sitting), 3 (walking around and doing tasks), and 5 (taking a walk) well, with 94 %, 93 % and 91 % accuracies. The actual label 0 (sleeping) has somewhat been confused with the actual label 1 (sitting), the actual label 2 (sitting in a car) with the actual label 1 (sitting), and the actual label 6 (jogging) with the actual label 5 (taking a walk). Sleeping, sitting in a car, and jogging have been recognized with 81 %, 78 % and 87 % accuracy, respectively. The confusion matrix is shown in Figure 13.

Figure 13 Confusion matrix of KNN model

The results show that the KNN classifier trained on annotated data of one person can predict activities from acceleration data of the same person with a C-index value of 0.93 and 88 % accuracy.

5.4.2 Random Forest classification

The parameter n_estimators (the number of trees in the forest) is set to 500, and RF is applied on the same training and test sets as when evaluating the final KNN model. The C-index of the RF model is 0.93 and the accuracy is 88 %, the same as with the KNN model.

A confusion matrix is also computed for RF. Like KNN, RF has been able to identify the actual labels 1 (sitting), 3 (walking around and doing tasks), and 5 (taking a walk) well, with 95 %, 92 % and 92 % accuracies. Moreover, as with KNN, the actual label 0 (sleeping) has somewhat been confused with the actual label 1 (sitting), the actual label 2 (sitting in a car) with the actual label 1 (sitting), and the actual label 6 (jogging) with the actual label 5 (taking a walk). Sleeping, sitting in a car, and jogging have been recognized with 79 %, 82 % and 87 % accuracy, respectively. The confusion matrix is shown in Figure 14.
Figure 14 Confusion matrix of RF model

The results show that the RF classifier trained on annotated data of one person has also learned to recognize activities from acceleration data of the same person, with a C-index value of 0.93 and 88 % accuracy.

5.4.3 The importance of the features

The importance of the features when training the RF model is calculated. This information would be useful in a feature selection phase that could be carried out to further improve the classification models. The most important and least important features are plotted in Figures 15 and 16.

The 4 most important features have been extracted from the values of the Y axis. They are Y Largest, Y Magnitudes mean, Y Root mean square and Y Mean. The next most important features are X Root mean square, Y Median, Y Power spectral density of frequency 7, X Magnitudes median, Y Power spectral density of frequency 6 and Z Median. The least important features are Y Peak widths sum, Y Magnitudes kurtosis, Y Dominant frequency, Z Dominant frequency, and X Dominant frequency.

Figure 15 The most important features when training RF model

Figure 16 The least important features when training RF model

5.4.4 Reliability of supervised classifiers

The reliability of the previously trained KNN and RF classifiers in recognizing activities from unannotated data of other persons is studied next.

5.4.4.1 Using activity levels as reference information

Because no true labels, and thus no ground truth, are available, the cut point analysis of the OMGUI software [18] is performed to obtain activity levels from the same data. The aim is to compare the activities predicted by the classifiers to the activity levels produced by the cut point analysis. The reliability of the models in predicting physical activities from unannotated acceleration data can be studied with this reference information.
5.4.4.1.1 New metric: fraction of predictions with correct activity levels

A new metric is introduced: the fraction of the predicted labels that correspond to correct activity levels out of all the predicted labels. The activity levels that each activity type should have are shown in Table 5 in Section 4.2. The metric is used to obtain information about the reliability of the classifiers when they predict labels and recognize activities from unannotated acceleration data. If a classifier predicts an activity type that can have the activity level predicted by the cut point analysis, the prediction is correct also according to this reference information.

The new metric is able to highlight the predictions whose activity type is inconsistent with the activity level that the cut point analysis has produced from the same data. For example, if sleeping or sitting has been predicted by the classifier and the activity level predicted by the cut point analysis is high or vigorous activity, the metric treats the prediction as false.

The metric cannot differentiate between predictions that share the same activity level. For example, if the prediction is sleeping and the true activity is sitting, the new metric treats the prediction as true because the activity could be sleeping also according to the cut point analysis.

Despite these limitations, the new metric can help to acquire information about the reliability of recognizing activities from unannotated acceleration data based on this additional source of information when no true labels, and thus no ground truth, are available.

5.4.4.2 Predictions from unannotated data

The previously trained KNN and RF classifiers are run to predict activities from unannotated data collected from 8 users of the Axivity device for one day each. The new metric, the fraction of the predictions that correspond to the correct activity levels in the cut point analysis, is then calculated for the supervised models. The results are 97 % for the KNN model and 98 % for the RF model.
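The new metric can be sketched in a few lines. The mapping from activity types to admissible activity levels below is a hypothetical placeholder and does not reproduce Table 5 of Section 4.2; only the requirement that jogging corresponds to vigorous activity is taken from the text. The function name is also an assumption.

```python
# Hypothetical placeholder for the mapping of Table 5 (activity type -> admissible
# activity levels of Table 8: 0 sedentary, 1 light, 2 moderate, 3 vigorous).
ALLOWED_LEVELS = {
    "sleeping": {0},
    "sitting": {0, 1},
    "taking a walk": {1, 2},
    "jogging": {3},
}


def level_agreement(predicted_activities, activity_levels, allowed_levels):
    """Fraction of predictions whose cut point level is admissible for the predicted activity."""
    if len(predicted_activities) != len(activity_levels):
        raise ValueError("one activity level is needed for every prediction")
    correct = sum(
        level in allowed_levels[activity]
        for activity, level in zip(predicted_activities, activity_levels)
    )
    return correct / len(predicted_activities)


# Invented example: four one-minute predictions and the levels of the cut point analysis.
predictions = ["sleeping", "sitting", "jogging", "jogging"]
levels = [0, 3, 3, 2]
score = level_agreement(predictions, levels, ALLOWED_LEVELS)
```

In this invented example the second prediction (sitting at a vigorous level) and the fourth prediction (jogging at a moderate level) are counted as false, so the score is 0.5. A prediction of sleeping at level 0 is counted as true even if the person was in fact sitting, which is the limitation described above.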
The KNN model predicts activities with the following results of the new metric: sleeping with 97 %, sitting, sitting in a car, and walking around and doing tasks with 100 %, taking a walk with 57 %, and jogging with 84 %. The RF model predicts sleeping with 97 %, sitting and sitting in a car with 100 %, walking around and doing tasks with 98 %, taking a walk with 75 %, and jogging with 95 %.

The results show that the KNN and RF classifiers trained with the small, labeled data of one person can make predictions from unannotated acceleration data of other persons that are correct also according to the activity levels predicted in the cut point analysis, with 97 % and 98 % “accuracy”. The predicted activities sleeping, sitting, sitting in a car, and walking around and doing tasks correspond well to the activity levels produced by the cut point analysis of the OMGUI software. Taking a walk has been predicted somewhat differently from the expected activity levels. The RF classifier has predicted jogging in good correspondence with the activity levels, while the KNN classifier has predicted jogging partly differently.

The activity levels of the cut point analysis of the OMGUI software and the predictions made by the KNN and RF classifiers are compared in Figure 17.

Figure 17 Comparing activity levels of the cut point analysis and predictions of the KNN and RF classifiers (panels: RF has predicted sleeping (0), sitting (1), sitting in a car (2), walking around and doing tasks (3), taking a walk (5), jogging (6))

5.4.4.3 Predictions from annotated data

The same metric is also computed for the supervised KNN and RF models when they predict activities from the test data, which includes only the labeled data of one person. The aim is to compare the results of the new metric to the results obtained when predicting activities from unannotated data of other persons. The results are near 100 % for both the KNN and the RF model.
The KNN model predicts activities with the following results: sleeping, sitting, sitting in a car, and walking around and doing tasks with 100 %, taking a walk with 91 %, and jogging with 95 % “accuracy”. The RF model predicts sleeping, sitting, sitting in a car, and walking around and doing tasks with 100 %, taking a walk with 94 %, and jogging with 96 %.

The results show that all the predicted activities correspond well to the activity levels of the cut point analysis of the OMGUI software when predictions are made from the data of the same person whose data has been used in training. It is also shown that the predictions correspond better to the activity levels than when making predictions from unannotated data of other persons.

The activity levels of the cut point analysis of the OMGUI software, the true labels, and the predictions made by the KNN and RF classifiers are compared in Figure 18.

Figure 18 Comparing activity levels of the cut point analysis, true labels, and predictions of the KNN and RF classifiers (panels: true label sleeping (0), sitting (1), sitting in a car (2), walking around and doing tasks (3), taking a walk (5), jogging (6))

5.5 Improving classifiers in semi-supervised setting

Next, the En-Co-Training method is used with the supervised KNN and RF classifiers that have been trained with the labeled data of one person to leverage knowledge from unannotated acceleration data of other users of the Axivity device. It is studied whether the classifiers can be improved to generalize better to unannotated data of other persons, i.e., whether the KNN and RF classifiers retrained in a semi-supervised setting can recognize activities more reliably than the initial supervised classifiers.

The initial KNN and RF classifiers first make predictions from the training data of 2 persons.
The predictions on which both classifiers agree and which have a matching activity level are accepted. In addition, if one of the models has predicted jogging and the activity level is vigorous activity, the prediction is considered confident enough and is accepted, because jogging should clearly correspond to this one activity level. The accepted predictions are added to the set of the true labels as pseudo-labels, and the corresponding data instances are added to the common training set of the models. The models are retrained, and new predictions are made from the training data of 2 new persons. This is iterated until predictions have been made from all the training data of 8 persons.

Instead of using majority voting of three classifiers as in [35], the KNN and RF classifiers are used separately to make predictions. In addition, unlike in [35], the activity levels produced by the cut point analysis of the OMGUI software are also considered, as explained previously, to increase the confidence of the accepted predictions. In this way, information can be leveraged from unannotated acceleration data, and activity levels can be used as an additional source of information to increase the confidence of the selected pseudo-labels in the semi-supervised setting.

5.5.1 Reliability of classifiers in semi-supervised setting

5.5.1.1 Predictions from unannotated data

The semi-supervised training of the initial KNN and RF classifiers is performed and evaluated three times on separate training and test sets of unannotated acceleration data. In each test round, one-day unannotated acceleration data collected from 8 individuals is used as the training data, and one-day unannotated acceleration data of 4 individuals as the test data.
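The acceptance rule for pseudo-labels described in Section 5.5 can be sketched as follows. The function names and the example mapping from activity types to admissible activity levels are hypothetical; the mapping does not reproduce Table 5 of Section 4.2. The sketch covers only the selection step of one iteration, not the retraining of the classifiers.

```python
VIGOROUS = 3  # label of vigorous activity in Table 8

# Hypothetical placeholder for the mapping of Table 5 (activity type -> admissible levels).
ALLOWED_LEVELS = {
    "sleeping": {0},
    "sitting": {0, 1},
    "taking a walk": {1, 2},
    "jogging": {VIGOROUS},
}


def accept_pseudo_label(knn_label, rf_label, level, allowed_levels):
    """Return the accepted pseudo-label for one segment, or None if it is rejected."""
    # Rule 1: both classifiers agree and the activity level matches the activity type.
    if knn_label == rf_label and level in allowed_levels[knn_label]:
        return knn_label
    # Rule 2: one classifier predicts jogging and the activity level is vigorous.
    if "jogging" in (knn_label, rf_label) and level == VIGOROUS:
        return "jogging"
    return None


def select_pseudo_labels(knn_labels, rf_labels, levels, allowed_levels):
    """Return (index, label) pairs of the segments accepted in one iteration."""
    accepted = []
    for index, (knn_label, rf_label, level) in enumerate(zip(knn_labels, rf_labels, levels)):
        label = accept_pseudo_label(knn_label, rf_label, level, allowed_levels)
        if label is not None:
            accepted.append((index, label))
    return accepted


# Invented predictions for four segments and their cut point activity levels.
knn = ["sitting", "jogging", "sleeping", "taking a walk"]
rf = ["sitting", "taking a walk", "sitting", "taking a walk"]
levels = [0, 3, 0, 0]
accepted = select_pseudo_labels(knn, rf, levels, ALLOWED_LEVELS)
```

In this invented example the first segment is accepted by consensus and the second by the jogging rule, the third is rejected because the classifiers disagree, and the fourth is rejected because the agreed activity does not match the activity level. In the procedure described above, the accepted segments would be added to the common training set before the classifiers are retrained and applied to the data of the next 2 persons.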
Similar to the previous evaluation, the new metric, the fraction of the predictions that correspond to the correct activity levels in the cut point analysis of the OMGUI software, is calculated to evaluate the reliability of the models. In addition, the number of predictions that correspond to correct activity levels is calculated. The results of both the initial and the retrained classifiers are shown in Table 9. If the number of correct predictions shown in parentheses has improved compared to the initial classifier, the result is marked with an asterisk (*). In test round 3, no jogging was performed by the persons during the selected days. Therefore, the last line in Table 9 is excluded from the comparison of the results.

Table 9 A fraction and the number of predictions that correspond to correct activity levels of the cut point analysis

Predicted activity              Initial KNN       Retrained KNN     Initial RF        Retrained RF
                                classifier        classifier        classifier        classifier

Test 1
all activities                  97 % (32264)      98 % (32597)*     98 % (32597)      99 % (32929)*
Sleeping                        99 % (12123)      99 % (13612)*     98 % (11914)      99 % (12868)*
Sitting                         100 % (13060)     100 % (11751)     100 % (12503)     100 % (12189)
Sitting in a car                100 % (183)       100 % (224)*      100 % (180)       100 % (177)
Walking around and doing tasks  100 % (5839)      100 % (5883)*     98 % (6842)       100 % (6232)
Taking a walk                   58 % (905)        67 % (648)        81 % (510)        77 % (627)*
Jogging                         96 % (360)        97 % (668)*       95 % (770)        95 % (810)*

Test 2
all activities                  96 % (33074)      99 % (34108)*     97 % (33419)      98 % (33764)*
Sleeping                        95 % (12179)      98 % (13215)*     96 % (13157)      95 % (12981)
Sitting                         100 % (13350)     100 % (11861)     100 % (10854)     100 % (11677)*
Sitting in a car                100 % (1632)      100 % (1760)*     100 % (2429)      100 % (2040)
Walking around and doing tasks  100 % (5081)      100 % (6324)*     97 % (6564)       100 % (5890)
Taking a walk                   55 % (836)        77 % (358)        63 % (219)        81 % (516)*
Jogging                         0 % (0)           98 % (548)*       95 % (334)        98 % (535)*

Test 3
all activities                  98 % (33043)      98 % (33043)      98 % (33043)      98 % (33043)
Sleeping                        98 % (12520)      98 % (12114)      97 % (12540)      98 % (12814)*
Sitting                         100 % (11809)     100 % (11566)     100 % (11998)     100 % (11285)
Sitting in a car                100 % (804)       100 % (839)*      100 % (1013)      100 % (886)
Walking around and doing tasks  100 % (7326)      100 % (8341)*     99 % (7324)       100 % (7771)*
Taking a walk                   81 % (799)        73 % (428)        79 % (291)        75 % (496)*
Jogging                         0 % (0)           8 % (2)           23 % (3)          5 % (2)

The overall results are 96-98 % for the initial classifiers and 98-99 % for the retrained classifiers. Sleeping, sitting, sitting in a car, and walking around and doing tasks have been predicted with at least 95 % by all the initial and retrained classifiers. The results of taking a walk predicted by the KNN classifiers have changed from 55-81 % to 67-77 % and those of jogging from 0-96 % to 97-98 %. The results of the RF classifiers when predicting taking a walk and jogging have changed from 63-81 % to 77-81 % and from 95 % to 95-98 %, respectively. The numbers of correctly predicted taking a walk and jogging have either stayed the same or improved with all the retrained RF classifiers. The number of correctly predicted jogging has always improved with the retrained KNN classifiers.

It can be concluded that the semi-supervised setting, using the En-Co-Training method and the KNN and RF classifiers trained with only small, annotated data of one person, can improve the initial supervised KNN and RF classifiers. In addition, the retrained classifiers can predict the activities taking a walk and jogging more reliably from acceleration data of other persons.

5.5.1.2 Predictions from annotated data

To evaluate the reliability of the KNN and RF classifiers retrained in the 3 test rounds in the semi-supervised setting, they are also tested on the same annotated test data of one person that has been used when evaluating the initial supervised classifiers. The new metric, the fraction of the predictions that correspond to the correct activity levels in the cut point analysis, is first calculated.
As with the initial supervised classifiers, the overall results are near 100 % for both the KNN and the RF models. Also, the results of all the activities are between 90 % and 100 %, as with the initial KNN and RF classifiers tested earlier.

The C-index of the retrained KNN and RF classifiers is also calculated. The C-index values are 0.94, 0.93 and 0.93 in test rounds 1, 2 and 3, respectively. The C-index value has either stayed the same or improved compared to the C-index value 0.93 of the initial classifiers. The accuracies have also stayed the same in all the test rounds, i.e., 88 % for both classifiers.

The confusion matrices of the retrained KNN and RF classifiers in test round 1 are plotted in Figures 19 and 20. The results are very similar to the results of the initial supervised classifiers tested on the same data.

Figure 19 Confusion matrix of retrained KNN model

Figure 20 Confusion matrix of retrained RF model

The results show that the KNN and RF classifiers retrained in a semi-supervised setting can recognize activities as well as the initial KNN and RF classifiers when measured both with the new metric and with the C-index value and accuracy. Retraining the KNN and RF classifiers in a semi-supervised setting has not decreased their performance on the annotated test data.

6 Discussion

This thesis introduces a solution to extract physical activities from unannotated acceleration data collected with an Axivity device positioned on a thigh using traditional unsupervised, supervised, and semi-supervised machine learning methods. It is shown to be beneficial to collect and label new acceleration data, even for only one person, to use the labeled data to develop supervised KNN and RF classifiers, and to retrain them in a semi-supervised setting using the En-Co-Training method.
A new metric is proposed: the fraction of the labels that correspond to correct activity levels out of all the predicted labels according to the cut point analysis of the OMGUI software [18]. The reliability of the classifiers is shown to improve consistently when comparing the retrained KNN and RF classifiers to the initial ones with the new metric.

Although deep learning methods have outperformed traditional machine learning methods in HAR, most of them are supervised methods that require an extensive amount of labeled training data and are not feasible solutions in this thesis, where only unlabeled acceleration data of 12 people is available. Deep unsupervised and semi-supervised methods are not suitable either because of the small amount of available data. The traditional KNN and RF methods are easy to implement, and they have been successful in HAR, outperforming methods like SVM and Naïve Bayes [28,53]. Also, the En-Co-Training method has performed well in HAR [35].

Furthermore, traditional machine learning methods are more competitive with deep learning methods when the objective is to recognize basic physical activities such as sleeping, sitting, sitting in a car, walking around and doing tasks, taking a walk and jogging, which are used as activity types in this thesis. Physical activity of people can be interpreted with these basic activities common in daily life. They can also easily be compared with activity levels, the reference information that can be obtained with the cut point analysis of the OMGUI software. However, if very different activities have been performed by other persons, the classifiers might be inaccurate. Performing and labeling more different activity types and adding them to the training data could improve the reliability of the supervised classifiers.

The classifiers retrained in the semi-supervised setting could be further improved by adding more iterations and more unannotated acceleration data to the training data of the En-Co-Training method.
In that way, more examples of activities performed by other people would be added. Adding more unannotated data would also increase the risk of choosing wrong predictions as pseudo-labels, and it might decrease the performance compared to the initial supervised classifiers. To mitigate this risk, the reliability of the retrained classifiers should be compared to that of the supervised classifiers after retraining.

Also, the reliability of the classifiers could be improved by using data collected from another device, for example a smartwatch positioned on a wrist, in addition to the Axivity accelerometer positioned on a thigh. It would then be possible to better separate stationary activities like sleeping and sitting, where the position of the thigh may be almost identical, or taking a walk from walking around and doing tasks. This would require collecting and annotating new acceleration data using both devices, training supervised classifiers on the new training data, and retraining them in the semi-supervised setting.

7 Conclusion

In this thesis, the objective was to develop a machine learning solution that can recognize physical activities from unannotated acceleration data collected with an Axivity accelerometer positioned on a thigh. The solution was tested on real-life acceleration data collected from 12 people.

It is a challenge in HAR to annotate acceleration data, and the existing approaches in HAR mostly use supervised machine learning methods that require true labels. It was studied whether different activities can reliably be extracted from unannotated acceleration data using only unsupervised machine learning methods. Furthermore, it was examined whether small, labeled data collected from one person can be utilized with supervised and semi-supervised machine learning methods so that they can recognize activities reliably. In addition, it was studied how to obtain information about the reliability of the machine learning methods used when true labels, and thus the ground truth, are not known.
After a brief introduction to HAR using wearable devices and machine learning, the characteristics and challenges of HAR as well as the machine learning methods and evaluation metrics commonly used in HAR were presented in Chapter 2. In Chapter 3, the current state of research in sensor based HAR was studied, and works that have successfully used supervised, unsupervised, and semi-supervised machine learning methods were introduced. The works were summarized by the sensors, methods and evaluation metrics used and by the physical activities recognized. Also, open questions in HAR and new promising research areas that aim at utilizing continuously streaming unlabeled acceleration data were introduced.

Machine learning solutions were developed for recognizing physical activities from unlabeled acceleration data collected with an Axivity accelerometer, and they were described in Chapter 4. First, new acceleration data was collected with the Axivity device positioned on a thigh and annotated for one person. The unsupervised machine learning method K-means clustering was then applied to the preprocessed data, which included both one-day unannotated data of 8 individuals and new, labeled acceleration data of one person collected for 4 days. The reliability of K-means clustering in finding correct clusters related to the performed physical activities was evaluated by studying whether the model had been able to assign the true labels to correct clusters.

Second, supervised machine learning classifiers were trained on the labeled acceleration data of one person collected for 4 days. The KNN and RF classifiers were first used to predict activities from labeled data collected from the same person for 3 separate days to evaluate their performance with the known true labels.
Then, the classifiers were used to automatically annotate one-day unlabeled acceleration data of 8 individuals to study whether the supervised classifiers, which were trained on the labeled data of one person, could reliably recognize activities from unlabeled acceleration data.

Third, the En-Co-Training method was used to retrain the supervised KNN and RF classifiers in a semi-supervised setting with the training data of 8 persons and the test data of 4 persons, collected for one day each, in three test rounds. The activity levels produced by the cut point analysis of the OMGUI software [18] were also used as additional information when choosing confident pseudo-labels. The retrained classifiers were used to automatically annotate unlabeled acceleration data to study whether the semi-supervised setting helped the classifiers to predict physical activities more reliably.

In the experiments in Chapter 5, it was shown that the unsupervised K-means clustering could recognize physical activities from data including both unannotated and annotated acceleration data with the ARI value 0.45 and 72 % clustering accuracy. Although K-means clustering recognized jogging almost perfectly, the stationary activities sleeping and sitting were confused, sitting in a car was assigned to the same cluster as sitting, and taking a walk was assigned to the same cluster as jogging. Also, there seemed to be an activity type that had not been performed when collecting annotated data.

Both the supervised KNN and RF classifiers trained on the labeled data of one person could recognize activities from the data of the same person with the C-index 0.93 and 88 % accuracy. The activities sitting, walking around and doing tasks, and taking a walk were recognized well, but the stationary activities sleeping and sitting in a car had somewhat been confused with sitting, and jogging was partly misclassified as taking a walk.
The more important question was how reliably the supervised KNN and RF classifiers could recognize activities from the unlabeled data of other persons. A new metric was proposed for this purpose: the fraction of all predictions whose activity level is correct according to the cut point analysis of the OMGUI software run on the same unlabeled acceleration data. The metric could only reveal cases in which a classifier predicted an activity whose activity level differed from the one given by the cut point analysis. Even so, this was valuable information when no true labels, and therefore no ground truth, were available.

The new metric was calculated both for the initial supervised KNN and RF classifiers and for the classifiers retrained in the semi-supervised setting. For the initial supervised classifiers, the overall results on the unlabeled data of other users were 96-98 %. The results were 95-100 % for all other activities, but 55-81 % for taking a walk and 0-95 % for jogging. For the classifiers retrained in the semi-supervised setting, the overall results were 98-99 %. The results by activity type were 95-100 % for all other activities, 67-81 % for taking a walk, and 95-98 % for jogging. In addition, with the retrained RF classifier the number of correctly predicted instances of taking a walk and jogging either stayed the same or improved in all the test rounds. The semi-supervised setting was thus shown to improve the reliability with which the classifiers predict activities whose activity level is also correct according to the cut point analysis of the OMGUI software.

It could be concluded that when only unlabeled acceleration data and the unsupervised K-means clustering method were used, the reliability of recognizing activities remained quite modest. The model had difficulty separating stationary activities, and it could not differentiate sitting in a car from sitting or taking a walk from jogging.
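
The agreement metric used for the unlabeled data can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name, the level names, and the mapping from activity types to activity levels are assumptions made for the example.

```python
# Hypothetical mapping from predicted activity type to the activity level it
# should have in the cut point analysis; names and assignments are assumptions.
EXPECTED_LEVEL = {
    "sleeping": "sedentary",
    "sitting": "sedentary",
    "sitting in a car": "sedentary",
    "walking around and doing tasks": "light",
    "taking a walk": "moderate",
    "jogging": "vigorous",
}


def level_agreement(predicted_activities, cut_point_levels):
    """Fraction of predictions whose expected level equals the cut point level."""
    if len(predicted_activities) != len(cut_point_levels):
        raise ValueError("inputs must have the same length")
    if not predicted_activities:
        raise ValueError("at least one prediction is required")
    agreeing = sum(
        EXPECTED_LEVEL[activity] == level
        for activity, level in zip(predicted_activities, cut_point_levels)
    )
    return agreeing / len(predicted_activities)


# Toy example: three of the four predictions agree with the cut point levels.
predictions = ["sitting", "sleeping", "jogging", "taking a walk"]
levels = ["sedentary", "sedentary", "vigorous", "light"]
score = level_agreement(predictions, levels)
print(f"{score:.0%}")
```

The example also shows the limitation of the metric: a window predicted as sleeping counts as correct whenever the cut point level is sedentary, even if the person was actually sitting, because both activities share the same activity level.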
It was beneficial to collect and label new acceleration data, even for only a single person. The labeled data could be used to train supervised KNN or RF classifiers to recognize activities from the unannotated acceleration data of other users. Furthermore, the reliability of the KNN and RF classifiers could be consistently improved by retraining them in a semi-supervised setting with the En-Co-Training method, which starts from the initial supervised KNN and RF classifiers and leverages knowledge from unannotated data. In addition, the reference information from the cut point analysis of the OMGUI software could be used to further reduce the risk of choosing wrong pseudo-labels.

The new metric, the fraction of all predictions that correspond to the activity levels also predicted by the cut point analysis of the OMGUI software, made it possible to obtain information about the reliability of the supervised and semi-supervised classifiers. The metric could not separate activity types that share the same activity level, but it made it possible to evaluate how reliably a classifier had predicted activities whose activity levels were also correct according to the cut point analysis run on the same data.

References

[1] Sheng Taoran: Learning Embeddings for Wearable-based Human Activity Analysis. University of Texas Arlington Theses and Dissertations (library), 2020.
[2] Sreenivasan Ramasamy Ramamurthy, Nirmalya Roy: Recent Trends in Machine Learning for Human Activity Recognition - A Survey. WIREs Data Mining and Knowledge Discovery 8(4), 2018.
[3] Alireza Abedin, Farbod Motlagh, Qinfeng Shi, Damith Rezatofighi, Chinthana Ranasinghe: Towards deep clustering of human activities from wearables. ISWC '20: Proceedings of the 2020 International Symposium on Wearable Computers:1-6, 2020.
[4] Nauman Ahad, Mark A. Davenport: Semi-supervised sequence classification through change point detection. ArXiv Computer Science, Machine Learning, 2020.
[5] Mohammad Sabik Irbaz, Abir Azad, Tanjila Alam Sathi, Lutfun Nahar Lota: Nurse Care Activity Recognition Based on Machine Learning Techniques Using Accelerometer Data. UbiComp-ISWC '20: Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers:402-407, 2020.
[6] D. Jakhar, I. Kaur: Artificial intelligence, machine learning and deep learning: definitions and differences. Clinical and Experimental Dermatology 45(1):131-132, 2020.
[7] Sunita Kumari Chaurasia, S.R.N Reddy: AI Assisted Human Activity Recognition (HAR). International Journal of Engineering and Advanced Technology (IJEAT) 8(6), ISSN 2249-8958, 2019.
[8] Yongjin Kwon, Kyuchang Kang, Changseok Bae: Unsupervised learning for human activity recognition using smartphone sensors. Expert Systems with Applications 41(14):6067-6074, 2014.
[9] M.R. Berthold, C. Borgelt, F. Höppner, F. Klawonn: Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data. Springer, London, 2010.
[10] Marcio de Almeida Mendes, Inacio da Silva, Virgilio Ramires, Felipe Reichert, Rafaela Martins, Rodrigo Ferreira, Elaine Tomasi: Metabolic equivalent of task (METs) thresholds as an indicator of physical activity intensity. PLoS ONE, 2013.
[11] https://axivity.com/downloads/ax3
[12] Uday Shankar Shanthamallu, Andreas Spanias, Cihan Tepedelenlioglu, Mike Stanley: A brief survey of machine learning methods and their sensor and IoT applications. 2017 8th International Conference on Information, Intelligence, Systems & Applications (IISA), 2018.
[13] Fei Hu, Qi Hao: Intelligent Sensor Networks, The Integration of Sensor Networks, Signal Processing and Machine Learning. CRC Press, 2012.
[14] Jürgen Schmidhuber: Deep learning in neural networks: An overview. Neural Networks 61:85-117, 2015.
[15] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, Lisha Hu: Deep learning for sensor-based activity recognition: A survey. Pattern Recognition Letters 119:3-11, 2019.
[16] Oscar D. Lara, Miguel A. Labrador: A Survey on Human Activity Recognition using Wearable Sensors. IEEE Communications Surveys & Tutorials 15(3):1192-1209, 2013.
[17] Dale Esliger, Alex Rowlands, Tina Hurst, Michael Catt, Peter Murray, Roger Eston: Validation of the GENEA accelerometer. Medicine and Science in Sports and Exercise 43(6):1085-1093, 2010.
[18] AX3 GUI · digitalinteraction/openmovement Wiki · GitHub
[19] Oresti Banos, Juan-Manuel Galvez, Miguel Damas, Hector Pomares, Ignacio Rojas: Window Size Impact in Human Activity Recognition. Sensors 14(4):6474-6499, 2014.
[20] Prajoy Podder, Mehedi Hasan, Rafiqul Islam, Mursalin Sayeed: Design and Implementation of Butterworth, Chebyshev-I and Elliptic Filter for Speech Signal Analysis. International Journal of Computer Applications 98(7):12-18, 2014.
[21] E. O. Brigham, R. E. Morrow: The fast Fourier transform. IEEE Spectrum 4(12):63-70, 1967.
[22] Zhenyu He, Lianwen Jin: Activity Recognition from acceleration data Based on Discrete Consine Transform and SVM. 2009 IEEE International Conference on Systems, Man and Cybernetics, 2009.
[23] Zhenyu He: Activity Recognition from Accelerometer Signals Based on Wavelet-AR Model. 2010 IEEE International Conference on Progress in Informatics and Computing, 2010.
[24] Pekka Siirtola, Juha Röning: Incremental Learning to Personalize Human Activity Recognition Models: The Importance of Human AI Collaboration. Sensors 19(23), 2019.
[25] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge Luis Reyes-Ortiz: A public domain dataset for human activity recognition using smartphones. ESANN 2013 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2013.
[26] Thomas Stiefmeier, Daniel Roggen, Georg Ogris, Paul Lukowicz, Gerhard Tröster: Wearable activity tracking in car manufacturing. IEEE Pervasive Computing 7(2):42-50, 2008.
[27] Oresti Banos, Rafael Garcia, Juan A Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, Claudia Villalonga: A novel Framework for Agile Development of Mobile Health Applications. Lecture Notes in Computer Science 8868:91-98, 2014.
[28] Attila Reiss, Didier Stricker: Introducing a new benchmarked dataset for activity Monitoring. 2012 16th International Symposium on Wearable Computers:108-109, 2012.
[29] Oresti Baños, Miguel Damas, Ignacio Rojas, Máté Attila Tóth, Oliver Amft: A benchmark dataset to evaluate sensor displacement in activity recognition. UbiComp '12: Proceedings of the 2012 ACM Conference on Ubiquitous Computing:1026-1035, 2012.
[30] D. Anguita, A. Ghio, L. Oneto, X. Parra, Jorge Luis Reyes-Ortiz: A public domain dataset for human activity recognition using smartphones. ESANN 2013 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2013.
[31] Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep learning. MIT Press, 2016.
[32] Lu Bai, Chris Yeung, Christos Efstratiou, Moyra Chikomo: Motion2vector: unsupervised learning in human activity recognition using wrist-sensing data. UbiComp/ISWC ’19 Adjunct: Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers:537-542, 2019.
[33] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, Mads Møller Jensen: Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems:127-140, 2015.
[34] Matthias Schmid, Marvin N. Wright, Andreas Ziegler: On the use of Harrell’s C for clinical risk prediction via random survival forests. Expert Systems with Applications 63:450-459, 2016.
[35] Brent Longstaff, Sasank Reddy, Deborah Estrin: Improving activity classification for health applications on mobile devices using active and semi-supervised learning. 2010 4th International Conference on Pervasive Computing Technologies for Healthcare:1-7, 2010.
[36] Yonggang Lu, Ye Wei, Li Liu, Jun Zhong, Letian Sun, Ye Liu: Towards unsupervised physical activity recognition using smartphone accelerometers. Multimedia Tools and Applications 76(8):10701-10719, 2017.
[37] Jennifer R. Kwapisz, Gary M. Weiss, Samuel A. Moore: Activity Recognition using Cell Phone Accelerometers. ACM SigKDD Explorations Newsletter 12(2):74-82, 2011.
[38] Nabil Alshurafa, Wenyao Xu, Jason J. Liu, Ming-Chun Huang, Bobak Mortazavi, Christian K. Roberts, Majid Sarrafzadeh: Designing a Robust Activity Recognition Framework for Health and Exergaming Using Wearable Sensors. IEEE Journal of Biomedical and Health Informatics 18(5), 2014.
[39] Feng Siwei: Sparsity in Machine Learning: An Information Selecting Perspective. Doctoral Dissertations, 2019.
[40] Ming Zeng, Tong Yu, Xiao Wang, Le T Nguyen, Ole J Mengshoel, Ian Lane: Semi-supervised convolutional neural networks for human activity recognition. 2017 IEEE International Conference on Big Data (Big Data), 2017.
[41] Maja Stikic, Kristof Van Laerhoven, Bernt Schiele: Exploring semi-supervised and active learning for activity recognition. 2008 12th IEEE International Symposium on Wearable Computers:81-88, 2008.
[42] Beth Logan, Jennifer Healey, Matthai Philipose, Emmanuel Munguia Tapia, Stephen Intille: A Long-Term Evaluation of Sensing Modalities for Activity Recognition. Ubicomp 2007: Ubiquitous Computing:483-500, 2007.
[43] Francisco Javier Ordóñez, Daniel Roggen: Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16(1):115, 2016.
[44] Shaojie Bai, J. Zico Kolter, Vladlen Koltun: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. Cornell University Computer Science Machine Learning, 2018.
[45] Kilian Forster, Daniel Roggen, Gerhard Troster: Unsupervised classifier self-calibration through repeated context occurences: Is there robustness against sensor displacement to gain? 2009 International Symposium on Wearable Computers:77-84, 2009.
[46] Jennifer R. Kwapisz, Gary M. Weiss, Samuel A. Moore: Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter 12(2):74-82, 2011.
[47] Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, Tapani Raiko: Semi-supervised learning with ladder networks. arXiv, Computer Science, Neural and Evolutionary Computing, 2015.
[48] Luciana C. Jatoba, Ulrich Grossmann, Christophe Kunze, Jorg Ottenbacher, Wilhelm Stork: Context-aware mobile health monitoring: Evaluation of different pattern recognition methods for classification of physical activity. 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society:5250-5253, 2008.
[49] Uwe Maurer, Asim Smailagic, Daniel P. Siewiorek, Michael Deisher: Activity recognition and monitoring using multiple sensors on different body positions. International Workshop on Wearable and Implantable Body Sensor Networks:4-116, 2006.
[50] Ling Bao, Stephen S. Intille: Activity recognition from user-annotated acceleration data. Pervasive Computing:1-17, 2004.
[51] Nishkam Ravi, Nikhil Dandekar, Preetham Mysore, Michael L. Littman: Activity recognition from accelerometer data. IAAI'05: Proceedings of the 17th conference on Innovative applications of artificial intelligence 3:1541-1546, 2005.
[52] Illapha Cuba Gyllensten, Alberto G. Bonomi: Identifying Types of Physical Activity with a Single Accelerometer: Evaluating Laboratory-trained Algorithms in Daily Life. IEEE Transactions on Biomedical Engineering 58(9):2656-2663, 2011.
[53] Attal Ferhat, Mohammed Samer, Dedabrishvili Mariam, Chamroukhi Faicel, Oukhellou Latifa, Amirat Yacine: Physical Human Activity Recognition Using Wearable Sensors. Sensors 15(12):31314-31338, 2015.
[54] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Forster, Gerhard Troster, Paul Lukowicz, David Bannach, Gerald Pirkl, Alois Ferscha, Jakob Doppler, Clemens Holzmann, Marc Kurz, Gerald Holl, Ricardo Chavarriaga, Hesam Sagha, Hamidreza Bayati, Marco Creatura, Jose del R. Millan: Collecting complex activity data sets in highly rich networked sensor environments. 2010 Seventh International Conference on Networked Sensing Systems (INSS):233-240, 2010.
[55] Davide Buffelli, Fabio Vandin: Attention-Based Deep Learning Framework for Human Activity Recognition with User Adaptation. arXiv, Computer Science, Machine Learning, 2020.
[56] Mi Zhang, Alexander A. Sawchuk: Usc-had: a daily activity dataset for ubiquitous activity recognition using wearable sensors. Proceedings of the 2012 ACM Conference on Ubiquitous Computing:1036-1042, 2012.
[57] Jue Wang, Zhibin Huang, Huanyuan Xu, Zilu Kang: Clustering Analysis of Human Behavior Based on Mobile Phone Sensor Data. Proceedings of the 2018 10th International Conference on Machine Learning and Computing:64-68, 2018.
[58] Zahraa Said Abdallah, Mohamed Medhat Gaber, Bala Srinivasan, Shonali Priyadarsini Krishnaswamy: Activity Recognition with Evolving Data Streams: A Review. ACM Computing Surveys 51(4), 2018.