Forecasting players in sports has grown in popularity due to the potential for a tactical advantage and the applicability of such research to multi-agent interaction systems. Team sports contain a significant social component that influences interactions between teammates and opponents. However, it still needs to be fully exploited. In this work, we hypothesize that each participant has a specific function in each action and that role-based interaction is critical for predicting players’ future moves. We create RolFor, a novel end-to-end model for Role-based Forecasting. RolFor uses a new module we developed called Ordering Neural Networks (OrderNN) to permute the order of the players such that each player is assigned to a latent role. The latent role is then modeled with a RoleGCN. Thanks to its graph representation, it provides a fully learnable adjacency matrix that captures the relationships between roles and is subsequently used to forecast the players’ future trajectories. Extensive experiments on a challenging NBA basketball dataset back up the importance of roles and justify our goal of modeling them using optimizable models. When an oracle provides roles, the proposed RolFor compares favorably to the current state-of-the-art (it ranks first in terms of ADE and second in terms of FDE errors). However, training the end-to-end RolFor incurs the issues of differentiability of permutation methods, which we experimentally review. Finally, this work restates differentiable ranking as a difficult open problem and its great potential in conjunction with graph-based interaction models.
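The ADE and FDE errors mentioned above are the standard average and final displacement errors for trajectory forecasting. The following is a minimal sketch of how they are commonly computed, not the RolFor evaluation code; the function name `ade_fde` and the array layout (players, timesteps, 2) are assumptions made for illustration.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors.

    pred, gt: arrays of shape (players, timesteps, 2) holding x/y positions
    (this layout is an assumption of the sketch).
    """
    # Euclidean distance between predicted and true position at every step
    dist = np.linalg.norm(pred - gt, axis=-1)   # shape (players, timesteps)
    ade = dist.mean()                            # average over players and time
    fde = dist[:, -1].mean()                     # last predicted step only
    return float(ade), float(fde)

# Toy example: the prediction drifts 1, 2, 3 units along x over three steps
gt = np.zeros((2, 3, 2))
pred = gt.copy()
pred[..., 0] += np.array([1.0, 2.0, 3.0])
ade, fde = ade_fde(pred, gt)
print(ade, fde)
```

In this toy case the per-step errors are 1, 2 and 3 for both players, so FDE is larger than ADE, as is typical when errors grow with the forecasting horizon.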
Promptly identifying procedural errors from egocentric videos in an online setting is highly challenging and valuable for detecting mistakes as soon as they happen. This capability has a wide range of applications across various fields, such as manufacturing and healthcare. The nature of procedural mistakes is open-set since novel types of failures might occur, which calls for one-class classifiers trained on correctly executed procedures. However, no technique can currently detect open-set procedural mistakes online. We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos. PREGO is based on an online action recognition component to model the current action, and a symbolic reasoning module to predict the next actions. Mistake detection is performed by comparing the recognized current action with the expected future one. We evaluate PREGO on two procedural egocentric video datasets, Assembly101 and Epic-tent, which we adapt for online benchmarking of procedural mistake detection to establish suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, respectively.
Anomalies are rare and anomaly detection is often therefore framed as One-Class Classification (OCC), i.e. trained solely on normalcy. Leading OCC techniques constrain the latent representations of normal motions to limited volumes and detect as abnormal anything outside, which accounts satisfactorily for the openset’ness of anomalies. But normalcy shares the same openset’ness property, since humans can perform the same action in several ways, which the leading techniques neglect. We propose a novel generative model for video anomaly detection (VAD), which assumes that both normality and abnormality are multimodal. We consider skeletal representations and leverage state-of-the-art diffusion probabilistic models to generate multimodal future human poses. We contribute a novel conditioning on the past motion of people, and exploit the improved mode coverage capabilities of diffusion processes to generate different-but-plausible future motions. Upon the statistical aggregation of future modes, anomaly is detected when the generated set of motions is not pertinent to the actual future. We validate our model on 4 established benchmarks: UBnormal, HR-UBnormal, HR-STC, and HR-Avenue, with extensive experiments surpassing state-of-the-art results.
Self-paced learning has been beneficial for tasks where some initial knowledge is available, such as weakly supervised learning and domain adaptation, to select and order the training sample sequence, from easy to complex. However, its applicability remains unexplored in unsupervised learning, whereby the knowledge of the task matures during training. We propose a novel HYperbolic Self-Paced model (HYSP) for learning skeleton-based action representations. HYSP adopts self-supervision: it uses data augmentations to generate two views of the same sample, and it learns by matching one (named online) to the other (the target). We propose to use hyperbolic uncertainty to determine the algorithmic learning pace, under the assumption that less uncertain samples should drive the training more strongly, with a larger weight and pace. Hyperbolic uncertainty is a by-product of the adopted hyperbolic neural networks: it matures during training and comes at no extra cost compared to the established Euclidean SSL framework counterparts. When tested on three established skeleton-based action recognition datasets, HYSP outperforms the state-of-the-art on PKU-MMD I, as well as on 2 out of 3 downstream tasks on NTU-60 and NTU-120. Additionally, HYSP only uses positive pairs and therefore bypasses the complex and computationally demanding mining procedures required for the negatives in contrastive techniques. Code is available at this https URL.
The task of collaborative human pose forecasting stands for predicting the future poses of multiple interacting people, given those in previous frames. Predicting two people in interaction, instead of each separately, promises better performance, due to their body-body motion correlations. But the task has remained so far primarily unexplored. In this paper, we review the progress in human pose forecasting and provide an in-depth assessment of the single-person practices that perform best for 2-body collaborative motion forecasting. Our study confirms the positive impact of frequency input representations, space-time separable and fully-learnable interaction adjacencies for the encoding GCN and FC decoding. Other single-person practices do not transfer to 2-body, so the proposed best ones do not include hierarchical body modeling or attention-based interaction encoding. We further contribute a novel initialization procedure for the 2-body spatial interaction parameters of the encoder, which benefits performance and stability. Altogether, our proposed 2-body pose forecasting best practices yield a performance improvement of 21.9% over the state-of-the-art on the most recent ExPI dataset, whereby the novel initialization accounts for 3.5%. See our project page at https://www.pinlab.org/bestpractices2body.
The progress in modelling time series and, more generally, sequences of structured data has recently revamped research in anomaly detection. The task consists of identifying abnormal behaviours in financial series, IT systems, aerospace measurements, and the medical domain, where anomaly detection may aid in isolating cases of depression and in attending to the elderly. Anomaly detection in time series is a complex task since anomalies are rare, temporal correlations are highly non-linear, and the definition of anomalous is sometimes subjective. Here we propose the novel use of Hyperbolic uncertainty for Anomaly Detection (HypAD). HypAD learns in a self-supervised manner to reconstruct the input signal. We adopt best practices from the state-of-the-art to encode the sequence by an LSTM, jointly learnt with a decoder to reconstruct the signal, with the aid of GAN critics. Uncertainty is estimated end-to-end by means of a hyperbolic neural network. By using uncertainty, HypAD may assess whether it is certain about the input signal but fails to reconstruct it because the signal is anomalous, or whether the reconstruction error does not necessarily imply an anomaly because the model is uncertain, e.g. for a complex but regular input signal. The novel key idea is that a detectable anomaly is one where the model is certain but predicts wrongly. HypAD outperforms the current state-of-the-art for univariate anomaly detection on established benchmarks based on data from NASA, Yahoo, Numenta, Amazon, and Twitter. It also yields state-of-the-art performance on a multivariate dataset of anomalous activities in elderly home residences, and it outperforms the baseline on SWaT. Overall, HypAD yields the lowest false alarms at the best performance rate, thanks to successfully identifying detectable anomalies.
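The "certain but wrong" principle above can be illustrated with a small sketch. This is not the HypAD scoring function: it assumes, purely for illustration, that certainty is read off as the norm of an embedding inside the unit Poincaré ball and that the score is the product of certainty and reconstruction error. The name `detectable_anomaly_score` and the toy values are hypothetical.

```python
import numpy as np

def detectable_anomaly_score(recon_error, embedding_norm):
    """Weight the reconstruction error by the model's certainty.

    Assumption of this sketch: the embedding norm in the unit Poincaré ball
    (a value in [0, 1)) acts as a certainty proxy, so a high error only
    counts as an anomaly when the model is also certain.
    """
    certainty = np.clip(embedding_norm, 0.0, 1.0)
    return recon_error * certainty

# Three toy windows: certain and wrong, uncertain and wrong, certain and right
recon_error = np.array([0.90, 0.90, 0.05])
embedding_norm = np.array([0.95, 0.10, 0.95])
scores = detectable_anomaly_score(recon_error, embedding_norm)
print(scores)
```

Only the first window, where the error is high and the model is certain, receives a high score; the second window has the same error but is down-weighted because the model is uncertain.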
Transformer Networks have established themselves as the de-facto state-of-the-art for trajectory forecasting, but there is currently no systematic study of their capability to model the motion patterns of people without interactions with other individuals or the social context. There is abundant literature on LSTMs, CNNs and GANs on this subject. However, methods adopting Transformer techniques achieve great performance through complex models, and a clear analysis of their adoption as plain sequence models is missing. This paper proposes the first in-depth study of Transformer Networks (TF) and Bidirectional Transformers (BERT) for forecasting the individual motion of people, without bells and whistles. We conduct an exhaustive evaluation of the input/output representations, problem formulations and sequence modelling, including a novel analysis of their capability to predict multi-modal futures. In a comparative evaluation on the ETH+UCY benchmark, both TF and BERT are top performers in predicting individual motions and remain within a narrow margin of more complex techniques that include both social interactions and scene contexts. Source code will be released for all conducted experiments.
Few-shot fine-grained classification and person search appear as distinct tasks and literature has treated them separately. But a closer look unveils important similarities: both tasks target categories that can only be discriminated by specific object details; and the relevant models should generalize to new categories, not seen during training. We propose a novel unified Query-Guided Network (QGN) applicable to both tasks. QGN consists of a Query-guided Siamese-Squeeze-and-Excitation subnetwork which re-weights both the query and gallery features across all network layers, a Query-guided Region Proposal subnetwork for query-specific localisation, and a Query-guided Similarity subnetwork for metric learning. QGN improves on a few recent few-shot fine-grained datasets, outperforming other techniques on CUB by a large margin. QGN also performs competitively on the person search CUHK-SYSU and PRW datasets, where we perform in-depth analysis.
Fault zone properties can change significantly during the seismic cycle in response to stress changes, microcracking and wall rock damage. Lab experiments show consistent changes in elastic properties prior to and after lab earthquakes (EQ), and previous works show that machine learning/deep learning (ML/DL) techniques are successful at capturing such changes. Here, we apply DL techniques to assess whether similar changes occur during the seismic cycle of tectonic EQ. The main motivation is to generalize lab-based findings to tectonic faulting, to predict failure and identify precursors. The novelty is that we use EQ traces as probing signals to estimate the fault state. We train a DL model to distinguish foreshocks, aftershocks and time to failure for the Mw 6.5 Norcia EQ of October 30th, 2016, in central Italy. We analyze a 25-second window of 3-component data around the P- and S-wave arrivals for events near the Norcia fault with M>0.5 that occurred within ±2 months of the Norcia mainshock. Normalized waveforms are used to train a Convolutional Neural Network (CNN). As a first task we divide events into two classes (foreshocks/aftershocks), and then refine the classification as a function of time-to-failure (TTF) for the mainshock. Our DL model performs very well for TTF classification into 2, 4, 8, or 9 classes for the 2 months before/after the mainshock. We explore a range of seismic ray paths near, through, and away from the Norcia mainshock fault zone. Model performance exceeds 90% for most stations. Waveform investigations show that wave amplitude is not the key factor; other waveform properties dictate model performance. Models derived from seismic spectra, rather than time-domain data, are equally good. We challenged the model in several ways to confirm the results. We found reduced performance when training the model with the wrong mainshock time and when omitting data immediately before/after the mainshock.
Foreshock/aftershock identification is also significantly degraded by removing high frequencies (filtering seismic data above 25 Hz). We tested data from different years to understand seasonality at individual stations for the time period September to December and removed these effects. Comparing these seasonality effects defined from noise with our EQ results shows that foreshocks/aftershocks for the 2016 Norcia mainshock are well resolved. Training with data containing EQ offers a huge increase in classification performance over noise only, showing that only EQ signals enable assessing timing as a function of the fault state. To confirm our results and understand which stations are able to detect changes in fault properties, we perform a further test that removes seasonality from the signals by confounding the DL model with shuffled noise (adversarial training). We conclude that DL is able to recognize variations in the stress state and fracture during the seismic cycle. The model uses EQ-induced changes in seismic attenuation to distinguish foreshocks from aftershocks and time to failure. This is an important step in ongoing efforts to improve EQ prediction and precursor identification through the use of ML and DL.
Seismic waves contain information about the earthquake (EQ) source and many forms of noise deriving from the seismometer, anthropogenic effects, background noise associated with ocean waves, and microseismic noise. Separating the noise from the EQ signal is a critical first step in EQ physics and seismic waveform analysis. However, this is difficult because optimal parameters for filtering noise typically vary with time and may strongly alter the shape of the waveform. A few recent works have employed Deep Learning (DL) models for seismic denoising, among which we take Deep Denoiser and SEDENOSS as benchmarks. These models turn the noisy trace into a 2D signal (a spectrogram) within the model to denoise the traces, which makes the process computationally heavy. We propose a novel DL-powered seismic denoising algorithm based on Diffusion Models (DMs), keeping the signal in 1D. DMs are the latest trend in Machine Learning (ML), having revolutionized the application fields of audio and image processing for denoising (DiffWave), synthesis (Stable Diffusion), and sequence modeling (STARS). The training of DMs proceeds by polluting a signal with noise until the signal has completely vanished into noise, then reversing the process by iterative denoising, conditioned on the latent signal representation. This makes DMs the ideal tool for cleaning seismic traces, as the model naturally learns from seismic sequences by denoising, which aligns the ML training procedure and the final task objective. In a preliminary evaluation, we used the Stanford Earthquake Dataset (STEAD); our proposed Diffusion-based Seismic Denoiser (DiffSD) outperforms the state-of-the-art DL methods on the Signal-to-Noise Ratio (SNR), Scale-Invariant Source-to-Distortion Ratio (SI-SDR), and Source-to-Distortion Ratio (SDR) metrics. DiffSD also yields qualitatively clean EQ traces under visual inspection in time and frequency.
Finally, DiffSD works by regenerating clean EQ signals from noise, which opens the way to data-driven EQ sequence generation, potentially instrumental for further study and dataset augmentation.
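Of the metrics named above, SI-SDR is the least self-explanatory. The sketch below follows the commonly used definition (project the estimate onto the reference, then compare target and residual energy); it is not the DiffSD evaluation code, and the synthetic sine trace, noise levels and function name `si_sdr` are assumptions for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-Invariant Source-to-Distortion Ratio in dB for 1D signals."""
    # Remove the mean so that a constant offset does not affect the score
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(residual, residual) + eps))

# Synthetic stand-in for a clean trace and two denoised estimates
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)
clean = np.sin(2 * np.pi * 5 * t)
noise = rng.standard_normal(t.size)
good = clean + 0.1 * noise   # little residual noise
poor = clean + 0.5 * noise   # more residual noise

print(si_sdr(good, clean), si_sdr(poor, clean))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR or SNR.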
For the task of semantic segmentation (SS) under domain shift, active learning (AL) acquisition strategies based on image regions and pseudo labels are state-of-the-art (SoA). The presence of diverse pseudo-labels within a region identifies pixels between different classes, which is a labeling efficient active learning data acquisition strategy. However, by design, pseudo-label variations are limited to only select the contours of classes, limiting the final AL performance. We approach AL for SS in the Poincaré hyperbolic ball model for the first time and leverage the variations of the radii of pixel embeddings within regions as a novel data acquisition strategy. This stems from a novel geometric property of a hyperbolic space trained without enforced hierarchies, which we experimentally prove. Namely, classes are mapped into compact hyperbolic areas with a comparable intra-class radii variance, as the model places classes of increasing explainable difficulty at denser hyperbolic areas, i.e. closer to the Poincaré ball edge. The variation of pixel embedding radii identifies well the class contours, but they also select a few intra-class peculiar details, which boosts the final performance. Our proposed HALO (Hyperbolic Active Learning Optimization) surpasses the supervised learning performance for the first time in AL for SS under domain shift, by only using a small portion of labels (i.e., 1%). The extensive experimental analysis is based on two established benchmarks, i.e. GTAV → Cityscapes and SYNTHIA → Cityscapes, where we set a new SoA. The code will be released.
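The idea of scoring regions by the variation of hyperbolic radii can be sketched as follows. This is only an illustration under stated assumptions, not the HALO acquisition function: it assumes a unit-curvature Poincaré ball, where the distance of a point x from the origin is 2·artanh(‖x‖), non-overlapping square regions, and the variance as the measure of variation. The function names and the toy embedding map are hypothetical.

```python
import numpy as np

def poincare_radius(emb, eps=1e-6):
    """Hyperbolic distance from the origin in the unit Poincaré ball."""
    norm = np.linalg.norm(emb, axis=-1)
    norm = np.clip(norm, 0.0, 1.0 - eps)   # keep points strictly inside the ball
    return 2.0 * np.arctanh(norm)

def region_radius_variance(emb, region):
    """Variance of pixel radii inside non-overlapping region x region blocks.

    emb: (H, W, D) pixel embeddings; returns an (H // region, W // region) map.
    """
    r = poincare_radius(emb)
    h, w = r.shape
    r = r[: h - h % region, : w - w % region]          # drop incomplete blocks
    blocks = r.reshape(h // region, region, w // region, region)
    return blocks.var(axis=(1, 3))

# Toy 4x4 map: all pixels near the origin except two pixels near the ball edge
emb = np.zeros((4, 4, 2))
emb[..., 0] = 0.1
emb[0:2, 1, 0] = 0.9
scores = region_radius_variance(emb, region=2)
print(scores)
```

In the toy map only the top-left block mixes small and large radii, so it is the only region with a non-zero score and would be the first candidate for labeling under this simplified criterion.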
Detecting the anomaly of human behavior is paramount to timely recognizing endangering situations, such as street fights or elderly falls. However, anomaly detection is complex since anomalous events are rare and because it is an open set recognition task, i.e., what is anomalous at inference has not been observed at training. We propose COSKAD, a novel model that encodes skeletal human motion by a graph convolutional network and learns to COntract SKeletal kinematic embeddings onto a latent hypersphere of minimum volume for Video Anomaly Detection. We propose three latent spaces: the commonly-adopted Euclidean and the novel spherical and hyperbolic. All variants outperform the state-of-the-art on the most recent UBnormal dataset, for which we contribute a human-related version with annotated skeletons. COSKAD sets a new state-of-the-art on the human-related versions of ShanghaiTech Campus and CUHK Avenue, with performance comparable to video-based methods. Source code and dataset will be released upon acceptance.
Considering the increasing aging of the population, multi-device monitoring of the activities of daily living (ADL) of older people becomes crucial to support independent living and early detection of symptoms of mental illnesses, such as depression and Alzheimer’s disease. Anomalies in the patient’s normal behavior, such as reduced hygiene, changes in sleep habits, and fewer social interactions, can anticipate the diagnosis of these pathologies. These abnormalities are often subtle and hard to detect. In particular, using non-intrusive monitoring devices might cause anomaly detectors to generate false alarms or ignore relevant clues. This limitation may hinder their usage by caregivers. Furthermore, the notion of abnormality here is context- and patient-dependent, thus requiring untrained approaches. To reduce these problems, we propose a self-supervised model for multi-sensor time series signals based on Hyperbolic uncertainty for Anomaly Detection, which we dub HypAD. HypAD estimates uncertainty end-to-end, thanks to hyperbolic neural networks, and integrates it into the "classic" notion of reconstruction loss in anomaly detection. Based on hyperbolic uncertainty, HypAD introduces the principle of a detectable anomaly. HypAD assesses whether it is sure about the input signal and fails to reconstruct it because it is anomalous, or whether the high reconstruction loss is due to model uncertainty, e.g., for a complex but regular signal (this parallels the residual model error upon training). The proposed solution has been incorporated into an end-to-end ADL monitoring system for elderly patients in retirement homes, developed within a funded project leveraging an interdisciplinary consortium of computer scientists, engineers, and geriatricians. Healthcare professionals were involved in the design and verification process to foster trust in the system. In addition, the system has been equipped with explainability features.
Scene-aware global human motion forecasting is critical for manifold applications, including virtual reality, robotics, and sports. The task combines human trajectory and pose forecasting within the provided scene context, which represents a significant challenge. So far, only Mao et al. NeurIPS’22 have addressed scene-aware global motion, cascading the prediction of future scene contact points and the global motion estimation. They perform the latter as the end-to-end forecasting of future trajectories and poses. However, the end-to-end approach contrasts with the coarse-to-fine nature of the task and results in lower performance, as we demonstrate here empirically. We propose STAG, a STAGed contact-aware global human motion forecasting model: a novel three-stage pipeline for predicting global human motion in a 3D environment. We first consider the scene and the respective human interaction as contact points. Secondly, we model the human trajectory forecasting within the scene, predicting the coarse motion of the human body as a whole. The third and last stage matches a plausible fine human joint motion to complement the trajectory considering the estimated contacts. Compared to the state-of-the-art (SoA), STAG achieves a 1.8% and 16.2% overall improvement in pose and trajectory prediction, respectively, on the scene-aware GTA-IM dataset. A comprehensive ablation study confirms the advantages of staged modeling over end-to-end approaches. Furthermore, we establish the significance of a newly proposed temporal counter called the "time-to-go", which tells how long it is before reaching scene contact and endpoints. Notably, STAG showcases its ability to generalize to datasets lacking a scene and achieves a new state-of-the-art performance on CMU-Mocap, without leveraging any social cues. Our code is released at: this https URL
Weather forecasting systems are capable of modeling atmospheric phenomena at various space-time scales. At very short space-time scales, nowcasting techniques still rely on processing measured data from ground-based microwave radars and satellite-based geostationary spectrometers. In this respect, precipitation field nowcasting from a few minutes up to a few hours is one of the most challenging goals, as it must provide rapid and accurate updates to civil prevention and protection decision-makers (e.g., emergency services, marine services, sport and cultural events, air traffic control, emergency management, the agricultural sector, and flood early-warning systems). Deep learning precipitation nowcasting models, based on weather radar network reflectivity measurements, have recently exceeded the overall performance of traditional extrapolation models, becoming one of the hottest topics in this field. This work proposes a novel network architecture to increase the performance of deep learning mesoscale precipitation prediction. Since precipitation nowcasting can be viewed as a video prediction problem, we present an architecture based on a Graph Convolutional Neural Network (GCNN) for video frame prediction. Our solution exploits, as a cornerstone, the topology of the Space-Time-Separable Graph-Convolutional-Network (STS-GCN), originally used for pose forecasting. We apply our model to the TAASRAD19 radar data set and compare its performance with other models, namely the Stacked Generalization (SG) Trajectory Gated Recurrent Unit (TrajGRU) and the Spectral Lagrangian extrapolation program (S-PROG). The proposed model, named STSU-GCN (Space-Time-Separable Unet3d Graph Convolutional Network), is composed of an encoder, a decoder, and a forecaster. The encoder and decoder roles are fulfilled by a Unet3d, a structure borrowed with the specific purpose of modifying the spatial component but not the temporal component.
In the bottleneck of this Unet3d network, we use a graph-based forecaster. The performance of the STSU-GCN has been quantified using conventional metrics, such as the Critical Success Index (CSI), widely used in the meteorological community for the nowcasting task. Using the TAASRAD19 radar data set and literature data, the CSI has been computed for 4 different rain-rate classes, namely 5, 10, 20, and 30 mm/h. Our STSU-GCN model outperforms both TrajGRU and S-PROG in the 10 mm/h and 20 mm/h classes, obtaining a CSI of 0.148 and 0.097, respectively. On the other hand, STSU-GCN underperforms in the 5 mm/h class, with a CSI of 0.099. For the 30 mm/h class, STSU-GCN is aligned with the results of the S-PROG benchmark, confirming that the model is skillful for classes with a high rain rate. In this work, we also illustrate the results of the proposed STSU-GCN algorithm using case studies in the area of interest of the Italian Central Apennines during the summer of 2021. Statistical performance, potential developments, and critical issues of the STSU-GCN algorithm are also discussed.
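For readers unfamiliar with the Critical Success Index, the sketch below shows the usual definition, hits / (hits + misses + false alarms), evaluated per rain-rate threshold. It is not the evaluation code of this work; the function name `csi`, the use of a "greater than or equal" comparison, and the toy rain-rate fields are assumptions for illustration.

```python
import numpy as np

def csi(pred, obs, threshold):
    """Critical Success Index for one rain-rate threshold (mm/h).

    A pixel counts as an event when its rain rate is >= threshold
    (the inclusive comparison is an assumption of this sketch).
    """
    p = pred >= threshold
    o = obs >= threshold
    hits = np.sum(p & o)            # event predicted and observed
    misses = np.sum(~p & o)         # event observed but not predicted
    false_alarms = np.sum(p & ~o)   # event predicted but not observed
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")

# Toy 2x2 rain-rate fields in mm/h
obs = np.array([[0.0, 6.0], [12.0, 25.0]])
pred = np.array([[0.0, 7.0], [8.0, 22.0]])

for thr in (5, 10, 20, 30):
    print(thr, csi(pred, obs, thr))
```

Correct negatives do not enter the formula, which is why the CSI is preferred for rare events such as heavy rain; when no pixel exceeds the threshold in either field, the score is undefined and the sketch returns NaN.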
Earthquake forecasting and prediction have long and in some cases sordid histories but recent work has rekindled interest based on advances in early warning, hazard assessment for induced seismicity and successful prediction of laboratory earthquakes. In the lab, frictional stick-slip events provide an analog for earthquakes and the seismic cycle. Labquakes are also ideal targets for machine learning (ML) because they can be produced in long sequences under controlled conditions. Indeed, recent works show that ML can predict several aspects of labquakes using fault zone acoustic emissions (AE). Here, we extend these works with: 1) deep learning (DL) methods for labquake prediction, 2) by introducing an autoregressive (AR) forecasting DL method to predict fault zone shear stress, and 3) by expanding the range of lab fault zones studied. The AR methods allow forecasting stress at future times via iterative predictions using previous measurements. Our DL methods outperform existing ML models and can predict based on limited training. We also explore forecasts beyond a single seismic cycle for aperiodic failure. We describe significant improvements to existing methods of labquake prediction and demonstrate: 1) that DL models based on Long-Short Term Memory and Convolution Neural Networks predict labquakes under conditions including pre-seismic creep, aperiodic events and alternating slow/fast events and 2) that fault zone stress can be predicted with fidelity, confirming that acoustic energy is a fingerprint of fault zone stress. Our DL methods predict time to start of failure (TTsF) and time to the end of Failure (TTeF) for labquakes. Interestingly, TTeF is successfully predicted in all seismic cycles, while the TTsF prediction varies with the amount of preseismic fault creep. We report AR methods to forecast the evolution of fault stress using three sequence modelling frameworks: LSTM, Temporal Convolution Network and Transformer Network. 
AR forecasting is distinct from existing predictive models, which predict only a target variable at a specific time. The results for forecasting beyond a single seismic cycle are limited but encouraging. Our ML/DL models outperform the state-of-the-art and our autoregressive model represents a novel framework that could enhance current methods of earthquake forecasting.
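The autoregressive scheme described above, where each prediction is fed back as input for the next step, can be sketched generically. This is not the LSTM, TCN or Transformer forecaster of the paper: `step_fn` is a hypothetical stand-in for any trained one-step model, and the linear extrapolation used here only makes the example runnable.

```python
import numpy as np

def autoregressive_forecast(step_fn, history, horizon, window):
    """Roll a one-step model forward for `horizon` steps.

    step_fn: maps the last `window` values to the next value (hypothetical
             placeholder for a trained network).
    history: past measurements, e.g. shear stress samples.
    """
    buffer = list(history[-window:])
    forecast = []
    for _ in range(horizon):
        nxt = step_fn(np.asarray(buffer[-window:]))
        forecast.append(nxt)
        buffer.append(nxt)          # feed the prediction back as input
    return np.asarray(forecast)

def linear_step(w):
    # Stand-in model: linear extrapolation from the last two samples
    return 2.0 * w[-1] - w[-2]

history = np.array([0.0, 1.0, 2.0, 3.0])
forecast = autoregressive_forecast(linear_step, history, horizon=3, window=2)
print(forecast)
```

Because predictions are recycled as inputs, errors can accumulate with the horizon, which is consistent with the abstract's remark that forecasts beyond a single seismic cycle are limited but encouraging.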
Pushing back the frontiers of collaborative robots in industrial environments, we propose a new Separable-Sparse Graph Convolutional Network (SeS-GCN) for pose forecasting. For the first time, SeS-GCN bottlenecks the interaction of the spatial, temporal and channel-wise dimensions in GCNs, and it learns sparse adjacency matrices by a teacher-student framework. Compared to the state-of-the-art, it only uses 1.72% of the parameters and it is ∼4 times faster, while still performing comparably in forecasting accuracy on Human3.6M at 1 s in the future, which enables cobots to be aware of human operators. As a second contribution, we present a new benchmark of Cobots and Humans in Industrial COllaboration (CHICO). CHICO includes multi-view videos, 3D poses and trajectories of 20 human operators and cobots, engaging in 7 realistic industrial actions. Additionally, it reports 226 genuine collisions, taking place during the human-cobot interaction. We test SeS-GCN on CHICO for two important perception tasks in robotics: human pose forecasting, where it reaches an average error of 85.3 mm (MPJPE) at 1 s in the future with a run time of 2.3 ms, and collision detection, by comparing the forecasted human motion with the known cobot motion, obtaining an F1-score of 0.64.
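The two evaluation tasks above can be illustrated with a short sketch. MPJPE is the standard mean per-joint position error; the collision check is only one plausible way of "comparing the forecasted human motion with the known cobot motion", namely flagging frames where any human joint comes closer than a distance threshold to any cobot point. The threshold, function names and toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred, gt of shape (frames, joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_flags(human, cobot, threshold_mm):
    """Per-frame flag: True when any human joint is within threshold of any cobot point.

    human: (frames, J, 3), cobot: (frames, K, 3), both in millimetres.
    The distance-threshold rule is an assumption of this sketch.
    """
    diff = human[:, :, None, :] - cobot[:, None, :, :]   # (frames, J, K, 3)
    dist = np.linalg.norm(diff, axis=-1)                 # (frames, J, K)
    return dist.min(axis=(1, 2)) < threshold_mm

# Toy data: every human joint sits 5 mm away from the cobot points at the origin
gt = np.zeros((2, 3, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
cobot = np.zeros((2, 2, 3))

print(mpjpe(pred, gt))
print(collision_flags(pred, cobot, threshold_mm=10.0))
print(collision_flags(pred, cobot, threshold_mm=1.0))
```

Running the check on forecasted rather than observed poses is what would let a cobot react before contact occurs.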
Unsupervised Domain Adaptation (UDA) is a key issue in visual recognition, as it bridges different visual domains, enabling robust performance in the real world. To date, all proposed approaches rely on human expertise to manually adapt a given UDA method (e.g. DANN) to a specific backbone architecture (e.g. ResNet). This dependency on handcrafted designs limits the applicability of a given approach over time, as old methods need to be constantly adapted to novel backbones. Existing Neural Architecture Search (NAS) approaches cannot be directly applied to mitigate this issue, as they rely on labels that are not available in the UDA setting. Furthermore, most NAS methods search for full architectures, which precludes the use of pre-trained models, essential in a vast range of UDA settings for reaching SOTA results. To the best of our knowledge, no prior work has addressed these aspects in the context of NAS for UDA. Here we tackle both aspects with an Adversarial Branch Architecture Search for UDA (ABAS): i. we address the lack of target labels by a novel data-driven ensemble approach for model selection; and ii. we search for an auxiliary adversarial branch, attached to a pre-trained backbone, which drives the domain alignment. We extensively validate ABAS to improve two modern UDA techniques, DANN and ALDA, on three standard visual recognition datasets (Office31, Office-Home and PACS). In all cases, ABAS robustly finds the adversarial branch architectures and parameters which yield the best performance. https://github.com/lr94/abas
Human pose forecasting is a complex structured-data sequence-modelling task, which has received increasing attention, also due to numerous potential applications. Research has mainly addressed the temporal dimension as time series and the interaction of human body joints with a kinematic tree or by a graph. This has decoupled the two aspects and leveraged progress from the relevant fields, but it has also limited the understanding of the complex structural joint spatio-temporal dynamics of the human pose. Here we propose a novel Space-Time-Separable Graph Convolutional Network (STS-GCN) for pose forecasting. For the first time, STS-GCN models the human pose dynamics only with a graph convolutional network (GCN), including the temporal evolution and the spatial joint interaction within a single-graph framework, which allows the cross-talk of motion and spatial correlations. Concurrently, STS-GCN is the first space-time-separable GCN: the space-time graph connectivity is factored into space and time affinity matrices, which bottlenecks the space-time cross-talk, while enabling full joint-joint and time-time correlations. Both affinity matrices are learnt end-to-end, which results in connections substantially deviating from the standard kinematic tree and the linear-time time series. In experimental evaluation on three complex, recent and large-scale benchmarks, Human3.6M [Ionescu et al. TPAMI’14], AMASS [Mahmood et al. ICCV’19] and 3DPW [Von Marcard et al. ECCV’18], STS-GCN outperforms the state-of-the-art, surpassing the current best technique [Mao et al. ECCV’20] by over 32% in average at the most difficult long-term predictions, while only requiring 1.7% of its parameters. We explain the results qualitatively and illustrate the graph interactions by the factored joint-joint and time-time learnt graph connections. Our source code is available at https://github.com/FraLuca/STSGCN
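The factorization of the space-time graph into separate space and time affinity matrices can be sketched with a single einsum. This is a simplified illustration, not the released STS-GCN layer: it assumes one shared joint-joint matrix `a_space` of shape (V, V) and one shared time-time matrix `a_time` of shape (T, T), random values in place of learned parameters, and no non-linearity. All names and sizes are hypothetical.

```python
import numpy as np

def separable_graph_conv(x, a_time, a_space, w):
    """One space-time-separable graph convolution (simplified sketch).

    x: (T, V, C_in) pose sequence, a_time: (T, T), a_space: (V, V),
    w: (C_in, C_out). Computes A_time * A_space * X * W without ever
    building the full (T*V, T*V) space-time adjacency.
    """
    return np.einsum("ts,vu,suc,cd->tvd", a_time, a_space, x, w)

rng = np.random.default_rng(0)
T, V, C_in, C_out = 4, 5, 3, 2
x = rng.standard_normal((T, V, C_in))
a_time = rng.standard_normal((T, T))
a_space = rng.standard_normal((V, V))
w = rng.standard_normal((C_in, C_out))

y = separable_graph_conv(x, a_time, a_space, w)

# Reference: the same result with an explicit full space-time adjacency,
# built as the Kronecker product of the two factors
a_full = np.kron(a_time, a_space)                       # (T*V, T*V)
y_full = (a_full @ x.reshape(T * V, C_in) @ w).reshape(T, V, C_out)

print(np.allclose(y, y_full))
print("separable params:", T * T + V * V, "full params:", (T * V) ** 2)
```

The comparison makes the bottleneck explicit: the factored form is a constrained version of a full space-time adjacency, and it needs T² + V² adjacency parameters instead of (TV)², in line with the parameter savings the abstract reports.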