This paper addresses automatic facial expression recognition in videos, where the goal is to predict the discrete emotion labels that best describe the emotions expressed in short video clips. Building on a pre-trained convolutional neural network (CNN) that analyzes the video frames and a long short-term memory (LSTM) network that processes the trajectories of facial landmarks, the paper investigates several novel directions. First, improved face descriptors based on 2D CNNs and facial landmarks are proposed. Second, the paper investigates methods for fusing the features temporally, including a novel hierarchical recurrent neural network that combines facial landmark trajectories over time. In addition, we propose a simple modification that adapts state-of-the-art expression recognition architectures to video processing. Both ensemble approaches integrate temporal information. Comparative experiments on publicly available video-based facial expression recognition datasets show that the proposed framework outperforms state-of-the-art methods. Moreover, we introduce a near-infrared video dataset of facial expressions from subjects driving their cars, recorded in real-world conditions.