Predicting the interestingness of media content remains an important but challenging research subject. The difficulty stems first from the fact that, beyond being a high-level semantic concept, interestingness is highly subjective, and no globally agreed definition of it exists yet. This paper presents the use of recent deep learning techniques for solving the task. We perform experiments with both a social-driven dataset (i.e., Flickr videos) and a content-driven dataset (i.e., videos from the MediaEval 2016 interestingness task). To account for the temporal and multimodal nature of videos, we test various deep neural network (DNN) architectures, including a new combination of several recurrent neural networks (RNNs) that handles several temporal samples at the same time. We then investigate different strategies for dealing with unbalanced datasets. Multimodality, in the form of mid-level fusion of audio and visual information, proves beneficial to the task. We also establish that social interestingness differs from content interestingness.
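For readers unfamiliar with mid-level multimodal fusion, the sketch below illustrates the general pattern the abstract alludes to: one RNN per modality consuming a sequence of temporal samples, with the modality-level hidden states concatenated before classification. It is a minimal illustration, not the authors' actual model; the GRU cells, layer sizes, and feature dimensions (`audio_dim`, `visual_dim`, `hidden`) are all assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation) of mid-level fusion of
# audio and visual streams with per-modality RNNs. All dimensions and the
# choice of GRU cells are assumptions.
import torch
import torch.nn as nn

class MidFusionRNN(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=2048, hidden=256):
        super().__init__()
        # One RNN per modality, each consuming a sequence of
        # segment-level features (several temporal samples).
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True)
        # Mid-level fusion: concatenate the final hidden states of the
        # two modality streams, then classify interesting vs. not.
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, audio_seq, visual_seq):
        _, h_a = self.audio_rnn(audio_seq)    # h_a: (1, batch, hidden)
        _, h_v = self.visual_rnn(visual_seq)  # h_v: (1, batch, hidden)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=1)
        return self.classifier(fused)

# Usage with dummy data: a batch of 4 videos, 16 temporal samples each.
model = MidFusionRNN()
logits = model(torch.randn(4, 16, 128), torch.randn(4, 16, 2048))
print(logits.shape)  # torch.Size([4, 2])
```

As for the unbalanced-data strategies the abstract mentions, one common option (named here only as an illustration, not as the paper's choice) is class weighting in the loss, e.g. `nn.CrossEntropyLoss(weight=torch.tensor([w_neg, w_pos]))` with weights inversely proportional to class frequencies.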