Handling safety is crucial for achieving lifelong autonomy in robots. Unsafe situations may arise during manipulation in unstructured environments due to noise in sensory feedback, improper action parameters, hardware limitations, or external factors. Ensuring safety therefore requires continuous execution monitoring and failure detection. To this end, we present a multimodal monitoring system for detecting manipulation failures. Rather than relying on a single sensor modality, we integrate several modalities to improve detection performance across different failure cases. In our system, high-level proprioceptive, auditory, and visual predicates are extracted by processing each modality separately, and the extracted predicates are then fused. Experiments on a humanoid robot in tabletop manipulation scenarios indicate that the contribution of each modality depends on the action being executed, and that multimodal fusion improves overall failure detection performance compared with unimodal processing.
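
To make the idea of predicate-level fusion concrete, the following is a minimal sketch and not the system evaluated here. It assumes that each modality yields a small set of Boolean predicates and that a failure is flagged whenever a fused predicate contradicts what the current action expects. The predicate names (`gripper_closed_on_object`, `drop_sound`, `object_in_hand`), the helper functions, and the fusion rule itself are hypothetical and chosen only for illustration.

```python
from typing import Dict

Predicates = Dict[str, bool]


def fuse(*modalities: Predicates) -> Predicates:
    """Merge per-modality predicate sets into one set.

    Assumption: predicate names are disjoint across modalities.
    """
    fused: Predicates = {}
    for predicates in modalities:
        overlap = fused.keys() & predicates.keys()
        if overlap:
            raise ValueError(f"duplicate predicate names: {sorted(overlap)}")
        fused.update(predicates)
    return fused


def detect_failure(observed: Predicates, expected: Predicates) -> bool:
    """Flag a failure if any observed predicate contradicts its expected value.

    Predicates that were not observed are ignored, so a single modality
    can only report on the predicates it actually provides.
    """
    return any(
        name in observed and observed[name] != value
        for name, value in expected.items()
    )


# Hypothetical expectations for a "pick" action.
expected_pick = {
    "gripper_closed_on_object": True,  # proprioceptive
    "drop_sound": False,               # auditory
    "object_in_hand": True,            # visual
}

# Hypothetical observations: the gripper reports a grasp, but the object
# slipped, which is only visible in the auditory and visual predicates.
proprioceptive = {"gripper_closed_on_object": True}
auditory = {"drop_sound": True}
visual = {"object_in_hand": False}

unimodal_result = detect_failure(proprioceptive, expected_pick)
multimodal_result = detect_failure(
    fuse(proprioceptive, auditory, visual), expected_pick
)
```

In this toy case the proprioceptive predicates alone do not reveal the failure, whereas the fused set does, which mirrors the motivation for combining modalities. A learned or probabilistic fusion scheme could replace the simple rule used in this sketch.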