A robot with audiovisual attention can distinguish between multiple signal sources in its environment and use that information to track down only the object of interest. Here, a design for such a system is introduced as a two-level hierarchy. The low-level modules handle visual and auditory perception; the latter relies on incremental learning to classify sound signals. The second-level module fuses the information provided by the low-level modules into a representation of audiovisual objects. The developmental structure is established in two phases. The first phase (exposure) builds the base knowledge of the second-level module through offline training. The second phase (reappraisal) lets the system correct false guesses: wrongly guessed instances are stored in a doubt memory and revisited once the system has acquired more knowledge. The approach is verified on a custom-made audiovisual dataset by measuring accuracy at each stage of the system's behavior and the accuracy gain obtained in each phase.
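The reappraisal phase can be illustrated with a minimal sketch. It is not the system described here, only a toy under stated assumptions: the incremental learner is replaced by a running-mean nearest-centroid classifier, a guess is treated as doubtful when the distance margin between the two nearest classes falls below a threshold, and all names (`IncrementalCentroidClassifier`, `DoubtMemory`, `min_margin`) and the feature values are hypothetical.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class IncrementalCentroidClassifier:
    """Toy stand-in for an incremental learner: one running-mean centroid per label."""
    centroids: dict = field(default_factory=dict)
    counts: dict = field(default_factory=dict)

    def learn(self, features, label):
        # Update the running mean for this label without retraining from scratch.
        n = self.counts.get(label, 0)
        old = self.centroids.get(label, [0.0] * len(features))
        self.centroids[label] = [(o * n + f) / (n + 1) for o, f in zip(old, features)]
        self.counts[label] = n + 1

    def predict(self, features):
        # Return (label, margin): margin is the distance gap between the two nearest centroids.
        ranked = sorted((dist(features, c), lbl) for lbl, c in self.centroids.items())
        best_distance, best_label = ranked[0]
        margin = ranked[1][0] - best_distance if len(ranked) > 1 else float("inf")
        return best_label, margin


@dataclass
class DoubtMemory:
    """Holds doubtful guesses so they can be revisited later (assumed criterion: low margin)."""
    min_margin: float = 0.5
    pending: list = field(default_factory=list)

    def guess(self, clf, features):
        label, margin = clf.predict(features)
        if margin < self.min_margin:
            # Keep the instance and the provisional label for later reappraisal.
            self.pending.append((features, label))
        return label

    def reappraise(self, clf):
        # Re-classify stored instances; report those whose label changed with enough margin.
        corrections, still_doubtful = [], []
        for features, old_label in self.pending:
            new_label, margin = clf.predict(features)
            if margin < self.min_margin:
                still_doubtful.append((features, old_label))
            elif new_label != old_label:
                corrections.append((features, old_label, new_label))
        self.pending = still_doubtful
        return corrections


# Exposure: a small amount of base knowledge (hypothetical 2-D sound features).
clf = IncrementalCentroidClassifier()
clf.learn([0.0, 0.0], "speech")
clf.learn([4.0, 0.0], "music")

# An ambiguous instance is guessed with a small margin and stored as doubtful.
memory = DoubtMemory(min_margin=0.5)
first_guess = memory.guess(clf, [1.9, 0.0])

# More knowledge arrives incrementally, then the doubtful instance is revisited.
for sample in ([2.0, 0.0], [1.8, 0.0], [2.2, 0.0]):
    clf.learn(sample, "music")
corrections = memory.reappraise(clf)
```

In this toy run the ambiguous instance is first guessed as `speech`, and after three further `music` examples the reappraisal step relabels it as `music` and clears it from the doubt memory. The margin test is only one possible doubt criterion; a system with access to feedback could instead store instances that were explicitly flagged as wrong.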