An autonomous robot must navigate without a human operator despite changing conditions and mobile objects in its environment. Achieving this requires knowledge of both the dynamic state of the robot and the 3D structure of the environment. When a continuous image sequence is discretized, three types of motion can be observed: 1) a mobile camera (or, as in this study, a robot) capturing a static scene, 2) objects moving independently in front of a static camera, or 3) a camera and independent objects moving simultaneously at any given time. One of the main challenges is the presence of one or more mobile objects whose velocity and direction are independent of the environment. In this work, we introduce visual ego-motion estimation for robots based on unsupervised learning from stereo video captured by an onboard camera. In addition, audio perception is fused with the visual estimate in order to identify the source of the motion. We verified the effectiveness of our approach in three experiments covering simultaneous robot and object motion, robot-only motion, and object-only motion.