Accurate determination of thermospheric neutral density is crucial for satellite drag calculations. The problem is twofold: correctly estimating the quiet time climatology and the storm time variations. In this work, neutral density estimates from two empirical and three physics-based models of the ionosphere-thermosphere are compared with neutral densities along the Challenging Minisatellite Payload (CHAMP) satellite track for six geomagnetic storms. Storm time variations are extracted from the neutral density by (1) subtracting the mean difference between model and observation (the bias), (2) setting climatological variations to zero, and (3) multiplying the model data by the quiet time ratio between model and observation. Several metrics are employed to evaluate model performance. We find that removing the bias or the climatology reveals a model's actual skill in simulating the storm time variations. When the bias is removed, depending on the event and model, storm time errors in neutral density can decrease by as much as 113% or increase by as much as 12% relative to the error in models that retain the quiet time bias. We show that judging model performance from only the average and maximum values of neutral density can be misleading, since a model may estimate the averages fairly well yet fail to capture the maximum, or vice versa. Because each metric captures a different aspect of the error, we suggest employing mean absolute error, prediction efficiency, and normalized root mean square error together as a standard set of metrics for neutral density.
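The debiasing step and the recommended metrics can be sketched in a few lines. The following is a minimal illustration with hypothetical variable names and synthetic-array inputs; the direction of the quiet time ratio (observation over model, so that quiet time means match) and the normalization of the RMSE by the mean observed density are assumptions, not taken from the paper.

```python
import numpy as np

def remove_bias(model, obs, quiet_mask):
    """Method (1): subtract the quiet-time mean model-minus-observation difference."""
    bias = np.mean(model[quiet_mask] - obs[quiet_mask])
    return model - bias

def scale_by_quiet_ratio(model, obs, quiet_mask):
    """Method (3): multiply model data by the quiet-time ratio.
    Ratio direction (obs/model) is an assumption chosen so quiet-time means match."""
    ratio = np.mean(obs[quiet_mask]) / np.mean(model[quiet_mask])
    return model * ratio

def mae(model, obs):
    """Mean absolute error."""
    return np.mean(np.abs(model - obs))

def prediction_efficiency(model, obs):
    """PE = 1 - SSE / variance-sum of observations; PE = 1 for a perfect model."""
    return 1.0 - np.sum((obs - model) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def nrmse(model, obs):
    """Root mean square error, normalized here by the mean observed density (assumption)."""
    return np.sqrt(np.mean((model - obs) ** 2)) / np.mean(obs)
```

A model with a constant quiet time offset, for example, would score a nonzero MAE before `remove_bias` and zero after, which is exactly the distinction between quiet time bias and storm time error that the comparison above relies on.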