I've been asked to standardise the dataset that I used in my recent benchmark. I refused because I think the dataset is substandard, mainly because it contains numerous series of similar images.
There are also some choices I made that I'm not sure were right, and I'd like to hear what others think about them.
During selection, I assigned equal probability to all images. Was that right? I don't know, but there were vast differences between the numbers of images from different sites. Quite a few sites didn't host any PNGs, while some others (especially Chinese ones) had tons of them. Wouldn't it be better to apply some weights? Shouldn't front page images get a higher weight than ones on deeper pages? Or maybe the weight should depend on the site's position in the ranking? And how should duplicates be treated? Shouldn't they be used to adjust the weight? For example, the Google logo got selected, but I think its chance of selection was way lower than its importance warrants.
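To make the weighting question concrete, here is a minimal sketch of one possible scheme, not what I actually did. It assumes every found image is recorded as a hypothetical (site, image_hash) pair, one pair per occurrence. Each site contributes a total weight of 1, split across its images, so a site with tons of PNGs doesn't swamp the others, and an image that shows up on many sites accumulates weight from each of them. The function names are made up for illustration, and front page or ranking factors are left out.

```python
import random
from collections import Counter, defaultdict


def selection_weights(occurrences):
    """Compute a weight per unique image.

    occurrences: iterable of (site, image_hash) pairs, one per time
    an image was found. Each site distributes a total weight of 1
    over its images, proportionally to how often each one occurs.
    """
    per_site = defaultdict(Counter)
    for site, image_hash in occurrences:
        per_site[site][image_hash] += 1

    weights = Counter()
    for counts in per_site.values():
        total = sum(counts.values())
        for image_hash, n in counts.items():
            weights[image_hash] += n / total
    return dict(weights)


def weighted_sample(weights, k, seed=0):
    """Pick k distinct images, favouring higher weights.

    Uses the Efraimidis-Spirakis method: each item gets the key
    u ** (1 / w) with u uniform in [0, 1), and the k largest keys win.
    """
    rng = random.Random(seed)
    keyed = [
        (rng.random() ** (1.0 / w), image_hash)
        for image_hash, w in sorted(weights.items())
    ]
    keyed.sort(reverse=True)
    return [image_hash for _, image_hash in keyed[:k]]
```

With this scheme a logo duplicated across many sites ends up with a much higher chance of selection than a one-off image from a single huge gallery. Whether that is the right notion of importance is exactly the open question.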