Can anyone here comment on SSIM vs MSE?
Does SSIM indeed perform better than MSE?
Thanks so much,
Aaron
You could use something like this:
https://github.com/pornel/dssim
to take care of the color problem. It works in L*a*b* color space, which may not be perfect but should at least be perceptually better than RGB, YCbCr or YCoCg.
Not really. It's mostly different, but not really better. SSIM alone (without multiscale) is little more than PSNR with a simplistic masking term, i.e. it is not hard to see that
1 - SSIM(x,y) ≈ MSE / (1 + x² + y²)   (IIRC).
This denominator is called "visual masking" in the spatial domain, with a masking exponent of two. While this is probably better than nothing, the masking exponent does not quite fit the results of subjective experiments - the masking exponent in the DC domain is somewhere between 1 and 2, and the masking exponent for higher frequencies is even below 1.
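For what it's worth, here is a minimal numeric sketch of that relation (my own illustration, not from any paper): for a single window with roughly equal means, 1 - SSIM reduces to the residual variance divided by the signal energies plus the SSIM constant, i.e. an energy-normalized MSE. The 8x8 patch, the noise level and the constants are arbitrary choices.

```python
import numpy as np

def ssim_window(x, y, L=255.0, K1=0.01, K2=0.03):
    """Single-window SSIM as in the original SSIM paper (no Gaussian weighting)."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 255, size=(8, 8))          # reference patch
y = x + rng.normal(0, 10, size=(8, 8))        # mildly distorted copy

mse = ((x - y) ** 2).mean()
C2 = (0.03 * 255) ** 2
print("1 - SSIM          :", 1 - ssim_window(x, y))
print("MSE / energy term :", mse / (x.var() + y.var() + C2))
```

The two printed numbers come out very close, which is all the "masking" that plain SSIM adds over MSE.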
Even worse, SSIM does not model the CSF - contrast sensitivity function - which is a very well-known effect in the frequency domain.
Multiscale-SSIM does have "some" way to model the CSF, but masking is still off.
One way or another, if you talk to the folks who run such subjective experiments, they will tell you that SSIM or MS-SSIM is really not much better than MSE or PSNR - don't trust it.
If you're looking for a *good* subjective quality index, go for VDP or hdrVDP - this is worth something - but it requires proper calibration and a couple of parameters that describe the display luminance, the viewing distance and the color space of the image (naturally!).
That SSIM, MSE and a lot of other attempts do not require such parameters should at least make you suspicious.
Oh, SSIM is so simple that it's not hard to "fix it" for color. It is crude enough that a very simple modification makes it applicable to color: convert the RGB image to some sort of opponent-color space (similar to what your brain does), then measure SSIM on the luminance and the two opponent color coordinates, with a large weight on luminance and small weights on the two chroma coordinates. A very simple approach that works "well enough" is to convert to YCbCr and use a weight of 0.8 on luma and 0.1 each on Cb and Cr.
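A minimal sketch of that recipe, assuming a plain per-channel SSIM is good enough for illustration (here via scikit-image) and a full-range BT.601 YCbCr conversion; the 0.8/0.1/0.1 weights are the ones quoted above, not calibrated values:

```python
import numpy as np
from skimage.metrics import structural_similarity

def rgb_to_ycbcr(img):
    """Full-range BT.601 RGB -> YCbCr, img as a float array in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

def weighted_color_ssim(ref_rgb, test_rgb, weights=(0.8, 0.1, 0.1)):
    """Weighted sum of per-channel SSIM scores in YCbCr."""
    ref = rgb_to_ycbcr(ref_rgb.astype(np.float64))
    test = rgb_to_ycbcr(test_rgb.astype(np.float64))
    scores = [structural_similarity(ref[..., c], test[..., c], data_range=255.0)
              for c in range(3)]
    return sum(w * s for w, s in zip(weights, scores))
```

Feeding it two same-sized RGB arrays gives a single score in the usual SSIM range, dominated by luma as described above.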
Then, it is also not too hard to create a JPEG 2000 encoder that is optimal in SSIM (not PSNR) - I did this a while ago for a DCC paper - and check the resulting images. Indeed, the SSIM score goes through the roof, but the images do not really look better. They only look "different" (neither better nor worse, just different artifacts).
The problem here is really that SSIM lacks a good physiological model. A simple CSF modification of JPEG 2000 plus a simple masking model yields better results (subjectively) than a SSIM-optimized encoder.
Yes. Been there, done that. You find the results here:
Th. Richter, K.J. Kim: "A MS-SSIM Optimal JPEG 2000 Encoder", Proc. of Data Compression Conference, Snowbird, 2009. (You'll find it on the IEEE Explorer)
If you're interested, I can send you the paper. I should even have the software - back then, Kim and I implemented this on JJ2000, not on Accusoft's proprietary coder.
Thanks for your insights, Thomas. Yes, I would very much like to see a copy of your paper.
Aaron
We have just opensourced butteraugli, a new non-parametric method for estimating the noticeability of lossy compression artefacts.
https://github.com/google/butteraugli
I ran butteraugli's compare_pngs on the Lena image (original and compressed/decompressed with webp).
I got:
original/original: 0
original/webp default: 3.77
original/webp q90: 2.31
original/webp q95: 1.96
original/empty png: 44.68
It all looks sensible, but I think the cutoff kButteraugliBad = 2.095 is too strict: the default webp compression yields a good image (hard to spot the errors at first sight), yet it scores 3.77.
It is also a bit hard to make sense of the result. Is there a way to linearize the returned values? Do you have metrics against a variety of images?
A value below 1.6 is great, a value below 2.1 okay-ish. Above 2.1 there is likely a noticeable artefact in an in-place flip test. You can try the flip test with your images.
Typically we saw good results with JPEG at quality 92 to 94 in YUV444 mode. I am not an expert on the performance of WebP's lossy part, but I believe it does not perform very well at the highest possible quality, i.e. comparable to JPEGs above quality 92 in YUV444.
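If it helps to have the rule of thumb in code, here is a trivial helper reflecting the thresholds mentioned above (1.6 / 2.1); the wording of the verdicts is mine, not an official butteraugli classification:

```python
def butteraugli_verdict(score: float) -> str:
    """Map a butteraugli distance to the rough categories from this thread."""
    if score < 1.6:
        return "great - differences should be below noticeability"
    if score < 2.1:
        return "okay-ish"
    return "likely a noticeable artefact in an in-place flip test"

# e.g. butteraugli_verdict(3.77) -> "likely a noticeable artefact in an in-place flip test"
```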
Butteraugli is rather slow. Do you have an idea how much room for speedup there is?
What's the story behind the name butteraugli?
I wanted more vowels than there are in PSNRHVS-M and MS-SSIM-YUV, an association with the human eye... and with small breads (gipfeli, zopfli, brotli). I deliberately chose an overly complex term to avoid creating homonym noise for something as specific as this.
Voisilmäpulla, Finnish butter eye buns, translated to German is something like Butteraugebrötchen, and I took the liberty to invent a new pseudo-Swiss-German word from it, and butteraugli was born.
http://www.food.com/recipe/finnish-b...m-pulla-326192 -- Tasty with filter coffee.
Thanks for the explanation. Now I am getting hungry...
BTW, there is a division by zero in Average5x5() when (x,y) points to the lower right corner of an image (n == 0).
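To make the failure mode concrete, here is a hypothetical sketch of a clipped 5x5 box average - this is not butteraugli's actual Average5x5() code, just an illustration of how a neighbour count n can end up zero at the bottom-right corner if the loop bounds are off, and how to guard against it:

```python
import numpy as np

def average5x5(img: np.ndarray, x: int, y: int) -> float:
    """Mean over the 5x5 neighbourhood of (x, y), clipped to the image borders."""
    h, w = img.shape
    y0, y1 = max(0, y - 2), min(h, y + 3)   # half-open bounds, never empty
    x0, x1 = max(0, x - 2), min(w, x + 3)
    window = img[y0:y1, x0:x1]
    n = window.size
    if n == 0:                              # defensive guard against the n == 0 case above
        return float(img[y, x])
    return float(window.sum() / n)
```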
Jyrki, I don't see any information about the development of butteraugli.
Did you test it against some ground truth of human observation tests?
We have a small internal test of ~2000 image pairs, and about 10 specially-designed calibration images for those psychovisual effects we have chosen to model in butteraugli.
We did compare with TID2008 and TID2013, but those databases seek to answer a different question - "How awful is the degradation?" - while ours concentrates on "Can you notice the degradation?"
Our analysis is not water-tight, but only indicative, partly because we use all our data to optimize the model. I will write a better report about it within the next two months, as well as push a new improved version.
It is relatively "easy" to come up with a good "sub-threshold" quality index (i.e. the question your tool wants to answer). The effects are pretty much known: Visual making, CSF, Cortex filter. If so, I would really really recommend to look into VDP-2 because it has a pretty elaborate model. Unfortunately, due to the filters, it can only model self-masking, and it has no idea about color and the chroma CSF. Super-threshold (i.e. "how awful is the compression") is a much harder question, and it is also observer-dependent (do you prefer block defects or a blurry picture? Entirely subjective). There are a couple of algorithms that claim to work in this domain (SSIM is one of them), though in reality, in the tests I made back then, VDP-2 still works best in this domain. I wouldn't focus on a single dataset. TID2013 might be ok, but I would also suggest to look into LIVE2, and I would certainly also suggest to make subjective tests. If you're interested, I'm here in contact with a group of people/labs in Europe named "Qualinet" that do that on a professional basis (i.e. run subjective tests), and there are also a couple of good papers on crowdsourcing subjective evaluation (with all potential dangers of this approach). Should be in the proceedings of the Qomex. (Which is probably the conference I would recommend to you the most in this case).
Unfortunately, no. There is a C++ implementation of an older version that is open source and ready to download, there is a closed-source C++ implementation that was done under the administration of Dolby (and for which I got the promise that it would be open sourced at some point), and there is the open source Matlab implementation you surely know. I currently use the Matlab implementation with a bash wrapper around it so I can use it in my automated tests, but indeed, this situation is not ideal.
I'd love to see the power of Google used to generate a new large-scale human rating database of image distortions.
Something like a little web page that shows the original & two distortions, and the human picks which one looks best. Or maybe rates them. I'm not sure.
I spent some time on this problem before, and I *think* my solution was pretty good, but I decided that without better validation data it's all a bit questionable.
At the moment there are way too many metrics and no clear way to tell whether they are working.
The other really big problem that's not solved well at the moment is a more perceptual metric that can be used in-loop for R/D optimization. All the perceptual metrics are much too slow for this. The only attempt I even know of in this domain is x264's SATD hack, which is a big improvement over just using SAD or SSD, but surely there's something better.
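For reference, a rough sketch of the SATD idea (sum of absolute transformed differences): Hadamard-transform the prediction residual of a small block and sum the absolute coefficients. This is cheap enough for in-loop use and tracks perceived cost better than plain SAD/SSD; the 4x4 block size and missing normalization here are illustrative choices, not x264's exact code.

```python
import numpy as np

# 4x4 Hadamard matrix used for the transformed-difference measure.
H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]], dtype=np.int64)

def satd4x4(orig: np.ndarray, pred: np.ndarray) -> int:
    """SATD of a 4x4 block: Hadamard-transform the residual, sum absolute coefficients."""
    resid = orig.astype(np.int64) - pred.astype(np.int64)
    coeffs = H4 @ resid @ H4.T          # separable 2-D Hadamard transform
    return int(np.abs(coeffs).sum())
```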
Actually, it's possible to include MS-SSIM in JPEG 2000 - that's not too hard and quite doable. It's just that the results are not much different (MS-SSIM scores are great, but the images do not look much better; they look different). Of course, you always have to make compromises. But elements like visual masking or visual weighting (CSF) are not hard to add to a JPEG 2000 encoder.
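As a loose illustration of that kind of visual weighting (my own sketch, with made-up placeholder weights rather than calibrated CSF values): scale each wavelet subband's distortion contribution by a per-subband weight before rate allocation, so that bands the eye resolves poorly are cheaper to distort.

```python
# Hypothetical subband -> weight table, keyed by (decomposition level, orientation)
# for a 3-level DWT. Real encoders derive these from a CSF and viewing conditions.
CSF_WEIGHTS = {
    (1, "HH"): 0.30, (1, "HL"): 0.45, (1, "LH"): 0.45,   # finest level: eye least sensitive
    (2, "HH"): 0.60, (2, "HL"): 0.75, (2, "LH"): 0.75,
    (3, "HH"): 0.85, (3, "HL"): 0.95, (3, "LH"): 0.95,
    (3, "LL"): 1.00,                                     # coarse approximation band: full weight
}

def weighted_distortion(subband_mse: dict) -> float:
    """Sum of per-subband MSE contributions scaled by the CSF weights above."""
    return sum(CSF_WEIGHTS[band] * mse for band, mse in subband_mse.items())
```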