DVS - A Tapestry of Diversity

image-2.png

CREATED BY: Daniel Zvinca

DESCRIPTION:

This is, in my opinion, a refined version of the original post with the same message: DVS is heterogeneous. While the original salt-and-pepper approach conveyed the heterogeneity quite well, this version adds, as a bonus, a visual aggregation of visual skills (the pivotal criterion of DVS) across the distributions of the other criteria. I guess an animation that moves the dots from the visual distribution pile to the data distribution pile and then to the society pile would work great.

As mentioned, I am using a KDE dot plot that I developed myself, a form of display that uses the kernel density influence to adjust the position of values within a given tolerance. Unlike Mike Bostock’s d3 smooth beeswarm approach, which uses physics interactions (collisions) to build a “natural” pile look, this approach is statistically consistent and also preserves the order of the values (more about it in my LinkedIn article).
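As a rough illustration of the density ingredient only, here is a minimal sketch of a Gaussian kernel density estimate evaluated at each data value. The function name `gaussian_kde_at` and the fixed `bandwidth` parameter are assumptions made for this example. The sketch does not reproduce the actual KDE dot plot, and it does not cover how positions are adjusted within the tolerance.

```python
import math

def gaussian_kde_at(values, bandwidth):
    """Evaluate a Gaussian kernel density estimate at every data value.

    `bandwidth` is the kernel standard deviation (a hypothetical,
    hand-picked parameter; no automatic bandwidth selection is done).
    """
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in values
    ]

# Values sitting in a cluster get a higher density than an isolated value,
# which is the kind of signal a dot pile could use to decide how much
# room a region of the axis needs.
sample = [1.0, 1.1, 1.2, 5.0]
densities = gaussian_kde_at(sample, bandwidth=0.5)
```

In this sample the density at 1.1 is higher than the density at 5.0, because three values fall within one bandwidth of 1.1 and only one value lies near 5.0.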

The “classic” visual estimation of correlation based on scatterplots rarely works for a general audience. Moreover, the scatterplot method fails when many values overlap (as in the DVS dataset) or when the dataset is large. More importantly for me, a scatterplot loses the perception of the individual distributions (though marginal distribution representations are possible).

The visual estimation of how equally sized ranges of one distribution spread across another, using a sequential color palette, works very well for large amounts of data. If the count is above a few hundred, the KDE dot plot also provides a sense of cardinality and individual interactivity. Above several thousand, density plots or histograms can be used with the same sequential color palette logic. The range size is influenced by the number of distinct colors we are able to identify at a glance: 4 or 5 colors work just fine, 10 is a stretch. However, given the smooth aspect of density graphs, for a large enough graph size, hairline curves with a higher contrast can be used to better delimit the “layers”, so I guess 10 will also work. I might write a detailed post about it that will also look into the alternative of using percentile-based color rules (quartiles, quintiles or deciles) instead of equally sized ranges, we’ll see.
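The two color rules mentioned above, equally sized ranges and percentile-based groups, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the percentile variant is a simple rank-based grouping (tied values share a group), and each returned index is meant to select one step of a sequential palette with `k` colors.

```python
from bisect import bisect_left

def equal_width_bins(values, k=5):
    """Index (0..k-1) of the equally sized range each value falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum would land in bin k, so clamp it into the last range.
    return [min(int((v - lo) / width), k - 1) for v in values]

def quantile_bins(values, k=5):
    """Index (0..k-1) of the rank-based group (k=4: quartiles, k=10: deciles)."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left gives the rank of the first occurrence, so ties share a group.
    return [min(bisect_left(ordered, v) * k // n, k - 1) for v in values]

skewed = [1, 2, 3, 4, 100]
by_range = equal_width_bins(skewed, k=5)   # [0, 0, 0, 0, 4]
by_rank = quantile_bins(skewed, k=5)       # [0, 1, 2, 3, 4]
```

On skewed data the two rules behave very differently: equally sized ranges put almost everything in the first color, while the rank-based rule gives each color about the same number of dots.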

The visual perception I am counting on is this: *two correlated distributions will show a more consistent color similarity across ranges, while two uncorrelated distributions will show a visible spread of each of the ranges (or percentiles, if those are used for color encoding)*. Visual inspection of a scatterplot might still be the reference method or preferred choice for statisticians, but it is not without limitations.
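A numeric counterpart of that perception, as a minimal sketch under assumptions: color each record by its range in a reference variable, then count where each color lands among the ranges of a second variable. The names `range_index` and `color_by_range_table` are hypothetical, and equally sized ranges are assumed for both variables.

```python
def range_index(values, k):
    """Index (0..k-1) of the equally sized range each value falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid dividing by zero for constant data
    return [min(int((v - lo) / width), k - 1) for v in values]

def color_by_range_table(reference, other, k=5):
    """table[c][r] = number of records with color c (range of `reference`)
    that fall into range r of `other`."""
    colors = range_index(reference, k)
    places = range_index(other, k)
    table = [[0] * k for _ in range(k)]
    for c, r in zip(colors, places):
        table[c][r] += 1
    return table

reference = list(range(10))
correlated = [2 * v for v in reference]
shuffled = [3, 8, 1, 6, 9, 0, 5, 2, 7, 4]

tight = color_by_range_table(reference, correlated)  # counts sit on the diagonal
loose = color_by_range_table(reference, shuffled)    # counts scatter off the diagonal
```

When the two variables are correlated, each color stays within one range of the second variable, so the counts concentrate on the diagonal. When they are unrelated, each color is scattered over several ranges, which corresponds to the visible spread described above.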

I probably need to put a bit more thought into this, and that should include a good name. It is a technique rather than a design, and it can be used for KDE dot plots, density graphs and histograms. It tries to provide a visual answer to the question “how much do two distributions differ?” That goes beyond overall shape similarity; it goes into contents. That might be a slightly different question than “how correlated are two variables?”…

PROCESS FOR CREATION:

I am using no existing package, just a bit of programming in C/C++, but any scientific language (R, Python) will work just fine.