Understanding Box-and-Whisker Plot

Akshada Gaonkar
Published in The Startup
4 min read · Feb 19, 2021

Statistics, grounded in mathematics, helps us gain strong insights into the structure of data and arrive at concrete solutions instead of guesstimates. This makes statistics a vital part of Data Science.

Let us understand what the basic statistical terms mean.

  • Mean: the average we’ve been calculating since school, found by dividing the sum of all the values by the number of values.
  • Median: the number that lies exactly in the center when all values are arranged in ascending order (with an even number of values, the average of the two middle values).
  • Standard deviation & Variance: measures of how far the data points typically deviate from the mean. Variance is the average squared deviation from the mean, and standard deviation is its square root.
  • Range: the difference between the largest and the smallest value in the data.
  • Outliers: values that differ extremely from the majority of the data. They lie an abnormal distance away from the other data points.
  • Skewness: measures the asymmetry of data. Data is said to be skewed if its distribution is asymmetrical.
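
To make these terms concrete, here is a minimal sketch that computes them with Python's standard `statistics` module. The `scores` list is made up purely for illustration, and comparing mean with median is only a rough hint of skewness, not a formal measure.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [35, 60, 62, 65, 70, 71, 75, 78, 80, 84]

mean = statistics.mean(scores)            # sum of values / number of values
median = statistics.median(scores)        # middle of the sorted values
variance = statistics.pvariance(scores)   # average squared deviation from the mean
std_dev = statistics.pstdev(scores)       # square root of the variance
value_range = max(scores) - min(scores)   # largest value minus smallest value

print(f"mean={mean}, median={median}, range={value_range}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")

# Rough skewness hint: a mean below the median suggests a longer lower tail.
print("mean < median:", mean < median)
```

The single low score of 35 pulls the mean below the median, which is the kind of asymmetry the skewness bullet describes.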

Now that we know what we’re looking for in the data, let’s move forward and understand what a box plot is and how it helps.

A box plot, or box-and-whisker plot, is one of the most widely used ways to visualize a statistical summary of data.

Source: https://www.simplypsychology.org/boxplot.jpg

It shows us a 5-number summary - minimum, first quartile, median, third quartile and maximum.

  1. Minimum: the lowest value excluding the outliers.
  2. First quartile or Lower quartile (Q1): 25% of all the values in the data lie below this value.
  3. Median: represented by a line in the box. It is the value exactly at the centre of the data, i.e. the value below which 50% of the data points lie.
  4. Third quartile or Upper quartile (Q3): value below which lie 75% of all the data points.
  5. Maximum: the highest value excluding the outliers.
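
As a rough sketch, the five numbers can be computed with Python's standard library. The `scores` list and the helper name `five_number_summary` are hypothetical, and the quartiles use the `"inclusive"` convention of `statistics.quantiles`; other conventions give slightly different Q1 and Q3. Note that this sketch takes the minimum and maximum over all values; excluding outliers first requires a rule for what counts as an outlier, which is not applied here.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [10, 40, 55, 60, 62, 65, 70, 72, 75, 80, 100]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the raw data."""
    ordered = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

minimum, q1, median, q3, maximum = five_number_summary(scores)
print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
```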

The upper and lower quartile values give us the Inter-quartile Range (IQR = Q3 - Q1). The box spans the IQR and contains the middle 50% of the data points.
The whiskers (the tails extending from the box) and the outliers account for the remaining 50%. A box plot also shows whether our data contains outliers: they are the points that lie beyond the whiskers.
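
The article does not say how far a point must be from the box to count as an outlier. A common convention, assumed here, is the 1.5 × IQR rule: anything below Q1 - 1.5·IQR or above Q3 + 1.5·IQR is flagged. The sketch below applies that rule to made-up scores; the function name `iqr_outliers` and the parameter `k` are hypothetical.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [10, 40, 55, 60, 62, 65, 70, 72, 75, 80, 100]

def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using fences at k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                     # width of the box
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inliers = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return inliers, outliers

inliers, outliers = iqr_outliers(scores)
print("outliers:", outliers)
# Under this rule, each whisker ends at the most extreme non-outlier value.
print("whiskers end at:", min(inliers), "and", max(inliers))
```

With this rule the "minimum" and "maximum" drawn on the plot are the whisker ends, not necessarily the smallest and largest raw values.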

What other questions does this plot answer? You obviously need some more insights, don’t you?

Let’s see!

Say we have Math scores of 4 groups of students with 20 students in each group. Have a look at the box plots depicting these scores.

Source: author

What can we interpret about the variation in data?

A smaller box means many values lie in a small range, i.e. most of the data points have similar values. The larger the box, the wider the range over which the values are spread, i.e. the data points vary significantly from each other.
Similarly, longer whiskers generally indicate a higher standard deviation and variance, and vice versa.

In our example, students of Group B have highly varied scores, from a minimum of around 10 to a maximum of 100, whereas scores of Group A students mainly vary between 40 and 100, with most students having scored between 60 and 80.

Is the data normally distributed or skewed? If the data is skewed then in which direction?

Source: https://www.simplypsychology.org/box-plots-distribution.jpg

If the distribution is normal, the mean will be nearly the same as the median. In this case, the box plot will look symmetric with whiskers on both sides equally long.

If most of the data points are large and a few are very small compared to the large values, the distribution is left-skewed (Median > Mean). In this case, the box plot looks as if it is shifted to the right, with a long left whisker and a short right whisker.

Lastly, if most of the data points are small and a few are very large compared to the smaller values, the distribution is right-skewed (Mean > Median). This makes the box plot look as if it is shifted to the left, i.e. it’ll have a long right whisker and a short left whisker.
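
The mean-versus-median comparison can be sketched in a few lines, using the usual convention that a skew is named after its longer tail. Both samples and the helper name `describe_skew` are made up for illustration, and the comparison is a rule of thumb rather than a formal skewness test.

```python
import statistics

def describe_skew(values):
    """Return 'left', 'right' or 'symmetric' from the mean-median comparison."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean < median:
        return "left"    # a few very small values drag the mean down
    if mean > median:
        return "right"   # a few very large values pull the mean up
    return "symmetric"

# Hypothetical samples, invented for illustration only.
few_very_small = [5, 70, 75, 78, 80, 82, 85]
few_very_large = [15, 18, 20, 22, 25, 30, 95]

print(describe_skew(few_very_small))
print(describe_skew(few_very_large))
```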

Going back to our example, we can say that Group B has a score distribution that is left-skewed, Group C has a right-skewed distribution and Group D has a distribution that is almost normal.

Very long whiskers can behave much like outliers, so skewness should be handled with an appropriate transformation method.
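
The article does not name a specific transformation. One common choice for right-skewed, non-negative data is a log transform, assumed here purely as an example. The sketch below measures skewness as the mean of cubed z-scores (a standard moment-based definition) before and after applying `math.log1p`; the `raw` values and the helper name `sample_skewness` are hypothetical.

```python
import math
import statistics

def sample_skewness(values):
    """Moment coefficient of skewness: the mean of the cubed z-scores."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical right-skewed values, invented for illustration only.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

# log1p(x) = log(1 + x) compresses large values much more than small ones.
logged = [math.log1p(v) for v in raw]

print("skewness before:", round(sample_skewness(raw), 2))
print("skewness after: ", round(sample_skewness(logged), 2))
```

The transformed data is still right-skewed in this example, only less so; whether a log, square-root or other transform is appropriate depends on the data.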

Akshada Gaonkar
The Startup

Intern at SAS • MTech Student at NMIMS • Data Science Enthusiast!