What do we mean by charts and graphs? We are going to use these terms to broadly refer to graphical representations of data. You are already familiar with many types of charts and graphs such as the food pyramid, a chart used to represent how much of each food group you should eat, or the battery symbol on a screen, a chart that represents your device's battery level. In this lesson, we are going to cover some commonly used graphs and charts. We will cover histograms, bar graphs, pie charts, scatterplots, line charts, and box plots. Being able to analyze each one correctly will allow you to draw meaningful conclusions from sets of data. It’s also important to keep in mind that a single data set can usually be represented using more than one type of chart/graph.
Histograms are used to represent the distribution of a data set over a continuous domain. The domain is broken down into intervals, commonly referred to as bins. These bins each make up some range of the variable being measured. This variable could be a time, weight, price, etc. The applications of histograms are practically endless and their uses range from scientific research to everyday business and daily activity. They are a clear way of representing the distribution of a data set.
Now let’s look at an example.
This histogram is being used to represent the distribution of minutes that the last 66 cars spent waiting at a railroad crossing. Imagine the data set as a list made up of the specific time each car spent waiting at the crossing.
This histogram has the number of cars as the y-axis unit and minutes spent waiting as the x-axis unit. Each bin indicates how many cars waited for a length of time in that range. Therefore, the values of all the bins should add up to the total of 66 cars. You can think of the edges of each bin as representing the outer limits of the bin’s range.
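To make the idea of bins concrete, here is a minimal Python sketch that sorts a list of waiting times into one-minute bins. The waiting times and the helper name bin_counts are made up for illustration; they are not the 66-car data set from the lesson.

```python
import math
from collections import Counter

# Hypothetical waiting times in minutes (NOT the lesson's 66-car data set).
waiting_times = [2.5, 3.1, 3.8, 4.2, 4.9, 5.0, 5.5, 6.7, 6.9, 7.3, 8.1, 8.4]


def bin_counts(values, bin_width=1.0):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter()
    for value in values:
        # The left edge of the bin that contains this value.
        start = math.floor(value / bin_width) * bin_width
        counts[start] += 1
    return dict(sorted(counts.items()))


counts = bin_counts(waiting_times)
for start, n in counts.items():
    print(f"{start:g}-{start + 1:g} min: {n} cars")

# The bin heights always add up to the number of data points.
assert sum(counts.values()) == len(waiting_times)
```

Each printed line corresponds to one bar of the histogram: the bin's range on the x-axis and its height on the y-axis.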
This same histogram could also be displayed as follows:
Notice that the same data are represented on this histogram, but the way the bins are labeled is slightly different. Rather than listing the intervals for each bin, this histogram only lists the endpoints of the bins.
Now let’s go over some basic analysis of this histogram. As with any data set, we can use the histogram to help us find the mode, median, and mean.
So, what would be the mode of this histogram? The mode is whichever value has the greatest frequency; in other words, which value shows up the most. In this case, the mode is the bin making up the 8-9 minute interval because the greatest number of cars (14) waited for 8 to 9 minutes.
How about the median of this histogram? Remember that the median is the middle value of a data set. Think about listing out all the times for the 66 cars. The median would be whatever value is exactly in the middle.
Notice how there are 33 cars (half the total 66) on either side of the dashed line, which is being used to represent the median waiting time. This means the median waiting time would be 7 minutes.
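The reasoning for the mode and the median can be sketched in Python by working from the bin counts alone. The counts below are made up for illustration (20 cars, not the lesson's 66), and median_location is a hypothetical helper name. The sketch assumes bins of equal width.

```python
# Hypothetical histogram: left edge of each 1-minute bin -> number of cars.
bin_counts = {3: 2, 4: 3, 5: 5, 6: 4, 7: 6}
bin_width = 1

# Mode: the bin with the greatest frequency.
mode_bin = max(bin_counts, key=bin_counts.get)
print(f"Mode bin: {mode_bin}-{mode_bin + bin_width} minutes")


def median_location(counts, width):
    """Return (low, high), the range that must contain the median."""
    total = sum(counts.values())
    running = 0
    for left in sorted(counts):
        running += counts[left]
        if running * 2 == total:
            # Exactly half of the data lies on each side of this bin's right
            # edge, so that edge is a reasonable estimate of the median.
            edge = left + width
            return (edge, edge)
        if running * 2 > total:
            # The halfway point falls somewhere inside this bin.
            return (left, left + width)


print("Median lies in:", median_location(bin_counts, bin_width))
```

With these made-up counts, 10 of the 20 cars fall below the 6-minute edge and 10 fall above it, which mirrors the way the dashed line splits the lesson's histogram into two halves of 33 cars.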
How about the mean, or average, of this histogram? This would be the average time spent waiting at the crossing. With a histogram, the mean can be estimated by multiplying the middle value of each bin by that bin’s frequency, adding together the calculated values of all the bins, and then dividing by the total number of data points. You can see how this can be done on the histogram below.
Notice how in red, the middle value of each bin’s interval is multiplied by each bin’s respective height. Once this is done for all bins, we add the values together and then divide by the total number of cars. In this case, we have 66 total cars. The following represents this complete calculation for the mean.
The answer comes out to an average time of 6.54 minutes waiting at the railroad crossing.
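The same midpoint calculation can be sketched in Python. The bin counts below are hypothetical (20 cars rather than the lesson's 66), so the result differs from the 6.54 minutes found above; grouped_mean is an illustrative name, not a library function.

```python
# Hypothetical histogram: left edge of each 1-minute bin -> number of cars.
bin_counts = {3: 2, 4: 3, 5: 5, 6: 4, 7: 6}
bin_width = 1


def grouped_mean(counts, width):
    """Estimate the mean as sum(midpoint * frequency) / total frequency."""
    total = sum(counts.values())
    weighted_sum = sum((left + width / 2) * n for left, n in counts.items())
    return weighted_sum / total


# (3.5*2 + 4.5*3 + 5.5*5 + 6.5*4 + 7.5*6) / 20 = 119 / 20
print(f"Estimated mean: {grouped_mean(bin_counts, bin_width):.2f} minutes")
```

Because every car in a bin is treated as if it waited exactly the midpoint time, the result is an estimate of the mean rather than the exact average of the original list.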
As you now know, histograms are used for grouping data into intervals, or “bins”. The height of each bin can be used to determine that bin's frequency out of the whole data set. It is important to remember that histograms are just an approximate representation and that they are good for looking at trends and distribution for a data set, but they are not good at giving specific details on individual pieces of data from the set.
Bar graphs are a great way to break down a series of data into different categories with each category being proportionally represented by a bar. This means that the height of each bar represents the value of that category, with the units indicated on the opposite axis. Bar graphs can also be a great way to compare multiple series of data paired into the same unique categories.
The following bar graph shows the breakdown of a person’s monthly expenses grouped into 8 different categories: Food, Transportation, Housing, Entertainment, Pets, Phone/Internet, Shopping, and Saving. The amount of spending in each category is represented in dollars by the y-axis.
What can we determine from this bar graph?
We can also make a bar graph showing a monthly budget breakdown for multiple people with the same categories. This allows us to compare two different data sets using different colored bars.
Now what can we determine from this graph?
As you can see, bar graphs are useful for comparing different categories within one data set, as well as for comparing multiple data sets across the same categories.
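The kinds of comparisons a bar graph supports can also be written out as a short Python sketch. The category names come from the example above, but the dollar amounts and the labels person_a and person_b are invented for illustration.

```python
categories = ["Food", "Transportation", "Housing", "Entertainment",
              "Pets", "Phone/Internet", "Shopping", "Saving"]

# Hypothetical monthly spending in dollars (not read from the lesson's graphs).
person_a = [400, 150, 1200, 100, 50, 90, 120, 300]
person_b = [350, 200, 1000, 180, 0, 110, 200, 250]

# Tallest bar for one person: the category with the largest value.
largest_a = categories[person_a.index(max(person_a))]
print("Person A spends the most on:", largest_a)

# Adding up all the bars gives the total monthly budget.
total_a = sum(person_a)
total_b = sum(person_b)
print("Totals:", total_a, total_b)

# Comparing paired bars: positive means person A spends more.
differences = {c: a - b for c, a, b in zip(categories, person_a, person_b)}
biggest_gap = max(differences, key=lambda c: abs(differences[c]))
print("Largest difference is in:", biggest_gap, differences[biggest_gap])
```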
Pie charts are one of the most basic and common ways to display the proportion of each category relative to the whole. This is done by making a circle, or “pie,” and dividing it into slices that proportionally represent each category’s contribution to the whole. The slices of a pie chart will often be labeled with the percentage of the pie each slice comprises. These charts are an intuitive way to represent the distribution of a data set into distinct groups that are easy to understand.
Let’s look at an example.
A class of students is given a survey about how many siblings they have. Their responses are represented on the following pie chart.
The pie chart shows us that 21% of the students have no siblings, 28% of the students have one sibling, 24% of the students have two siblings, 18% of the students have three siblings, and 9% of the students have four or more siblings. Here are some other things you might be asked about this pie chart:
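For instance, you might be asked what share of the class has at least two siblings, or how wide each slice is. A minimal Python sketch using the percentages stated above (the variable names are illustrative):

```python
# Percentages from the sibling survey, keyed by number of siblings.
percentages = {"0": 21, "1": 28, "2": 24, "3": 18, "4+": 9}

# The slices of a pie chart must account for the whole pie.
assert sum(percentages.values()) == 100

# Combining slices: students with two or more siblings.
at_least_two = percentages["2"] + percentages["3"] + percentages["4+"]
print(f"At least two siblings: {at_least_two}%")

# Each slice's central angle is its share of the full 360 degrees.
angles = {k: p * 360 / 100 for k, p in percentages.items()}
for siblings, angle in angles.items():
    print(f"{siblings} siblings: {angle:.1f} degrees")
```

Since the slices with two, three, and four or more siblings add up to 51%, slightly more than half of the class has at least two siblings.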
A scatterplot is a group of points plotted on a cartesian plane representing two different variables. You can think of this as points placed on a plane consisting of an x-axis and y-axis, with each axis labeled as one of the variables. The location of each point is defined by its coordinate pair (x, y). These plots are commonly used to analyze if there is some sort of relationship, or correlation, between the two variables being measured.
Consider the following example.
The heights and weights of 200 adults are recorded. Each person’s height and weight are represented by a point on the scatterplot where the x-coordinate represents the height value and the y-coordinate represents the weight.
What can we learn about this data set from analyzing the scatterplot?
You will often see a squared R value, R², presented with a scatterplot. R² can be thought of as a measurement of the strength of the relationship between the two variables being measured on a scatterplot.
An R² value of 1 defines a perfect relationship, meaning one variable can perfectly predict the value of the other variable. In other words, there is a direct linear relationship between the two variables. The example below shows a trend line included along with the data set, showing us that all the points lie along the line.
You would expect an R² of 1 when relating feet to inches because 1 foot always equals 12 inches. This means if you know the length of something in inches, you also know its length in feet.
An R² value of 0 defines no relationship between the two variables. This means knowing the value of one variable gives no information on the value of the other variable. For example, consider a scatterplot showing the relationship between the number of letters in someone's name and how high they can jump. This would likely result in an R² value near 0 because you do not expect these to be related to each other. Below is an example of a plot with an R² value near 0.
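The two extremes can be checked numerically. The sketch below assumes R² is computed as the square of the Pearson correlation coefficient, which is what it equals for a straight-line fit. The function name r_squared and all of the data values are made up for illustration; the name-length data were chosen so that the result is exactly 0, whereas real data would usually give a small nonzero value.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Perfect relationship: 1 foot always equals 12 inches.
feet = [1, 2, 3, 4, 5]
inches = [12 * f for f in feet]
print("feet vs. inches:", r_squared(feet, inches))

# No relationship (hypothetical numbers): name length vs. jump height in inches.
name_lengths = [3, 4, 5, 6, 7]
jump_heights = [20, 19, 22, 19, 20]
print("name length vs. jump height:", r_squared(name_lengths, jump_heights))
```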
A line chart, or line graph, consists of a series of data points that are connected by a line. It is similar to a scatterplot, but while a scatterplot maps the relationship between two values and may contain a large number of points, a line chart consists of relatively few ordered measurements that are joined by straight line segments (rather than a line of best fit). Line graphs are frequently used to show trends over time.
Now let’s look at an example.
This line chart represents the number of sunglasses sold (in thousands) in each month of the year.
A box plot, also known as a box-and-whisker plot, is used to show the distribution of data. It is a useful visual tool because it allows us to easily see the median and range of a set of data, as well as its different quartiles. Quartile is a statistical word that means quarter - data can be divided into four quartiles, where the first quartile represents the lowest 25% of data points and the fourth quartile represents the highest 25% of data points. The median is in between the second and third quartile since 50% of the data fall below this point and 50% fall above it.
Let’s look at a simple example.
Above is a simple box-and-whisker plot of a symmetric distribution, meaning that the data are symmetrical about the median. We can see that the line in the middle represents the median - in this box-and-whisker plot, the median is 15. The bottom and top “whiskers” represent the minimum and maximum data points, respectively.
The box-and-whisker plot can be divided into four quartiles, as shown above. The first quartile (Q1) is represented by the area between the bottom whisker and the bottom of the box. The second quartile (Q2) contains the data points that fall in the area between the bottom of the box and the median. The third quartile (Q3) contains points between the median and the top of the box. The fourth quartile (Q4) contains points that fall between the top of the box and the top whisker. For example, the data point 16 would fall within the third quartile.
We can also easily determine the range of a box-and-whisker plot by subtracting the minimum value from the maximum value. The range of this set of data would be 18 − 12 = 6.
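The values a box plot displays can be computed directly from a list of numbers. The data below are hypothetical, chosen only so that the minimum (12), median (15), and maximum (18) match the example; the box edges they produce are not taken from the lesson's figure. The sketch uses Python's standard statistics module, and note that different tools use slightly different conventions for where the box edges fall.

```python
import statistics

# Hypothetical data, symmetric about a median of 15.
data = [12, 13, 14, 14, 15, 15, 15, 16, 16, 17, 18]

minimum = min(data)   # end of the bottom whisker
maximum = max(data)   # end of the top whisker

# quantiles(..., n=4) returns the three cut points that split the data
# into four quarters: bottom of the box, median, top of the box.
box_bottom, median, box_top = statistics.quantiles(data, n=4)

# Range: subtract the minimum from the maximum.
data_range = maximum - minimum

print("Minimum:", minimum)
print("Bottom of box:", box_bottom)
print("Median:", median)
print("Top of box:", box_top)
print("Maximum:", maximum)
print("Range:", data_range)
```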
Now let’s look at a more complicated example.
The box plot below shows a class’ scores on 3 different tests.
You may notice that these boxes aren’t as symmetrical as in our first example. The shape or symmetry of the box can tell us about the distribution and variance of the data.
Let's think about the kinds of questions we can answer using a box-and-whisker plot: