Lift

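Lift(B, C) = c(B->C)/s(C) = s(B u C)/(s(B) x s(C))

Here s is support and c is confidence. A lift above 1 suggests positive correlation, a lift below 1 suggests negative correlation, and a lift of exactly 1 suggests independence.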
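As a rough sketch of the formula above, here is how lift could be computed from raw counts in Python. The function names and the example counts are my own and purely illustrative; they are not taken from the questions below.

```python
def lift(n_both: int, n_b: int, n_c: int, n_total: int) -> float:
    """Lift(B, C) = s(B u C) / (s(B) x s(C)), with each support = count / n_total."""
    s_both = n_both / n_total
    s_b = n_b / n_total
    s_c = n_c / n_total
    return s_both / (s_b * s_c)


def interpret(value: float, tol: float = 1e-9) -> str:
    """Label a lift value; tol absorbs floating point rounding around 1."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positive" if value > 1.0 else "negative"


# Hypothetical counts: 100 transactions, 40 contain B, 50 contain C, 30 contain both.
example = lift(n_both=30, n_b=40, n_c=50, n_total=100)
print(round(example, 2), interpret(example))
```

Note that every support is divided by the same grand total, i.e. the bottom-right cell of the table.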

Q1: Lift Analysis

Please calculate the following lift values for the table correlating burger and chips below:

  • Lift(Burgers, Chips) = (600/1400) / ((1000/1400) x (800/1400)) = 0.4286 / 0.4082 = 1.05 Positive (weak)
  • Lift(Burgers, ^Chips) = (400/1400) / ((1000/1400) x (600/1400)) = 0.2857 / 0.3061 = 0.93 Negative
  • Lift(^Burgers, Chips) = (200/1400) / ((400/1400) x (800/1400)) = 0.1429 / 0.1633 = 0.875 Negative
  • Lift(^Burgers, ^Chips) = (200/1400) / ((400/1400) x (600/1400)) = 0.1429 / 0.1224 = 1.17 Positive

Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

               Chips   ^Chips   Total Row
Burgers          600      400        1000
^Burgers         200      200         400
Total Column     800      600        1400

 

Q2:

Please calculate the following lift values for the table correlating shampoo and ketchup below:

  • Lift(Ketchup, Shampoo) = (100/900) / ((300/900) x (300/900)) = 0.1111 / 0.1111 = 1 Independent
  • Lift(Ketchup, ^Shampoo) = (200/900) / ((300/900) x (600/900)) = 0.2222 / 0.2222 = 1 Independent
  • Lift(^Ketchup, Shampoo) = (200/900) / ((600/900) x (300/900)) = 0.2222 / 0.2222 = 1 Independent
  • Lift(^Ketchup, ^Shampoo) = (400/900) / ((600/900) x (600/900)) = 0.4444 / 0.4444 = 1 Independent

Please also indicate if each of your answers would suggest independent, positive correlation, or negative correlation?

 

               Shampoo   ^Shampoo   Total Row
Ketchup            100        200         300
^Ketchup           200        400         600
Total Column       300        600         900

 

Q3: Chi Squared Analysis

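Please calculate the following chi squared values for the table correlating burgers and chips below (expected values in brackets).

Chi squared = (900-800)^2/800 + (100-200)^2/200 + (300-400)^2/400 + (200-100)^2/100 = 12.5 + 50 + 25 + 100 = 187.5

  • Burgers & Chips: observed 900 > expected 800, Positive
  • Burgers & Not Chips: observed 100 < expected 200, Negative
  • Chips & Not Burgers: observed 300 < expected 400, Negative
  • Not Burgers and Not Chips: observed 200 > expected 100, Positive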

For the above options, please also indicate whether each answer suggests independence, positive correlation, or negative correlation.

               Chips        ^Chips      Total Row
Burgers        900 (800)    100 (200)        1000
^Burgers       300 (400)    200 (100)         500
Total Column   1200         300              1500

Q4: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burgers and sausages below (expected values in brackets). Chi squared = 0, because every observed count equals its expected count.

  • Burgers & Sausages (independent)
  • Burgers & Not Sausages (independent)
  • Sausages & Not Burgers (independent)
  • Not Burgers and Not Sausages (independent)

For the above options, please also indicate whether each answer suggests independence, positive correlation, or negative correlation.

               Sausages     ^Sausages   Total Row
Burgers        800 (800)    200 (200)        1000
^Burgers       400 (400)    100 (100)         500
Total Column   1200         300              1500
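For reference, here is a minimal Python sketch of the chi squared calculation. It sums (observed - expected)^2 / expected over every cell, where each expected count is row total x column total / grand total. The function name is my own; the counts are the observed values from the Q4 table.

```python
def chi_squared(observed):
    """Pearson chi squared for a table of observed counts (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column events were independent.
            expected = row_totals[i] * col_totals[j] / n
            total += (obs - expected) ** 2 / expected
    return total


# Q4 table: rows are Burgers, ^Burgers; columns are Sausages, ^Sausages.
q4 = [[800, 200], [400, 100]]
print(chi_squared(q4))
```

A cell whose observed count is above its expected count points to positive correlation for that pair, and one below it to negative correlation.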

 

Q5:

Under what conditions would Lift and Chi Squared analysis prove to be a poor algorithm to evaluate correlation/dependency between two events? When there is a large number of null transactions (transactions that contain neither item), since neither measure is null-invariant.

Please suggest another algorithm that could be used to rectify the flaw in Lift and Chi Squared? The Kulczynski measure, which is null-invariant.
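To illustrate the point about null transactions, here is a small Python sketch comparing lift with the Kulczynski measure, Kulc(A, B) = 0.5 x (P(A|B) + P(B|A)). The counts are hypothetical and the function names are my own.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift depends on the total transaction count, so null transactions change it."""
    return (n_both / n_total) / ((n_a / n_total) * (n_b / n_total))


def kulczynski(n_both, n_a, n_b):
    """Kulc(A, B) = 0.5 * (P(A|B) + P(B|A)); the total count never enters."""
    return 0.5 * (n_both / n_a + n_both / n_b)


# Hypothetical counts: 1,000 transactions contain A, 1,000 contain B, 100 contain both.
n_both, n_a, n_b = 100, 1000, 1000
for n_null in (100, 100_000):
    # Total = transactions with A or B, plus those with neither (null).
    n_total = n_a + n_b - n_both + n_null
    print(n_null,
          round(lift(n_both, n_a, n_b, n_total), 2),
          kulczynski(n_both, n_a, n_b))
```

With few null transactions the lift comes out well below 1, and with many it comes out well above 1, even though the relationship between A and B has not changed. The Kulczynski value is the same in both cases.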

 

History of Data Visualisation

Figure 1: Minard’s Map

Although traces of data visualisation techniques can be observed as far back as the 2nd century, René Descartes is credited with laying the foundation of modern data visualisation by developing a two-dimensional coordinate system for displaying values. By the end of the 17th century this had developed into representing quantitative data.

Figure 1 shows Charles Joseph Minard’s chart of Napoleon’s retreat from Russia in 1812. This is an early example of how data visualisation can bring dry statistical information to life. (An English translation of the original graphic is shown at the end of the post.)

“The chart tells the dreadful story with painful clarity: in 1812, the Grand Army set out from Poland with a force of 422,000; only 100,000 reached Moscow; and only 10,000 returned. The detail and understatement with which such horrifying loss is represented combine to bring a lump to the throat. As men tried, and mostly failed, to cross the Bérézina river under heavy attack, the width of the black line halves: another 20,000 or so gone. The French now use the expression “C’est la Bérézina” to describe a total disaster.”

This extract is taken from the book “The Visual Display of Quantitative Information” (1983). In the same book, Edward Tufte states:

“Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:

  • show the data
  • induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else
  • avoid distorting what the data has to say
  • present many numbers in a small space
  • make large data sets coherent
  • encourage the eye to compare different pieces of data
  • reveal the data at several levels of detail, from a broad overview to the fine structure
  • serve a reasonably clear purpose: description, exploration, tabulation or decoration
  • be closely integrated with the statistical and verbal descriptions of a data set”

Visualisations are not intended to analyse the data themselves, but to present the data in ways that allow for concise analysis.

Today in the age of big data the possibilities of data visualisation are at a whole new level. Open source visualisation tools are available to everyone and the capabilities are rapidly advancing.

Minard’s Map in English

Try R & Titanic Database

Capture

Using R, I created some graphs from data on the survival of passengers on the Titanic. I followed the steps set out by a user on the website Kaggle, who had used the data to build a machine learning model that predicts the likelihood of survival based on 11 different inputs.

The various library packages needed to produce these data visualisations were installed using the “install.packages” function. Three of these (ggplot2, ggthemes, and scales) were for the purpose of data visualisation.

The comma-separated value (CSV) files were read into variables named “train” and “test”, which were then bound together into another variable named “full”.

The first chart we created using this data in R was with regard to whether or not “Family Size” had an impact on chances of survival. This was visualised using ggplot2 and achieved by creating a family size variable and a family variable. Family size was then assigned to the X axis and a survival count was added to the Y axis.
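The counting behind this kind of chart can be sketched as follows. This is written in Python for illustration, not the R/ggplot2 used for the actual chart. The rows are made up, and I am assuming that family size is the sum of the Kaggle columns SibSp (siblings/spouses aboard) and Parch (parents/children aboard) plus one for the passenger.

```python
from collections import Counter

# Hypothetical rows; the column names are assumed from the Kaggle Titanic data.
passengers = [
    {"SibSp": 1, "Parch": 0, "Survived": 0},
    {"SibSp": 1, "Parch": 0, "Survived": 1},
    {"SibSp": 0, "Parch": 0, "Survived": 1},
    {"SibSp": 3, "Parch": 1, "Survived": 0},
]

counts = Counter()
for p in passengers:
    # Assumption: family size = siblings/spouses + parents/children + the passenger.
    family_size = p["SibSp"] + p["Parch"] + 1
    counts[(family_size, p["Survived"])] += 1

# One count per (family size, survived) pair: the bars of the chart.
for (family_size, survived), n in sorted(counts.items()):
    print(family_size, survived, n)
```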

Rplot02

Once we had established that there was a possible correlation between family size and survival rate, we then reconfigured the data into a mosaic plot.

Rplot01

Once we had completed this, we went about estimating the values of some missing data. In this case we wished to discover where two of the passengers had embarked from. The best way to discern this was to check how much the passengers had paid for their tickets, cross-reference that with their class, and compare it to the mean price for equivalent tickets at each port of embarkation. This ggplot graph included boxplots and a mean line, and showed us there was a good chance they had left from Cherbourg. (I forgot to export this graph but the equivalent is shown below.)

Capture22

The next graph helps us find the median value for a passenger of a certain class who left Southampton. We then used a function to replace the missing value in this instance with the median values for similar cases.
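The idea of replacing a missing value with the median of similar cases can be sketched like this. Again, this is Python for illustration, not the R function actually used. The records, field names, and numbers are hypothetical.

```python
from statistics import median

# Hypothetical records; "value" stands in for whichever field is missing.
rows = [
    {"pclass": 3, "embarked": "S", "value": 8.05},
    {"pclass": 3, "embarked": "S", "value": 7.25},
    {"pclass": 3, "embarked": "S", "value": 9.50},
    {"pclass": 3, "embarked": "S", "value": None},  # the missing value
    {"pclass": 1, "embarked": "C", "value": 80.00},
]


def impute_median(rows, pclass, embarked):
    """Fill missing values in one (class, port) group with that group's median."""
    group = [r for r in rows if r["pclass"] == pclass and r["embarked"] == embarked]
    known = [r["value"] for r in group if r["value"] is not None]
    fill = median(known)
    for r in group:
        if r["value"] is None:
            r["value"] = fill
    return fill


print(impute_median(rows, pclass=3, embarked="S"))
```

Only rows in the matching group are touched, so the first-class Cherbourg record is left alone.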

Rplot03

We then created a predictive model to fill in the relatively large number of missing “Age” values. The model used was multiple imputation using chained equations (MICE). This model predicted people’s ages, and its validity was checked by plotting the original and MICE outputs and checking whether they were relatively similar.

Rplot04

The final graph showed us the disparity between men and women in the survival rates, as well as the importance of age as a variable. A “geom_histogram” was created to illustrate this using ggplot.

Rplot05

The chart below is one I made to show the survival numbers for each port. The unmarked category is where data was missing and not yet filled in, as the chart was made before the data cleaning described previously had taken place.

Rplot

The data set lets us predict our likelihood of survival on the Titanic based on 11 criteria. The information shows us that women, children, and higher social classes were the least likely to die, whereas single men were the most likely.

If I had more time it would be interesting to look at some demographic information that may not be directly related to survival rates, such as the relationship between family size and social class, or to delve further into the socio-demographic differences between points of embarkation.

(Fact: It cost more money to make the movie Titanic than the actual ship. Movie’s cost: $200 million. Ship’s cost: $7.5 million ($150 million when adjusted for inflation).)

Heat Map of Irish Population

Heatmap
Irish Population

In order to create this map I used Google Fusion Tables, KML files, and the CSO website. I first opened Google Fusion Tables, which prompted me to add either a file or a URL. I chose to paste in the URL of the CSO web page that held the population data for each county, as well as for some additional constituencies within cities and certain counties. I was then prompted to choose between a couple of options regarding what to do with the URL, and I chose to open it in Fusion Tables.

Once these population figures and county names were successfully opened in a fusion table, I set about importing the file needed to map out the county boundaries. This KML/KMZ file was downloaded and then uploaded into Google Fusion Tables. After uploading, I chose to merge this file with the county population table using the merge option in the top left of the Fusion Tables toolbar, merging on the “county names” columns.

On first inspection the map was not 100% complete, which suggested that some data cleaning was in order. When I checked the county name column I noticed that some of the names were either given in their Irish spelling or divided into the north and south of the county, as was the case with Laois and Tipperary respectively.

Now that these tables were successfully merged it was time to create the heat map. This was achieved by selecting the map features option bar which allowed me to set each county as a bucket and fill it with gradients of colour which would represent population density.

The map was now complete, and after adding the key to the side of the map in order to display the value of the various colours, I now had a visual tool which could be used to discern population based on county.

This kind of map can be very useful for planners when deciding which areas need investment in infrastructure, where housing might be needed, and for locating the main urban centres.

The heat map utilises the data which has been input in order to present the end user with information regarding population density. This can be easily adapted to present different kinds of related information such as intensity of demographic make up.

The map could also be used to show the change in population of counties, religious affiliation, languages spoken, voting intentions, etc. Almost any data that can be categorised by county can be presented by means of a heat map, particularly data relating to demographics, population, and even elections.

I also attempted to use Google Fusion Tables to illustrate the four colour theorem, which states that “given any separation of a plane into contiguous regions, producing a figure called a map, no more than four colors are required to color the regions of the map so that no two adjacent regions have the same color” (wiki). However, I underestimated the time it would take to create and did not have a chance to complete it.
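To make the theorem a little more concrete, here is a minimal Python sketch that assigns each region a number from 1 to 4 so that no two neighbouring regions share a number. It uses simple backtracking. The region labels and borders are made up for illustration and are not real county data.

```python
def four_colour(adjacency):
    """Return {region: colour 1-4} with no two neighbours sharing a colour, or None."""
    regions = list(adjacency)
    colours = {}

    def assign(i):
        if i == len(regions):
            return True  # every region has a colour
        region = regions[i]
        for colour in (1, 2, 3, 4):
            # A colour is allowed if no already-coloured neighbour uses it.
            if all(colours.get(nb) != colour for nb in adjacency[region]):
                colours[region] = colour
                if assign(i + 1):
                    return True
                del colours[region]  # undo and try the next colour
        return False

    return colours if assign(0) else None


# Hypothetical map: labels and borders are invented, not real counties.
borders = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C", "E"],
    "E": ["D"],
}
colouring = four_colour(borders)
print(colouring)
```

The resulting numbers are the sort of values that could be typed into a column next to each region name.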

It is possible, however. It requires creating an Excel table of all the counties, assigning each a number from 1 to 4 based on its position on the map, and then selecting 1-4 as the gradients in the bucket selection of the formatting option. An example is shown below: