As I work to enter back into a doctoral program with UNISA, I have realized that I haven’t yet had a quick cohesive explanation of why it is so important to me to do the research I am doing. So here is that attempt to explain why I believe what I’m working on will make a significant contribution to the field of education, and beyond that, why it could be something that truly changes the world, and also how others can get involved. Who knows, maybe some day this will become a real TED Talk 🙂

What if hidden in plain sight there is an answer to a world problem, like how to reduce world hunger, increase the economy of developing nations, or lower disease and war? Wouldn’t this be worthwhile searching for? That is what I’m working to do, through the use of data science.

As humanity, we have collected tons of information about each country in the world. From how much a Big Mac costs in each country (the Economist’s Big Mac Index) to how many people die of heart disease (collected by the World Health Organization), we know a lot. And most of this information is available on the Internet. But do we know if these things affect each other, or cause each other?

Social science research, such as comparative education (which looks at educational differences between countries), has often looked at individual aspects of countries and tested whether there is a correlation between variables. But this has almost always been done based upon human hunches, which we like to call hypotheses. Yet relying only on hunches to find significant knowledge is limiting: first, humans can only process a certain amount of knowledge in their brains, and second, there could be variables that correlate in ways a person might never guess.

But what if we could trawl through nearly all the world’s knowledge to find hidden correlations? Well, I believe we can, and in fact, that is what my doctoral research is all about: building a huge database that is a compendium of countries, and then having software do data mining, in which it looks at all the information collected and spits out which ones have a strong correlation.
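To make the idea concrete, here is a minimal sketch of such a scan, assuming Pearson correlation as the measure and a tiny in-memory table. The indicator names, the values, and the 0.7 threshold are all made up for illustration and are not taken from the actual project.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def strongest_pairs(indicators, threshold=0.7):
    """Check every pair of indicators and keep those with |r| >= threshold."""
    results = []
    for (name_a, a), (name_b, b) in combinations(indicators.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            results.append((name_a, name_b, r))
    # Strongest relationships first, regardless of sign.
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical values for five imaginary countries (one list entry per country).
indicators = {
    "education_spend": [100, 200, 300, 400, 500],
    "health_spend": [110, 190, 320, 390, 510],
    "big_mac_price": [3.0, 5.0, 2.0, 4.5, 3.5],
}

pairs = strongest_pairs(indicators)
for name_a, name_b, r in pairs:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

With these invented numbers only the two spending indicators clear the threshold. Each list position stands for one country, so the countries are the data points of every comparison.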

Although, it should be clear that this process would only be the first step in finding answers to the world’s problems. Many correlations do not mean there is a direct causation between one and the other. For instance, my early research of mining data from the CIA World Factbook (yes, the CIA does some good things), showed a strong connection between the expenditure on education per capita and the expenditure on healthcare per capita, which had a correlation coefficient of nearly 80%. Or in other words, in nearly every country of the world, as spending on health care goes up for each person, spending on education also rises in a ratio that is very consistent. But does this mean one causes the other? Probably not. But maybe it means that humans value both of these in a consistent manner. And isn’t that alone, knowledge that is worthwhile knowing and discovering? We will not know the full extent of what might be valuable until we do the searching, and when the correlations are found, further research can start to ask why the correlations exist, and whether they can be used to help the world.

But I cannot do this alone. To try and gather all of this data on my own would limit the results of the work, because the more types of data that can be mined, the better chance we have of finding something of value. So I’m crowdsourcing my research by developing curriculum to help high school and college students from around the world become citizen scientists and contribute to this discovery process. And this will give more effect for the effort, because not only will students be able to contribute to something worthwhile, they will also be learning the fundamentals of data science, which is of growing importance in all fields of science.

So I invite you to join me in searching for the solution to the world problems, by collecting what the world already knows, and then seeing what can be found in this knowledge. You can go to www.CompendiumOfCountries.org to see the current state of the project.

What you described is a nebulous undertaking. My concern is that your project is so open ended, without boundaries, that you will quickly get lost without any direction to bring you back home. At some point you will need to zero in closer to specifics so that you can actually conclude something important. Your reasoning leaves the impression that your conclusions will come from wherever your analysis leads you, which is usually not specific enough for doctoral work.

On the other hand, if your thesis is in Computer Science, then your approach would be relevant if your end goal is to discover new algorithms to be used to discover patterns among data. As such, social and economic conclusions would take deference to the process, meaning that analytical process innovation would be your prime deliverable. That’s tough because whatever you come up with will be compared to all other statistical methods already available, and the question constantly coming up will be, why is your new method any better?

Correlation analysis is one method to use, but a better method that has wider acceptance in academic circles is multivariate linear regression analysis, where independent variables are not just examined to see how well they correlate with the dependent variable, but additionally are analyzed to see exactly what the linear relationship is, which is expressed as a regression coefficient for each independent variable. You will then need to compute the coefficient of determination (aka R-squared value) for each independent variable to determine whether it is relevant or not. An R-squared value is between 0 and 1, with 0 meaning the independent variable’s regression coefficient completely fails to explain the dependent variable, and 1 which means it explains it 100%. Quite powerful stuff.

You raise the issue of causation vs correlation. Well, everyone already knows correlation does not absolutely imply causation, but what you don’t seem to understand is that correlation can imply strong causation if it is first and foremost backed up theoretically. That is why empirical methods are almost always only applied after a strong theory has been established to explain a behavior or effect. First come up with a theory. Then prove it as being highly probable in explaining observed behavior by analyzing the empirical evidence, in a way that can be replicated. That’s the way it’s done professionally and if your doctorate is to be in the social sciences, that is what they most likely will want to see.

Social and economic and financial forecasting models are stochastic rather than deterministic. That means that causation can be found to a highly certain degree in many cases, but is subject to a measured probability of occurring, within a specified confidence interval. That’s why I am dissuading you from using correlation analysis and instead am steering you to something better, being linear regression.

In case you are wondering, linear regression can be applied to many nonlinear situations, such as those involving compound exponential growth, by simply regressing the natural logarithm of each data point instead of its primitive observed value.
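A small sketch of that log transform, assuming ordinary least squares on one variable and a made-up series that grows 5% per period. The function name and the numbers are illustrative only.

```python
from math import exp, log


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical series: starts at 100 and compounds at 5% per period.
periods = [0, 1, 2, 3, 4, 5]
values = [100 * 1.05 ** t for t in periods]

# Regress ln(value) on time: ln(v) = ln(start) + t * ln(1 + rate).
slope, intercept = fit_line(periods, [log(v) for v in values])

growth_rate = exp(slope) - 1   # recovers the compound rate
start_value = exp(intercept)   # recovers the starting level

print(f"growth rate ~ {growth_rate:.4f}, start value ~ {start_value:.2f}")
```

The fitted slope is the log of the growth factor, so exponentiating it gives back the per-period rate.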

I looked at the links you provided for data mining and other related activities, and was not impressed. Not much substance there for work at a doctoral thesis level. I’d up my game.

Data mining in the real world revolves around well established technologies including software from MicroStrategy, IBM’s Cognos, and to a limited extent Business Objects, among many others. The trend has been for large DBMS vendors such as Oracle and Microsoft to supply their own data mining tools which these days often go under the Business Intelligence (BI) nomenclature. Oracle has Oracle BI and MS has SSAS (SQL Server Analysis Services). Rather than reinventing the wheel, you could try one of these products, some of which are available for free for educational purposes. Of course, once you go commercial, you need to pay.

One statistical package standard that still is highly regarded is SAS (Statistical Analysis System) which can do ANY stat processing you will ever dream of and more. Not sure if they have a free version for educational purposes, but you can check.

Finally, the site you referenced described Data Artists and other nonsense. Those people don’t exist in the business world. Real world data artists are people who know either a BI product inside out including the graphical reporting functions built inside, or know a powerful modern GUI based reporting language like MS SSRS (SQL Server Reporting Services) or Crystal Reports. My preference is for SSRS (I’m an expert at it but that has little to do with my recommendation) since it has better database features, and these days everything is database driven.

Good luck. Don’t try to conquer the world. Rather, come up with a thesis that addresses an important issue and shows you are completely capable of applying professionally accepted tools and techniques to get the job done.

I appreciate your comments. While your comments were made with a correct understanding of the broader business and data science context, I don’t believe they are accurate within my context. First, the “TED Talk” outline of this post is meant for a general audience. If you read my initial research proposal, you will see the things that address the real nuts and bolts of the statistical analysis that I plan to use. (And that proposal is changing because my initial thought about using Excel would not work for the scale that I want to do.)

Regarding your arguments about regression, as can be seen in my full research proposal, that is what I’m going to do, although I’m not planning to do multivariate regression up front (but want to have the data set up to be able to do this in the future). Regression is a way of finding correlation if the data is the right type. But given that some data may only be rankings, etc., the proper form of statistical analysis must be used for each type, which I plan to do. (Again, this is in my full initial proposal.) Also, from my experience, there are often correlations that are not linear, and so my research will also check for many forms of non-linear correlations.
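As one illustration of matching the statistic to the data type, here is a sketch that assumes Spearman’s rank correlation for ranked or monotonic but non-linear data. That choice is an assumption made for the example, not necessarily what the proposal specifies, and the sample values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical monotonic but non-linear relationship (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

linear_r = pearson(x, y)
rank_r = spearman(x, y)
print(f"Pearson: {linear_r:.3f}, Spearman: {rank_r:.3f}")
```

The rank-based measure reports a perfect monotonic relationship here, while the linear measure understates it.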

Regarding the site you reference about the stages of data science, I hesitated posting about it, because it really is VERY rough at the moment. But I think you might misunderstand its purpose. It is a site I’m creating, to try to teach kids (middle school and high school) the basics of data science, and to help them contribute data to my research. So yes, I know that there is not a “data artist”, but there are those who program data visualizations, and many of these are becoming more artistic, as the end-user often appreciates aesthetically pleasing visualizations. (And, yes, I am familiar with Tufte’s arguments about functionality, but if one can have both function and beautiful form, then that is even better, in my professional opinion)

The only argument that I think you make that is relevant (not that most of your arguments are wrong, but again I don’t think you really understood what I’m really doing) is that I might be biting off more than I can chew. But this is the whole reason I’m working to crowdsource the project. And while I would love to get 10,000 variables in the analysis, and that is my “moon shot”, I realistically expect to have far fewer than that… But the research project will be set up in a way that allows continual contribution, so my initial thesis may cover a smaller subset of variables than what will ultimately be analyzed.

So, while your response frustrates me, because I don’t think you “get” what I’m really doing, I truly appreciate you reading my stuff at a more rigorous level. Most of my friends who read this blog are not data science nerds 🙂

Ok, I read your proposal, and here’s some feedback:

You are right. You need to use simple regression rather than multivariate regression since randomly associating more than one independent variable with a dependent variable would introduce the problem of direct dependencies and co-dependencies between independent variables. For example, a normal multiple regression model has to be constructed very carefully using rational judgement to ensure that the independent variables are not influencing each other. In your methodology, you don’t have the luxury to rationally determine the potential for dependency since you basically are going to generate an analysis of every possible relationship, some with heavy dependencies and others without. OK, I get it.

Co-dependency is still a problem with simple regression. For example, if A causes a reaction in B, and A also causes a reaction in C, then regressing B against C may indicate that B causes a reaction in C, which in reality is erroneous since the observed B-C reaction is really attributed to A influencing both B and C simultaneously. You might want to build in automated means to flag these circumstances, which may be difficult to do.
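One possible way to flag this situation is a first-order partial correlation, which measures what is left of the B-C relationship once A is held fixed. This is only a sketch of one technique, not something the proposal commits to. The data is simulated, and the noise levels are arbitrary assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def partial_correlation(b, c, a):
    """Correlation of b and c after removing the linear effect of a."""
    r_bc = pearson(b, c)
    r_ab = pearson(a, b)
    r_ac = pearson(a, c)
    return (r_bc - r_ab * r_ac) / sqrt((1 - r_ab ** 2) * (1 - r_ac ** 2))


# Simulated data: A drives both B and C; B and C never influence each other.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]
b = [x + 0.5 * rng.gauss(0, 1) for x in a]
c = [x + 0.5 * rng.gauss(0, 1) for x in a]

raw = pearson(b, c)                      # looks like a strong B-C link
adjusted = partial_correlation(b, c, a)  # shrinks toward zero once A is controlled

print(f"raw r(B, C) = {raw:.2f}, partial r(B, C | A) = {adjusted:.2f}")
```

A large drop from the raw to the adjusted value suggests a shared driver. Checking every third variable for every pair multiplies the work, which is part of why automating this is hard.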

Your data point database will actually be rather small. There are approximately 200 countries in the world. If you gather 100 facts for each, then you will have only 20,000 records. If you up that to 1,000 facts for each country, you still only have 200,000 records. If you maintain five years of monthly data to feed your regressions, then multiply each of these scenarios by 60, giving you 1,200,000 and 12,000,000 records respectively, easily capable of being handled by a modern efficient DBMS.

The real problem is the number of possible relationships between variables, which could be enormous. I suspect that for the case of 100 facts for each country, you will have (100*100)/2 relationships * 200 countries, meaning 1,000,000 relationships to analyze. And for 1000 facts, you will have (1000*1000)/2 * 200 countries, meaning 100,000,000 relationships to analyze. My math might be a bit off, but you get the idea. The issue is that you will need a large repository to store the results, which will take up substantially more space than the data points themselves.
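For reference, the exact number of distinct pairs among n variables is n*(n-1)/2, slightly below the n*n/2 estimate. A quick sketch of the arithmetic, keeping the per-country multiplier of 200 used in the estimate above (the function names are illustrative):

```python
from math import comb

COUNTRIES = 200  # approximate figure used in the estimate above


def pair_count(variables):
    """Number of distinct unordered pairs among the variables: n*(n-1)/2."""
    return comb(variables, 2)


def per_country_relationships(variables, countries=COUNTRIES):
    """Pairs multiplied by the number of countries, as in the estimate above."""
    return pair_count(variables) * countries


for n in (100, 1000):
    print(n, pair_count(n), per_country_relationships(n))
```

The pair count grows with the square of the number of variables, so a tenfold increase in variables means roughly a hundredfold increase in relationships.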

Because of this enormous storage of results issue, Excel is clearly not the way to go. You need a real DBMS. With a large number of records, you need indexes for efficient processing for both batch analysis and retrieval of individual results. Only a DBMS will allow you to do this.

Also, you need to use a fully normalized database since denormalizing will exacerbate the storage problem even further. There is no advantage to denormalized databases other than speed for sequential retrieval of large numbers of records.

The problem with denormalization is that you ALWAYS lose flexibility in using data. I have an extensive background in Data Warehousing involving both fully normalized data warehouses and denormalized datamarts, and can tell you from experience that under conditions of uncertainty, where you don’t initially know how the data will be used, the best choice is always normalization. Denormalization in the real world is usually implemented when you have a specific application that accesses data (in a datamart for example) in a more-or-less fixed manner, which in turn can be optimized for performance.

Also, you will need TWO databases, one to store the observed data points, and a second to store the results of each individual analysis, including at minimum the regression coefficient and correlation coefficient for each relationship. Since you will have millions of results any way you cut it, you will need a database to contain the results so that they can be efficiently queried.

I’m really talking too much here but I’d wind down by mentioning the need for computing and storing R2 values. I think you know their importance, but I’ll summarize anyways. R2 can be between 0 and 1 inclusive. An R2 of 1 means that 100% of all data points used in the regression fall exactly on the line calculated by the regression. In this case, the data points themselves form a perfect line that could be observed even without doing a regression.

An R2 of exactly 0 means that no specific line can be determined from the regression since the data points are randomly and evenly distributed in an evenly dense circular pattern, and therefore any line drawn through the data points would be just as accurate as another. For example, a line drawn vertically through the points would fit the pattern just as well as a line drawn horizontally, or just as well as a line drawn with any randomly chosen slope for that matter.
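The two extremes can be checked with a short sketch that assumes the usual definition for a simple least-squares fit, R2 = 1 - SS_residual / SS_total. Both data sets are made up.

```python
def r_squared(xs, ys):
    """R-squared of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after the fit versus total variation around the mean.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4]
on_a_line = [3, 5, 7, 9]   # exactly y = 2x + 1
no_trend = [1, 2, 2, 1]    # symmetric, so the fitted slope is zero

perfect = r_squared(xs, on_a_line)
none = r_squared(xs, no_trend)
print(perfect, none)
```

In the second case the best-fit line is flat at the mean of y, so it explains none of the variation.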

Rarely do we have an R2 of exactly zero which introduces an interesting issue. If R2 is very close to zero, such as having a value of .01 meaning 1%, the regression will deliver to us a slope in the form of a regression coefficient that appears to explain a relationship, when all it really does in this case is explain that the regression is 99% erroneous.

A common rule of thumb is to consider regression results useful only if R2 is greater than 50%. An R2 of less than 50% means that the regression explains more in terms of error than it does in terms of the relationship between the independent and dependent variables. Or to put it differently, an R2 of exactly 50% means you have the same chance of a coin toss coming up heads as you have of being right with your prediction based on the regression. Since I don’t like 50/50 propositions, I prefer R2 values of at least 70%, 80%, or in some cases even higher.

That’s it. Have fun with your project.

And…don’t make it hard on yourself. You’re doing simple regression, right? Well, many modern DBMSs have regression built in as SQL functions so you won’t need any expensive stat package like SAS.

For example, here’s the regression SQL functions available within Oracle:

http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions132.htm

What’s cool with this method is that you can write SQL to regress and to then automatically enter the results into your results table using an INSERT/SELECT statement.

Basically the coding process is as follows:

1) Write a SQL statement that creates en masse a set of SQL INSERT/SELECT statements for every possible relationship to be examined. Add some WHERE selection criteria so that one country can be processed at a time.

2) Run in batch the INSERT/SELECT statements created in step #1. Just submit the statement file using batch SQL*Plus and let it chug away for a few hours.

3) Come back later and examine your results after step #2 has completed. Write a query against the results table to sort the relationships in descending order by R-Squared, meaning the most relevant relationships will appear on top.

Setting up the database and tables for this, as well as setting up the processes outlined above would take about a half day.
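The three steps above can be sketched end to end. This is a stand-in, not the Oracle approach itself: it uses Python’s built-in sqlite3 module, and because SQLite has no built-in regression functions the fit is computed in Python rather than inside an INSERT/SELECT. The table names, column names, and data are all hypothetical.

```python
import sqlite3
from itertools import combinations


def fit(xs, ys):
    """Return (slope, r_squared) for a simple least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (country TEXT, indicator TEXT, value REAL)")
conn.execute(
    "CREATE TABLE results (indicator_x TEXT, indicator_y TEXT, slope REAL, r_squared REAL)"
)

# Made-up observations for four imaginary countries.
data = {
    "edu": [1, 2, 3, 4],
    "health": [2, 4, 6, 8],
    "price": [5, 3, 6, 5],
}
rows = [
    (f"country_{i}", name, value)
    for name, values in data.items()
    for i, value in enumerate(values)
]
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)

# Step 1: enumerate every pair of indicators to examine.
names = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT indicator FROM observations ORDER BY indicator"
    )
]

# Step 2: run each regression and store its result.
for x_name, y_name in combinations(names, 2):
    paired = conn.execute(
        "SELECT a.value, b.value FROM observations a "
        "JOIN observations b ON a.country = b.country "
        "WHERE a.indicator = ? AND b.indicator = ?",
        (x_name, y_name),
    ).fetchall()
    slope, r2 = fit([p[0] for p in paired], [p[1] for p in paired])
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?)", (x_name, y_name, slope, r2))

# Step 3: list relationships with the highest R-squared first.
ranked = conn.execute(
    "SELECT indicator_x, indicator_y, r_squared FROM results ORDER BY r_squared DESC"
).fetchall()
for row in ranked:
    print(row)
```

On a database that does provide regression aggregates, the Python fit in step 2 could be replaced by a single INSERT/SELECT per pair, which is the point of the suggestion above.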

Given that the algorithm is going to change which method of regression / finding the coefficient of correlation dependent upon the type of data it is trying to fit, I don’t think using built-in database functions is going to do what I need. In fact, my recent look at Python is convincing me that even SciPy doesn’t have all that is necessary, and that I will need to use some R, which I can bridge back to Python (so I can use Django for the website). But, if I wasn’t doing more complex stuff, then I would take up your suggestion as clearly using the database’s functionality natively is usually much quicker than writing external code.