#000: In which our heroine does predictive modeling, forever.

I recently ran a survey on SurveyMonkey to determine content for this blog, and to gauge interest in the five branches of analytics (though, in its defense, cognitive computing can be applied across all four of the other branches). 194 responses and a promise that I would create one [ video, blog, live programming broadcast, interpretive dance, etc. ] per unique IP address later, the internet has spoken!  Looks like we’ll be seeing a lot of each other for the next couple of years, huh?

(It’s okay. I love y’all, too.)


Question 1:
What is your preferred medium for learning materials?


Blog posts took the lead with an average score of 2.36 (these are average ranks, so lower is better).  Jupyter notebooks, short videos with text, and YouTube videos had fairly similar averages (with scores of 2.96, 2.93, and 2.91, respectively), but extremely varied distributions. As such, I’ll be focusing more on Jupyter notebooks and less on videos. Live programming broadcasts brought up the rear with an average score of 3.85 and an extreme left skew, which is fantastic!  I mean, you wouldn’t want me and my career to go all Truman Show, right?
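For anyone curious how an "average score" and a "left skew" fall out of ranking data, here is a minimal sketch. The rankings below are made up for illustration (they are not the survey responses), the helper names `average_rank` and `skewness` are hypothetical, and the skewness formula is the plain population moment version, which may differ from whatever SurveyMonkey reports.

```python
from statistics import mean

def average_rank(ranks):
    """Mean of the ranks a medium received (1 = most preferred)."""
    return mean(ranks)

def skewness(ranks):
    """Population skewness: third central moment over variance ** 1.5."""
    mu = mean(ranks)
    n = len(ranks)
    m2 = sum((r - mu) ** 2 for r in ranks) / n
    m3 = sum((r - mu) ** 3 for r in ranks) / n
    return m3 / m2 ** 1.5

# Made-up rankings, NOT the real survey data (1 = best, 5 = worst).
hypothetical_ranks = {
    "blog posts": [1, 2, 2, 3, 1, 2, 4, 3],
    "live broadcasts": [5, 5, 4, 5, 2, 5, 4, 5],
}

for medium, ranks in hypothetical_ranks.items():
    print(f"{medium}: mean={average_rank(ranks):.2f}, skew={skewness(ranks):.2f}")
```

A negative skew here means the long tail points toward rank 1: most respondents piled up at the bottom ranks, with only a few brave souls ranking the option highly.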

Regardless, I’ll be incorporating several live-stream events in the hope that it will help you troubleshoot your own code, or resolve issues when incorporating your training / test data into a machine learning model. As Greg Wilson reinforces: real programmers always make mistakes; real programmers use documentation religiously; and real programmers would go Mad Max if they didn’t have access to tools like StackOverflow.


Question 2:
Which branch of analytics is most interesting to you?


Predictive analytics (including machine learning!) was the winner this time, with an average score of 2.73.  Diagnostic followed with 2.90;  then Prescriptive (3.05), Cognitive (3.14), and Descriptive (3.21).  The definitions given for the five branches were as follows:

  • Descriptive:  What happened?
  • Diagnostic:  Why did that happen?
  • Prescriptive:  What is the best course of action?
  • Predictive:  What will happen?
  • Cognitive:  What questions should I be asking?

…Ironically, for the first 6 hours of the survey, blog posts and predictive modeling were ranked dead last in positive (rank 1) end user responses — which would have defeated the entire purpose of this website.  Thank heavens for people who read Twitter during typical business hours.

Moral: never base your training data on the 10:00pm – 5:00am CST crowd.


Question 3:
Please rank the topics below, according to interest:

  • Data Cleaning / Data Engineering
  • Machine Learning
  • Data Mining
  • Data Visualization
  • Productionalizing Algorithms
  • Machine Learning tools (e.g., Azure ML Studio, RapidMiner, etc.)
  • Data Science IDE SWOT analysis (e.g., RStudio, Spyder, etc.)
  • Analytics in cloud environments
  • Using Hadoop (and its associated dongles) for data science

Machine Learning came in first (whew!) with an average of 2.71; strangely, though, it and data cleaning / data engineering have the largest variance (with standard deviations of 2.43 and 2.41, respectively). Data Visualization’s the second most popular, with a score of 3.63.
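A topic can win on average and still split the room, which is what a large standard deviation hints at. The sketch below sorts topics by mean rank and reports the spread alongside it. The ranks are invented for illustration, `summarize` is a hypothetical helper, and the choice of the sample standard deviation (`statistics.stdev`) is an assumption, since the survey tool's exact formula isn't stated here.

```python
from statistics import mean, stdev

# Made-up ranks, NOT the real survey data (1 = most interesting, 9 = least).
hypothetical_topic_ranks = {
    "Machine Learning": [1, 1, 2, 1, 8, 1, 2, 7],
    "Data Visualization": [3, 4, 3, 4, 4, 3, 4, 3],
    "Analytics in cloud environments": [7, 8, 6, 8, 7, 9, 8, 7],
}

def summarize(topic_ranks):
    """Return (topic, mean rank, sample std dev) tuples, best mean first."""
    rows = [(topic, mean(r), stdev(r)) for topic, r in topic_ranks.items()]
    return sorted(rows, key=lambda row: row[1])

for topic, avg, sd in summarize(hypothetical_topic_ranks):
    print(f"{topic:33s} mean={avg:.2f}  sd={sd:.2f}")
```

In this toy data, Machine Learning has the best mean but also the widest spread, because a couple of respondents ranked it near the bottom.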

Hadoop analytical tools, strengths and weaknesses of data science development environments, and techniques for practicing analytics in the cloud (be it on AWS, Azure, Google Cloud, or your favorite flavor of server service provider) are resoundingly unpopular topics – which is a shame!  I’ve found those solutions to be the most difficult to implement. I’ll still try to weasel in a couple of posts about Ibis, Azure ML Studio, and the Cortana Intelligence Suite.

For those who are curious, the following is an extremely unrepresentative sample of preferred cloud providers, taken from a recent Twitter poll:

Digital Ocean was also a strong contender.


Question 4:
What is your preferred programming language?


…oh, screw it. Let’s do both.


Question 5:
On which continent do you live?


This distribution is troubling.  There is decidedly poor representation from Africa, Asia, South America, and Australia;  Europe and North America constitute almost 90% of responses.  In an attempt to raise interest in those geographic areas, at least 20% of the data used for analysis in this blog will be applicable to those four under-represented continents.

(You, too, Antarctica. I promise to talk about you very, very often.)


At any rate:  welcome aboard!  Hopefully it’ll be a fun ride.  I can’t promise enlightenment, but will definitely be able to surface a few Marvel universe .gifs. (And no, even though there are weak descriptive statistics, this doesn’t count as one of the 197 entries.  Where would be the fun in that?)