Data Science Blog at Tumblr: The Exactness of Mind

The Tumblr social network platform is a nice one in many ways. Of course I have a profile there and, as on Facebook, I host some curation pages. Facebook continues to be superior in functionality and versatility, but the Blogging community might prefer Tumblr’s ease of use and its compelling writing interface.

I do not regularly post or submit much on Tumblr. But every now and then I discover a nice Blog hosted by someone creative enough to catch my attention, especially on the topics I care about the most: Science & Technology, Business and Economics, and significant mind-changing Cultural issues (the ordering of these topics matters here, as it is precisely my order of priority).

While doing my usual research and monitoring of interesting material on those topics, I found one Blog hosted on Tumblr that I thought was right and proper to share and comment on here in The Information Age. It will feature here again in the future if the content is relevant enough, and I would certainly bet that it will be. The Blog is called The Exactness of Mind and it is written by a Serbian Data Scientist.

I am from a generation of Portuguese graduates who always heard about the quality of the education systems of the former Eastern European countries. Blogs of this kind offer another confirmation of that. They also deepen my unhappy surprise that those Countries never managed to build robust Social and Economic institutions, but that is just one more illustration of the non-linear correspondence between the quality of Educational outputs and the quality of the economic apparatus and institutions. There is more to this story, for instance cultural (behavioral) issues and the right mix of institutional quality with an openness to innovation and entrepreneurial free will (spirit), that might explain the non-linearity. My own Country is not fortunate in this mix either, but it is undoubtedly more open to entrepreneurial business acumen, within an institutional and political framework of acceptable quality. Add to this the natural empathy and peaceful habits of Portuguese people and you also wonder why it always sits in the low ranks of the developed world tables… Education and the quality of Human resource management might go a long way in explaining this…

Anyway, having said all this, I would like to come back to the nice Blog post in The Exactness of Mind by Goran S. Milovanović, PhD, titled Distributional Semantics in R: Part 2 Entity Recognition w. {openNLP}, and share some of its most important features. The post is interesting for a number of reasons. First, Goran S. Milovanović is a Cognitive Scientist with Natural Language expertise; second, the post covers important topics in advanced Data Science and links to resources Goran provides for a course he taught called Methods of Distributional Semantics in R; finally, the post is written in very good English, with William Shakespeare, none other than the great English Renaissance playwright, thrown into the mix. Yes, it really is that interesting!

Distributional Semantics in R: Part 2 Entity Recognition w. {openNLP}

 

Following my Methods of Distributional Semantics in R BelgradeR Meetup with Data Science Serbia, organized in Startit Center, Belgrade, 11/30/2016, several people asked me for the R code used for the analysis of William Shakespeare’s plays that was presented. I have decided to continue the development of the code that I’ve used during the Meetup in order to advance the examples that I have shown then into a more or less complete and comprehensible text-mining tutorial with {tm}, {openNLP}, and {topicmodels} in R. All files in this GitHub repository are a product of that work. 
Part 2 will introduce named entity recognition with {openNLP}, an Apache project in Java interfaced by this nice R package that, in turn, relies on {NLP} classes. We will try to make machine learning (MaxEnt models offered in {openNLP}) figure out the characters from Shakespeare’s plays, a quite difficult task given that the learning algorithms at our disposal were trained on contemporary English corpora.

 

Figure 1. The accuracy of character recognition from Shakespeare’s comedies, tragedies, and histories; the black dashed line is the overall density. The results are not realistic (explanation given in the respective .Rmd and .html files).

 

 

What I really want to show you here is how tricky and difficult it can be to do serious text-mining, and to help you by exemplifying some steps that are necessary to ensure the consistency of the results that you are expecting. The text-mining pipelines being developed here are in no sense perfect or complete; they are meant to demonstrate important problems and propose solutions rather than to provide copy-and-paste-ready chunks for future re-use. In essence, except in those cases where a standardized information extraction + text-mining pipeline is being developed (a situation where, by assumption, one periodically processes large text corpora, e.g. web-scraped news and other media reports, from various sources, in various formats, and where one simply needs to learn to live with approximations), every text-mining study will need a specific pipeline of its own. Chaining those tm_map() calls to various content_transformers from {tm} restlessly, while being ignorant of the necessary changes in parameters and different content-specific transformations – of which {tm} supports only a few – will simply not do.
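
To make that last point a little more concrete, here is a minimal sketch of my own; it is not taken from Goran's repository. It assumes a recent {tm} release (0.7 or later, where removePunctuation() accepts the preserve_intra_word_contractions argument), and the dropSpeaker transformer is a hypothetical helper I made up for illustration. It shows one content-specific step (stripping the speaker label that prefixes a line in a play) and one parameter change (keeping apostrophes inside words, so that an elision such as "o'erthrown" survives punctuation removal).

```r
library(tm)

# Two short lines from Hamlet, each prefixed with a speaker label
# (the formatting is an assumption made for this illustration).
texts <- c("HAMLET. To be, or not to be: that is the question.",
           "OPHELIA. O, what a noble mind is here o'erthrown!")

corpus <- VCorpus(VectorSource(texts))

# Hypothetical content-specific transformer: drop a leading
# upper-case speaker label such as "HAMLET. ".
dropSpeaker <- content_transformer(
  function(x) sub("^[A-Z]+\\.\\s*", "", x)
)

# The order matters: the label must go before lower-casing,
# because the pattern above relies on upper-case letters.
corpus <- tm_map(corpus, dropSpeaker)
corpus <- tm_map(corpus, content_transformer(tolower))

# Parameter change: keep apostrophes inside words, so "o'erthrown"
# is not silently turned into "oerthrown".
corpus <- tm_map(corpus, removePunctuation,
                 preserve_intra_word_contractions = TRUE)
corpus <- tm_map(corpus, stripWhitespace)

# Collect the cleaned text as a plain character vector.
cleaned <- vapply(corpus,
                  function(d) paste(content(d), collapse = " "),
                  character(1), USE.NAMES = FALSE)
print(cleaned)
```

Running the same chain with the default removePunctuation() settings, or lower-casing before dropSpeaker, would quietly produce different tokens, which I take to be the kind of inconsistency the quoted paragraph warns about.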

 

 

Figure 2. Don’t get hooked on the results presented in the {ggplot2} figure above; {openNLP} is not that successful in recognizing personal names from Shakespeare’s plays (in spite of the fact that it works great for contemporary English documents). I have helped it a bit, by doing something that is not applicable to real-world situations; go take a look at the code from this GitHub repository.

 

I have not been shy about sharing most of the content of the post (sharing all of it wasn’t really possible, though…), because I feel that English speakers and writers will be happy to know about Dr. Milovanović’s work. And we can only wonder what Mr. Shakespeare would have thought about this, ah… Completely confused, to say the very least. See you soon…

 

featured image: The Exactness of Mind
