The List of Data Science Blogs: ‘When Localhost Isn’t Enough’

I am adding another interesting blog to a list that I hope will keep growing here at The Information Age: the list of interesting and relevant Data Science blogs. It is one more opportunity to learn about Data Science and data engineering, and to share them with an audience hungry for more. The post I am re-sharing today is a particularly interesting one from a blog called When Localhost Isn’t Enough, written and hosted by data scientist Alex Ioannides. It is full of code snippets in R (which, together with Python, is one of the two lingua francas of Data Science these days) and covers a topic of growing interest, asynchronous and distributed programming (computing):

Asynchronous and Distributed Programming in R with the Future Package

Source: Asynchronous and Distributed Programming in R with the Future Package

 

Highlights:

 

Every now and again someone comes along and writes an R package that I consider to be a ‘game changer’ for the language and its application to Data Science. For example, I consider dplyr one such package, as it has made data munging/manipulation that much more intuitive and more productive than it had been before. Although I only first read about it at the beginning of this week, my instinct tells me that in Henrik Bengtsson’s future package we might have another such game-changing R package.

The future package provides an API for futures (or promises) in R. To quote Wikipedia, a future or promise is,

… a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.

A classic example would be a request made to a web server via HTTP that has yet to return, and whose value remains unknown until it does (and which has promised to return at some point in the future). This ‘promise’ is an object assigned to a variable in R like any other, and allows code execution to progress until the moment the code explicitly requires the future to be resolved (i.e. to ‘make good’ on its promise). So the code does not need to wait for the web server until the very moment that the information anticipated in its response is actually needed. In the intervening execution time we can send requests to other web servers, run some other computations, etc. Ultimately, this leads to faster and more efficient code. This way of working also opens the door to distributed (i.e. parallel) computation, as the computation assigned to each new future can be executed on a new thread (and executed on a different core on the same machine, or on another machine/node).
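As a minimal sketch of the idea described above (not code from the original post), the snippet below uses the future package's `plan()`, `future()` and `value()` functions. The helper `slow_square()` and its one-second delay are hypothetical stand-ins for a slow operation such as a web request.

```r
library(future)

# Evaluate futures in two background R sessions on the local machine.
plan(multisession, workers = 2)

# Hypothetical stand-in for a slow operation such as an HTTP request.
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

# Creating the future returns immediately; the work runs in the background.
f <- future(slow_square(4))

# The main session is free to do other work in the meantime.
other_work <- sum(1:10)

# value() blocks only now, at the point where the result is actually needed.
result <- value(f)
print(result)

# Return to ordinary sequential evaluation.
plan(sequential)
```

Swapping `plan(multisession)` for another plan changes where the future is evaluated without touching the rest of the code, which is the property the excerpt highlights.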

 

(…)

 

Non-Blocking Asynchronous Input/Output

I have often found myself in the situation where I need to read several large CSV files, each of which can take a long time to load. Because the files can only be loaded sequentially, I have had to wait for one file to be read before the next one can start loading, which compounds the time devoted to input. Thanks to futures, we can now achieve asynchronous input and output as follows,

 

(…)
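The original post's code is not reproduced in this excerpt. As a rough illustration only, and not the author's implementation, loading several CSV files concurrently with the future package might look like the sketch below. The temporary files and the names `paths`, `futures` and `tables` are made up for the example.

```r
library(future)

# Evaluate futures in two background R sessions on the local machine.
plan(multisession, workers = 2)

# Write two small CSV files to a temporary directory (placeholder data for the sketch).
paths <- file.path(tempdir(), c("a.csv", "b.csv"))
write.csv(data.frame(x = 1:3), paths[1], row.names = FALSE)
write.csv(data.frame(x = 4:6), paths[2], row.names = FALSE)

# Start one future per file; each read.csv() call runs in a background session.
futures <- lapply(paths, function(p) future(read.csv(p)))

# Collect the data frames; each value() call waits only for its own future.
tables <- lapply(futures, value)

# Return to ordinary sequential evaluation.
plan(sequential)
```

With real, large files the benefit depends on how many workers are available and on whether the disk, rather than the CPU, is the bottleneck.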

 

It is well worth clicking through and reading the whole post, with its horizontally scrolling R scripts and a nice blog design overall.
