Homomorphic encryption, statistical machine learning and an R software package

One of the papers I have found most important in the last couple of years is A review of homomorphic encryption and software tools for encrypted statistical machine learning, from the Department of Statistics at the University of Oxford, England. I discovered it recently but had not, until now, found the time to do it justice and post a comment/review for the readers of this blog. It has the additional incentive of featuring a fellow Portuguese researcher in Statistics, Pedro M. Esperança.

The paper is of particular importance and significance because it is about two topics in modern Computer Science that are making headlines on a daily basis, with no sign of this trend fading anytime soon: Machine Learning and Cryptography. These are topics that have come of age, both in the quality of their research output and in the innovativeness of their applications. Proofs of concept abound in both fields and, with more or less innovative quality or significance, the enthusiasm and excitement keep growing. Unsurprisingly, the discipline of Statistics welcomes this, for it plays a crucial role in both Machine Learning and Cryptography. In the case of Cryptography it is not so much Statistics that is structurally important as Probability. Those of us who studied Engineering at undergraduate and graduate levels may well remember the Probability & Statistics classes with fond nostalgia. In my time they were taught as a unified subject; I do not know whether that is still the case, though I understand it may not be.

A review of homomorphic encryption and software tools for encrypted statistical machine learning 

Abstract

Recent advances in cryptography promise to enable secure statistical computation on encrypted data, whereby a limited set of operations can be carried out without the need to first decrypt. We review these homomorphic encryption schemes in a manner accessible to statisticians and machine learners, focusing on pertinent limitations inherent in the current state of the art. These limitations restrict the kind of statistics and machine learning algorithms which can be implemented and we review those which have been successfully applied in the literature. Finally, we document a high performance R package implementing a recent homomorphic scheme in a general framework.

The motivation behind this research paper is stated explicitly in the very first paragraph of its introduction. The modern digital world is wonderful in many ways, but it is also a place of growing concern over the security and integrity of data and information. In fields such as statistical learning, machine learning and data analytics, sensitive information runs through digital pipelines daily, yet its security and integrity are not always fully guaranteed. The need to come up with solutions has therefore become a point of pressure for everyone involved. Take medical records, to mention just one area of obvious concern. The integrity of medical records is of paramount importance to physician and patient alike. The physician wants to maintain a quality record of his patients, since his reputation rests on providing the best service, and to guarantee that he must have the record straight and precise. The patient is an even more directly interested party, since his quality of life and the privacy of his medical history may depend on the quality and integrity of the record and of the information held about him. Add to this the emergence of new devices, which promise to improve the interaction between the physician, the patient and the information about the patient's health, but which are at the same time fraught with all sorts of security dangers, some foreseen and others completely unexpected:

The extensive use of private and personally identifiable information in modern statistical (and machine learning) applications can present an obstacle to individuals contributing their data to research. As just one example, when considering contribution to biobanks Kaufman et al. (2009) reported 90% of respondents had privacy concerns. Addressing these concerns is paramount if the participation rate in biomedical and genetic research is to be increased, especially for government and industry where public trust is lower (Kaufman et al., 2009). Indeed, industry is on the brink of embarking on biomedical applications on a scale never before witnessed via the impending wave of so-called ‘wearable devices’ such as smart watches, which present serious privacy concerns. Companies hope to market the ability to monitor and track vital health signs round the clock, perhaps fitting classification models to alert different health concerns of interest. However, such constrained devices will almost certainly leverage ‘cloud’ services, uploading reams of private health diagnostics to corporate servers. Herein, it is demonstrated how recent advances in cryptography allow individual privacy to be preserved, whilst still enabling researchers and industry to incorporate such data into statistical analyses.

The encryption of all this sensitive information/data is currently achieved with advances in cryptographic techniques. The issue, though, is that many of these techniques still require cumbersome encryption/decryption steps, which in many cases introduce latency into processes where it is least wanted. Hence the need to bypass those steps. Recent advances such as homomorphic and functional encryption schemes are showing the potential to do exactly that, especially in data transmission and sharing contexts:

 From a data science perspective, the problem with employing cryptographic methods to improve trust is that the data must at some point be decrypted for use in a statistical analysis. However, recent cryptography research in the areas of homomorphic and functional encryption are showing exciting potential to bypass this. An encryption scheme is said to be homomorphic if certain mathematical operations can be applied directly to the cipher text in such a way that decrypting the result renders the same answer as applying the function to the original unencrypted data.

There are some limitations, though:

The remarkable properties of homomorphic encryption schemes are not without limitations, which typically include slow evaluation and the fact that the set of functions which can be computed in cipher text space is very restricted. However, by understanding the constraints and restrictions it is hoped that statistics researchers can assist in the research effort, adapting statistical techniques to be amenable to homomorphic computation by making and quantifying reasonable approximations in those situations where a traditional approach cannot be implemented homomorphically.

The paper by our statisticians aims at developing a methodological framework specifically crafted for homomorphic computation. They accompany this effort with a package for the R programming language specifically designed for the purpose:

The aim of this paper is to provide statisticians and machine learners with sufficient background to become involved in developing methodology specifically crafted to homomorphic computation. As part of this effort we describe an accompanying high performance R package providing an easy to use reference implementation as a core contribution of this work. In a sister publication (Aslett, Esperança and Holmes, 2015) we present some novel statistical machine learning techniques developed to be amenable to fitting and prediction encrypted.

Homomorphic Encryption

One aspect worth mentioning from the outset is the unusual nature of encryption functions. They are not to be treated as functions in the usual mathematical sense; for instance, in an equation involving one of them, equality is almost always to be read as an assignment:

Fundamentally encryption can be treated as simply a mapping which takes m and a public key, kp, and produces the cipher text, c ← Enc(kp, m). Notationally, ← is used to signify assignment rather than equality, since encryption is not necessarily a function in the mathematical sense: any fixed inputs kp and m will produce many different cipher texts. Indeed, this is a desirable property for public key encryption schemes, referred to as semantic security: a scheme is semantically secure if knowledge of c for some m has vanishingly small probability of revealing further information about any other encrypted message.

(…)

Semantic security is achieved by introducing randomness into the cipher text which is sufficiently small not to interfere with correct decryption when in possession of ks, but, as will become apparent in the sequel, this essential feature imposes a handicap on all currently known homomorphic schemes.

A homomorphic encryption scheme is one in which certain operations can be carried out directly on the cipher texts, so that decrypting the result gives the same answer as applying those operations to the original messages. This removes the need to decrypt the data before computing on it:

The term homomorphic encryption describes a class of encryption algorithms which satisfy the homomorphic property: that is certain operations, such as addition, can be carried out on cipher texts directly so that upon decryption the same answer is obtained as operating on the original messages. In simple terms, were one to encrypt the numbers 2 and 3 separately and ‘add’ the cipher texts, then decryption of the result would yield 5. This is a special property not enjoyed by standard encryption schemes where decrypting the sum of two cipher texts would generally render nonsense.

Dec(ks, Enc(kp, m1) ⊕ Enc(kp, m2)) = m1 + m2

(where ⊕ denotes the operation performed in cipher text space that corresponds to addition of the messages)

(…)

Note this is not a group homomorphism in the mathematical sense, since the property does not commute when starting instead from cipher texts, due to semantic security. That is, because the same message encrypts to different cipher texts with high probability, in general:

Enc(kp, Dec(ks, c1) + Dec(ks, c2)) ≠ c1 ⊕ c2

(…)

Another consequence of semantic security is that operations performed on the cipher text may increase the noise level, so that only a limited number of operations can be consecutively performed before the noise must be reduced.

The limitations of earlier encryption schemes sparked interest in achieving schemes capable of performing both addition and multiplication any number of times:

The advent of a scheme capable of evaluating both addition and multiplication a (theoretically) arbitrary number of times led to a surge of optimism, since then any polynomial can be computed and so the output of any suitably smooth function could in principle be arbitrarily closely approximated. Moreover, if M = {0, 1} then addition corresponds to logical XOR, and multiplication corresponds to logical AND, which is sufficient to construct arbitrary binary circuits so that, in principle, anything which can be evaluated by a computer can be represented by an algorithm which will run on homomorphically encrypted data.
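As a small unencrypted illustration of that last point, the remaining Boolean gates really do fall out of addition and multiplication modulo 2; the snippet below is plain R arithmetic and involves no encryption scheme at all:

```r
# Building standard logic gates from addition (XOR) and multiplication (AND)
# over {0, 1}: a plain-arithmetic illustration of why these two operations
# suffice to construct arbitrary binary circuits.
XOR <- function(a, b) (a + b) %% 2
AND <- function(a, b) (a * b) %% 2
NOT <- function(a) XOR(a, 1)                      # 1 + a (mod 2)
OR  <- function(a, b) XOR(XOR(a, b), AND(a, b))   # a + b + a*b (mod 2)

# Truth-table check:
for (a in 0:1) for (b in 0:1)
  cat(a, b, "-> XOR:", XOR(a, b), " AND:", AND(a, b), " OR:", OR(a, b), "\n")
```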

In the following part of the paper the authors go on to describe the major earlier homomorphic encryption schemes produced by a flurry of optimistic activity in this area in recent years. I will skip ahead, given the obvious limitations of space and scope of a blog post, but I advise the interested reader to consult the paper and its Appendix for the full details. Below are a few more paragraphs from this section of the paper that I thought relevant, concerning the usage scenarios that might arise for the users and engineers working with this material:

The most obvious usage scenario is to outsource long-term storage and computation of sensitive data to a third-party cloud provider. Here the ‘client’ (the owner of the data) encrypts everything prior to uploading to the ‘server’ (at the cloud provider’s data centre). Due to some of the limitations discussed above, this scenario is perhaps currently only suitable in a restricted set of situations where the added computational costs and inflated data size are not prohibitive. With homomorphic schemes improving all the time the boundary where this is a practical usage scenario will shift over time.

The next paragraph might be of relevance to the healthcare sector. Here it might seem that the accuracy and quality of what is proposed are still something of a wishful-thinking scenario, and that the increased statistical power from larger sample sizes is only achievable with a high degree of coordination and consensus among the institutions, physicians and patients involved. Theoretical only, or something that will see the light of day sooner or later?

An additional scenario is one in which it is desirable to be able to perform statistical analyses without the data being visible to anyone at all. To be concrete, consider a research institute requiring patient data for analysis: the research institute could widely distribute their public key to enable patients to securely donate their sensitive personal data. This data would be encrypted and sent directly to the cloud provider who would have a contractual obligation to only allow the research institute access to the results of pre-approved functions run on that data, not to the raw encrypted data itself. Peer review would be important for pre-approving certain functions to be homomorphically executed to ensure that the original data is not indirectly leaked. An interesting effect here may be increased statistical power (despite homomorphic approximations) due to the greater sample sizes which could result from increased participation because of the privacy guarantees.

As for the coordination between a client (say, a healthcare provider) and a software developer, it might be straightforwardly feasible, requiring only a skill set shared by both:

There is at least one further usage scenario: that is, where there is confidential data on which a confidential algorithm must be run. In this situation, a client may encrypt their data to give to the developer of the algorithm and receive the results of the algorithm without either party compromising data or algorithm. In this situation, the constraints of homomorphic encryption are merely an opportunity cost because there may be no other way to achieve the same goal.

HomomorphicEncryption R package

In this section I briefly describe the R package that was specifically designed as part of this paper and research effort. It is an elegant piece of high-level programming that adds a level of functionality which considerably eases the work of the researcher or practitioner in statistics:

The HomomorphicEncryption R package (Aslett, 2014) provides an easy to use interface to begin developing and testing statistical methods in a homomorphic environment. The package has been developed to be extensible, so that as new schemes are researched by cryptographers they can be made available for use by statistics researchers with minimal additional effort. The package has a small number of generic functions for which different cryptographic backends can be used. The underlying implementation is mostly in high performance C and C++ (Eddelbuettel et al., 2011), with many of the operations set up to utilise multi-core parallelism via multithreading (Allaire et al., 2014) without requiring any end-user intervention.

The first generic cryptographic function is pars. The first argument to this function designates which cryptographic backend to use and allows the user to override any of the default parameters of that scheme (for example, d, q, t and σ of Section 2.3). Related to this, there is the alternative method of specifying parameters via the function parsHelp. This allows users to instead specify a desired minimal security level in bits and a minimal depth of multiplications required, and then computes values for d, q, t and σ which will satisfy these requirements with high probability, by automatically optimising established bounds from the literature (Lepoint and Naehrig, 2014; Lindner and Peikert, 2011).
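Based on that description, a minimal sketch of parameter selection might look like the following; note that the backend name "FandV" and the argument names given to parsHelp (a security level in bits and a multiplicative depth) are assumptions of mine rather than something quoted from the package documentation:

```r
# A sketch of parameter selection (backend name and argument names assumed).
library(HomomorphicEncryption)

# Default parameters for an assumed Fan-Vercauteren style backend:
p <- pars("FandV")

# Alternatively, ask for parameters satisfying a desired security level (in
# bits) and a minimal depth of multiplications; argument names are assumed:
p2 <- parsHelp("FandV", lambda = 80, max_depth = 4)
```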

The second generic cryptographic function is keygen, whose sole argument is a parameter object as returned by pars or parsHelp. keygen then generates a list containing public ($pk) and private ($sk) keys, along with any scheme dependent keys (such as relinearisation keys in the case of Fan and Vercauteren (2012)), which correspond to the homomorphic scheme designated by the parameter object. At this point, the parameter object is absorbed into the keys so that it doesn’t need to be used for any other functions.
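A sketch of key generation following on from the previous snippet (the backend name is, again, an assumption):

```r
# Key generation from a parameter object (backend name assumed).
library(HomomorphicEncryption)

p <- pars("FandV")
k <- keygen(p)

# As described in the paper:
#   k$pk -- public key, used for encryption
#   k$sk -- secret (private) key, used for decryption
# plus any scheme-dependent keys (e.g. relinearisation keys for Fan & Vercauteren).
str(k, max.level = 1)   # inspect the structure of the returned key list
```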

The third generic cryptographic function is enc. This requires simply the public key (as returned in the $pk list element from keygen) and the integer message to be encrypted. It then returns a cipher text encrypted under the scheme to which the public key corresponds. Crucially, the ease of use begins to become very apparent here, with enc overloaded to enable encryption of not just individual integers, but also vectors and matrices of integers defined in R. The structure of the vectors and matrices are preserved and the encryption process is fully multithreaded across all available CPU cores automatically.

The final generic cryptographic function is dec. Similarly, this requires simply the private key, as returned in the $sk list element from keygen, and the (scalar/vector/matrix) cipher text to be decrypted. It then returns the original message. Note that the structure of vector or matrix cipher texts is correctly preserved throughout.
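Putting enc and dec together, a sketch of round-tripping scalars, vectors and matrices (the backend name is assumed; the functions and the $pk/$sk elements are as described above):

```r
# Encrypting and decrypting scalars, vectors and matrices (a sketch).
library(HomomorphicEncryption)

p <- pars("FandV")                        # assumed backend name
k <- keygen(p)

c1 <- enc(k$pk, 42L)                      # a single integer
cv <- enc(k$pk, c(1L, 2L, 3L))            # a vector of integers
cm <- enc(k$pk, matrix(1:4, nrow = 2))    # a matrix of integers

dec(k$sk, c1)                             # 42
dec(k$sk, cv)                             # 1 2 3   (vector structure preserved)
dec(k$sk, cm)                             # 2x2 matrix, structure preserved
```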

(…)

Moreover, vectors can be formed in the usual R manner using c (or extracted from the diagonal of matrix cipher texts with diag), element wise arithmetic can be performed on those vectors (with automatic multithreaded parallelism) and there is support for all the standard vector functions, such as length, sum, prod and %*% for inner products, just as one would conventionally use with unencrypted vectors in R.

Indeed, such functionality extends to matrices, with formation of diagonal matrices via diag from cipher text vectors, element wise arithmetic and full matrix multiplication using the usual %*% R operator (again, automatically fully parallelised). Matrices also support the usual matrix functions (dim, length, t, etc). The package automatically dispatches these operations to the correct backend cryptographic routines to perform the corresponding cipher text space operations transparently, returning cipher text result objects which can be used in further operations or decrypted.
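A sketch of what that looks like in practice, restricted to the operations listed in the two paragraphs above (the backend name remains an assumption):

```r
# Arithmetic and standard vector/matrix functions on cipher texts (a sketch).
library(HomomorphicEncryption)

p <- pars("FandV")
k <- keygen(p)

v1 <- enc(k$pk, c(1L, 2L, 3L))
v2 <- enc(k$pk, c(4L, 5L, 6L))

dec(k$sk, v1 + v2)          # element-wise addition:       5 7 9
dec(k$sk, v1 * v2)          # element-wise multiplication: 4 10 18
dec(k$sk, sum(v1))          # 6
dec(k$sk, v1 %*% v2)        # inner product: 32
length(v1)                  # 3

m1 <- diag(v1)              # diagonal matrix cipher text from a vector
m2 <- diag(v2)
dec(k$sk, m1 %*% m2)        # full matrix multiplication, then decryption
```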

The following is the simplest possible instructive example. Examining the contents of k, c1, etc will show the encryption detail:

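A minimal sketch along those lines, assembled from the generic functions described above (the backend name "FandV" and the particular numbers are my assumptions, not copied from the paper's figure):

```r
# The simplest end-to-end example: parameters, keys, encryption, homomorphic
# arithmetic and decryption (backend name and values assumed).
library(HomomorphicEncryption)

p  <- pars("FandV")
k  <- keygen(p)
c1 <- enc(k$pk, 2L)
c2 <- enc(k$pk, 3L)

dec(k$sk, c1 + c2)   # 5 -- the addition was carried out on cipher texts
dec(k$sk, c1 * c2)   # 6 -- likewise for multiplication
```

Printing k, c1 or c2 at the console reveals the underlying encryption detail; by semantic security, encrypting the same number twice gives two different cipher texts, both of which nonetheless decrypt to the same message.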

At this point in the paper I wondered whether this R package and the automatic, fully parallelised operations described would not perform even better on GPU hardware instead of the CPU framework mentioned:

We hope this provides a distinctly easy-to-use software implementation in arguably the most popular high level language in use among data scientists today, including automatic help for encryption scheme parameter selection to aid non-cryptographers. Moreover, given the computational burden of homomorphic schemes, the transparent multithreaded parallelism automatically across all CPU cores in all available scenarios (encryption, decryption and arithmetic with vectors/matrices) enables focus to be on the subject matter questions.


Conclusions

This wonderful paper, research effort and statistical computing package makes for one of the best papers I have read in the last couple of years (it dates from August 2015). It strikes at two of the most important lines of research in Computer Science of our times: Machine Learning (or rather Data Science, in this case) and Cryptography (increasingly important in an age of cloud computing and ever-growing data storage requirements). I have finally found the time to do it justice with this post, which I conclude with the authors' own main conclusions as to the implications, current and future, of their proposal given its recognised limitations, and of further developments in the homomorphic encryption research area and its adoption or implementation by interested organisations:

This technical report has provided a review of homomorphic encryption with a focus on issues which are pertinent to statisticians and machine learners. It also introduces the HomomorphicEncryption R package and demonstrates the ease of getting started experimenting with homomorphic encryption. The practical limitations of homomorphic encryption schemes means that existing techniques cannot always be directly translated into a corresponding secure algorithm.

This presents an opportunity for the statistics and machine learning community to engage with research in privacy preserving methods by developing new methods which are tailored to homomorphic computation and which work within the constraints described in Section 2.4, with the sister paper to this review (Aslett, Esperança and Holmes, 2015) being an initial contribution in this direction.

