We’re excited to bring Transform 2022 back in-person on July 19 and virtually July 20-28. Join AI and data leaders for insightful conversations and exciting networking opportunities. Register today!
As artificial intelligence broadens its horizons and breaks new ground, it increasingly challenges the imagination to open new frontiers. While new algorithms and models help address more and more business problems, advances in natural language processing (NLP) and language models are prompting programmers to rethink how programming itself is done.
With the proliferation of programming languages, the job of a programmer has become increasingly complex. While a good programmer can devise a good algorithm, converting it into a given programming language requires knowledge of that language's syntax and available libraries, limiting a programmer's fluency across different languages.
Programmers have traditionally relied on their knowledge, experience and repositories to build these code components in different languages. IntelliSense offered correct syntactic hints; advanced IntelliSense went a step further with syntax-based autocompletion. Google code search and GitHub code search could even list similar code snippets, but the responsibility for tracing the right bits of code, or for writing the code from scratch, assembling it and contextualizing it to the specific need, rested solely with the programmer.
We are now seeing the evolution of intelligent systems that can understand the purpose of an atomic task, understand the context, and generate the appropriate code in the required language. This generation of contextual and relevant code can only take place if there is a good understanding of the programming languages and natural language. Algorithms can now understand these nuances in different languages, opening up a range of possibilities:
- Code conversion – understanding code in one language and generating equivalent code in another.
- Code documentation – generating a textual description of a given piece of code.
- Code generation – generating correct code from a textual description.
- Code validation – validating that code conforms to a given specification.
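Under the hood, all four capabilities can be framed as the same text-to-text problem, differing only in how the input is presented to the model. A minimal sketch, in which the task prefixes and the `run_model` stub are hypothetical placeholders for a real trained model:

```python
def run_model(prompt: str) -> str:
    """Stand-in for a text-to-text model (e.g., a CodeT5-style seq2seq
    model); a real system would invoke a trained model here."""
    return f"<model output for: {prompt[:40]}...>"

# Each capability is just a different framing of one seq2seq interface.
def convert(code: str, src: str, tgt: str) -> str:
    return run_model(f"translate {src} to {tgt}: {code}")

def document(code: str) -> str:
    return run_model(f"summarize: {code}")

def generate(spec: str, lang: str) -> str:
    return run_model(f"generate {lang}: {spec}")

def validate(code: str, spec: str) -> str:
    return run_model(f"check spec: {spec} against code: {code}")

print(convert("print('hi')", "Python", "Java"))
```

The design point is that one pretrained model can serve all four tasks; only the prompt framing (and any fine-tuning data) changes.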
The evolution of code conversion is better understood when we look at Google Translate, which we use quite often for natural language translations. Google Translate learned the nuances of translation from a vast corpus of parallel datasets – source language statements and their equivalent target language statements – unlike traditional systems, which relied on translation rules between source and target languages.
Since it is easier to collect data than to write rules, Google Translate has scaled to translate between more than 100 natural languages. Neural machine translation (NMT), a type of machine learning model, enabled Google Translate to learn from a huge dataset of translation pairs. The efficiency of Google Translate inspired the first generation of machine learning-based programming language translators to adopt NMT. But the success of NMT-based programming language translators has been limited by the unavailability of large-scale parallel datasets (needed for supervised learning) across programming languages.
This led to unsupervised machine translation models that exploit the large-scale monolingual codebases available in the public domain. These models learn from monolingual code in the source programming language, then from monolingual code in the target programming language, and are then equipped to translate code from the source to the target. Built on this approach, Facebook’s TransCoder is an unsupervised machine translation model trained on multiple monolingual codebases from open-source GitHub projects, capable of efficiently translating functions between C++, Java, and Python.
Code generation is currently evolving in different avatars – as a regular code generator or as a pair programmer that auto-completes a developer’s code.
The main technique used in NLP models is transfer learning, where models are pre-trained on large amounts of data and then fine-tuned on small, targeted datasets. Earlier models were largely based on recurrent neural networks (RNNs). Recently, models based on the Transformer architecture have proven more effective, as they lend themselves to parallelization, speeding up computation. Models fine-tuned for programming language generation can then be used for a variety of coding tasks, including code generation and generating unit test scripts for code validation.
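The parallelization advantage comes from the Transformer's core operation, scaled dot-product attention: every position attends to every other position in a single matrix multiply, instead of the step-by-step recurrence of an RNN. A minimal NumPy sketch of that one operation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: all pairwise token interactions are
    computed in one matrix multiply, so the whole sequence is processed
    in parallel (unlike an RNN, which must step through tokens in order)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity matrix
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of value rows

# Toy example: 4 tokens, 8-dimensional embeddings, self-attention (Q=K=V)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): one contextualized vector per token
```

A full Transformer stacks many of these attention layers with learned projections; this sketch only illustrates why the computation parallelizes.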
We can also reverse this approach, applying the same algorithms to understand code and generate relevant documentation. Traditional documentation systems focus on translating code into English line by line, yielding pseudocode. This new approach can instead summarize code modules into comprehensive code documentation.
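To make the input/output contract of code documentation concrete, here is a deliberately simple rule-based sketch using Python's standard `ast` module: code in, prose out. Real systems (e.g., CodeT5-style models) learn this mapping rather than hand-coding it:

```python
import ast
import textwrap

def summarize_function(source: str) -> str:
    """Rule-based stand-in for learned code-to-text documentation:
    parse the source and describe each function's name, arguments,
    and whether it returns a value."""
    tree = ast.parse(source)
    sentences = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            returns = any(isinstance(n, ast.Return) for n in ast.walk(node))
            sentences.append(
                f"Function '{node.name}' takes ({args}) and "
                f"{'returns a value' if returns else 'returns nothing'}."
            )
    return " ".join(sentences)

code = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
print(summarize_function(code))  # Function 'add' takes (a, b) and returns a value.
```

A learned model produces far richer summaries, but the interface, source code to natural-language description, is the same.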
Programming language generation models available today include CodeBERT, CuBERT, GraphCodeBERT, CodeT5, PLBART, CodeGPT, CodeParrot, GPT-Neo, GPT-J, GPT-NeoX, Codex, etc.
DeepMind’s AlphaCode goes a step further, generating many candidate programs for a given description and filtering them against the given test conditions.
Code autocompletion follows the same approach as Gmail’s Smart Compose. As many of us have experienced, Smart Compose prompts the user with real-time, context-specific suggestions, helping to compose emails faster. This is made possible by a neural language model trained on a large volume of emails from the Gmail domain.
Extending the same idea to the programming domain, a model that can predict the next lines of a program based on the preceding lines of code is an ideal pair programmer. This significantly speeds up the development lifecycle, increases developer productivity, and improves code quality.
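The prediction contract can be illustrated with a toy model: a bigram counter over lines of code that suggests the most likely next line seen in its training corpus. Real pair programmers use large Transformer language models, but the interface, previous context in, suggested continuation out, is the same:

```python
from collections import defaultdict, Counter

class NextLinePredictor:
    """Toy pair programmer: counts which line follows which in a corpus
    and suggests the most frequent successor of the current line."""

    def __init__(self):
        self.follows = defaultdict(Counter)

    def train(self, source: str):
        lines = [l.strip() for l in source.splitlines() if l.strip()]
        for prev, nxt in zip(lines, lines[1:]):
            self.follows[prev][nxt] += 1

    def suggest(self, line: str):
        candidates = self.follows.get(line.strip())
        return candidates.most_common(1)[0][0] if candidates else None

corpus = """
for i in range(n):
    total += i
with open(path) as f:
    data = f.read()
"""
model = NextLinePredictor()
model.train(corpus)
print(model.suggest("for i in range(n):"))  # total += i
```

A neural model generalizes to lines it has never seen; this counter can only replay its corpus, which is exactly the gap deep language models close.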
Not only can Copilot autocomplete blocks of code, it can also edit or insert content into existing code, making it a very powerful pair programmer with refactoring capabilities. Copilot is powered by Codex, a model with billions of parameters trained on a large body of code from public repositories, including GitHub.
An important point to note is that we are probably in a transition phase, where pair programming essentially works in a human-in-the-loop approach, which is in itself an important milestone. But the final destination is undoubtedly autonomous code generation. How AI models evolve to evoke trust and accountability will define that journey.
Generating code for complex scenarios that require problem-solving and logical reasoning is still challenging, since it can require producing code the model has never encountered before.
Understanding the current context to generate the correct code is limited by the size of the model’s context window. Most current programming language models support a context size of 2,048 tokens; Codex supports 4,096 tokens. The examples supplied in few-shot prompts consume some of these tokens, and only the remaining tokens are available for developer input and model-generated output, while zero-shot or fine-tuned models reserve the entire context window for input and output.
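A back-of-the-envelope calculation makes the trade-off concrete. Assuming a Codex-style 4,096-token window, with hypothetical sizes for the few-shot examples and prompt overhead:

```python
CONTEXT_WINDOW = 4096  # Codex-style context size, in tokens

few_shot_examples = 3
tokens_per_example = 600  # hypothetical size of each worked example
prompt_overhead = 100     # hypothetical instructions / formatting tokens

# Whatever the examples consume is no longer available for the
# developer's input and the model's generated output.
budget_for_io = (CONTEXT_WINDOW
                 - few_shot_examples * tokens_per_example
                 - prompt_overhead)
print(budget_for_io)  # 2196 tokens left for input + output
```

Under these assumed numbers, nearly half the window goes to the examples, which is why zero-shot or fine-tuned models, which need no in-prompt examples, effectively see more of the developer's context.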
Most language models require high computational power because they are built on billions of parameters. Applying these in different business contexts can put a strain on compute budgets. Currently, a lot of attention is being paid to optimizing these models to allow for easier adoption.
For these code generation models to work in pair programming mode, their inference time must be reduced so that predictions are rendered to developers in their IDE in less than 0.1 seconds, creating a seamless experience.
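That 0.1-second budget is straightforward to measure in an IDE integration. A sketch with a trivial stub in place of the model (the `suggest_completion` lookup is purely illustrative; a deployed tool would run an optimized language model here):

```python
import time

def suggest_completion(prefix: str) -> str:
    """Stand-in for a real model call; a deployed pair programmer would
    run a (heavily optimized) language model at this point."""
    cache = {"def add(a, b):": "    return a + b"}
    return cache.get(prefix, "")

# Wall-clock latency of a single suggestion: the number an IDE plugin
# must keep under ~100 ms for the experience to feel seamless.
start = time.perf_counter()
suggestion = suggest_completion("def add(a, b):")
latency = time.perf_counter() - start
print(f"{latency * 1000:.3f} ms -> {suggestion!r}")
```

The stub is trivially fast; the engineering challenge is making a billion-parameter model meet the same bound, via distillation, quantization, caching, and serving optimizations.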
Kamalkumar Rathinasamy leads the machine learning-based machine programming group at Infosys, focused on building machine learning models to augment coding tasks.
Vamsi Krishna Oruganti is an automation enthusiast and leads the deployment of AI and automation solutions for financial services clients at Infosys.