Nowadays, more and more software projects are distributed under friendly open source licenses and hence provide more flexibility to their users and other developers. Thanks to this social coding approach, massive real-world code bases built atop various trending web APIs are publicly available.
Existing source code that interacts with web APIs encodes a lot of knowledge and experience. Usually, to consume an API from an application, developers need to understand API specifications, figure out correct request settings, unmarshal responses, integrate web API requests in meaningful ways, and finally make sure that APIs behave appropriately via several rounds of careful debugging and testing.
Presumably, every developer who wants to consume web APIs has to go through similar, time-consuming procedures. For API Harmony, we asked ourselves whether we could leverage existing source code to make things easier for developers. Specifically, could we leverage massive real-world code bases to learn how to use web APIs? And how can we make this work?
Benefiting from Big Code Analysis
Following the guidelines of big code analysis, both developers and API providers could benefit from real-world API usage information. The advantages include, but are not limited to, the following:
Developers may re-use real-world examples of invoking the same APIs as tutorials or code templates.
Developers may obtain feedback about inappropriate API practices (e.g., wrong endpoints, potentially missing parameters, incorrect parameter assignments, or new API version releases).
Developers may get recommendations on popular API usages. For example, they could learn popular combinations of request parameters. In addition, they could get a rough idea of the data structure of the responses. Although potentially incomplete, such information may be helpful.
If one API usually consumes the response of another, developers may learn such dependencies and see how they are assembled together in real-world examples.
When API documentation or specifications are incomplete or unavailable, developers may still be able to learn how to use the APIs from code examples. Under some conditions, it may even be possible to infer or generate API specifications from API usages.
Inconsistencies among API specifications, documentation, and code can be detected.
When combined with an advanced understanding of API specifications, such findings could be used to notify API providers of imprecise specifications or documentation, and even motivate them to update it.
API providers could get insights into popular APIs and the preferences of their users, which may be helpful for API design and optimization.
Although the benefits look encouraging, achieving them requires a systematic way to understand web API usages in massive code bases. The key first step is extracting API usage information from arbitrary programs. Therefore, in this post, we describe our approach to web API usage extraction. Details on other components of our solution for providing insights from real API usage will follow in upcoming posts.
Usage Extraction via Program Analysis
This snippet invokes two endpoints of the Instagram APIs: the first invocation, in lines 7-14, retrieves the user ID. The second invocation, in lines 17-26, loads recent media posted by a given user.
For each Web API invocation, we are interested in identifying the following information:
Request parameters and their assignments. They suggest which API endpoint is invoked and how the request is made. For example, the first API invocation is a GET request sent to https://api.instagram.com/v1/users/search?q=XXX&access_token=YYY. We also want to resolve the values of variables such as token in the URL string concatenation so that we can obtain the complete request string. Then, we know the values of parameters
A list of fields in the response accessed in the response event handlers. They reveal the data structure of the response data and what the unmarshalling logic looks like. For example, in the success response event handler of the first request,
Dependencies among API invocations. They could be indicators of interesting data flows among APIs. For instance, the request URL of the second API invocation includes the variable userID, which is defined by the response handler of the first request in line
What are the challenges?
As the above example shows, we need to understand the data flows (e.g., variable define-use chains and potential aliasing relations) and control flows (e.g., function calls and all possible execution paths) in the program. In addition, we want to analyze a wide variety of snippets found in massive code bases. These requirements make extracting web API usage information difficult.
Dynamic analysis may not be an option
We want to be able to analyze code that is publicly available but does not necessarily have running services that we can invoke. Invoking web APIs often requires API keys. To provision keys for dynamic analysis, we would need to register and agree to terms of service, which are not always easy for a layperson to understand. Furthermore, terms of service are rarely encoded in machine-readable form, so a program cannot decide whether we can comply with them. Finally, even if the above problems were solved, dynamic analysis always faces the challenge of ensuring proper code coverage.
Pattern-based text search won’t work
A simple static approach is to use searches based on regular expressions, such as grep, to extract strings related to the base URL. However, in real-world applications the URL in a request is usually not a simple constant, but is instead assembled during execution (e.g., the URL of the second request in lines 21-22 in the above example). Consequently, grep is less effective, since resolving the URL and its parameters requires nontrivial data flow analysis.
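To make the limitation concrete, here is a minimal sketch in JavaScript. The sample source text, the variable names in it, and the regular expression are illustrative assumptions, not the snippet discussed above. A literal search recovers only the constant prefix of the URL and misses everything that is concatenated at run time.

```javascript
// Hypothetical source text to scan; not the snippet from this post.
const source = [
  'var base = "https://api.instagram.com/v1/users/";',
  'var url = base + userID + "/media/recent?access_token=" + token;',
  '$.get(url, onSuccess);',
].join("\n");

// Naive pattern search: grab every string literal that starts with http(s).
const urlLiteral = /"(https?:\/\/[^"]*)"/g;
const matches = [...source.matchAll(urlLiteral)].map((m) => m[1]);

// Only the constant base URL is found; userID, the path suffix, and the
// token never show up as part of a URL.
console.log(matches);
```

The search has no way of knowing that `url` combines `base`, `userID`, and `token`, which is exactly the information a data flow analysis provides.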
Our solution: static program analysis + string analysis
Call Graph Construction
The first step in the analysis is to build a directed graph that represents the calling relations among functions in the program. With the call graph, we can determine the execution order of statements and better understand the complicated data and control flows in the program.
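As a rough illustration of what such a graph makes possible, the sketch below stores a call graph as an adjacency map and computes which functions are reachable from an entry point. The function names and edges are hypothetical and written by hand; an actual analysis would derive them from the program, and this is not necessarily how our implementation represents the graph.

```javascript
// Hypothetical call edges: caller -> callees.
const callGraph = new Map([
  ["main", ["searchUser"]],
  ["searchUser", ["$.get", "onUserFound"]],
  ["onUserFound", ["loadRecentMedia"]],
  ["loadRecentMedia", ["$.get"]],
  ["unusedHelper", ["$.post"]],
]);

// Depth-first traversal that collects every function reachable from `entry`.
function reachableFrom(graph, entry) {
  const seen = new Set();
  const stack = [entry];
  while (stack.length > 0) {
    const fn = stack.pop();
    if (seen.has(fn)) continue;
    seen.add(fn);
    // Functions without recorded callees (e.g., library calls) are leaves.
    for (const callee of graph.get(fn) ?? []) stack.push(callee);
  }
  return seen;
}

const reachable = reachableFrom(callGraph, "main");
console.log([...reachable]);
```

In this example `$.get` is reachable from `main`, while `unusedHelper` and its `$.post` call are not, so a later step could ignore them.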
Locating request invocations
As we are only interested in API invocations, we first identify statements that make such requests. This is done by looking for patterns on the call graph, which are framework-specific (e.g., for jQuery, the patterns are function calls to $.post, etc.). In particular, if a statement makes a request call, we record it and use it as a seed for the interprocedural data flow analysis. Consequently, we produce no analysis output for a script in which no matching invocation statement is found.
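The idea of collecting seed statements can be sketched as follows. For brevity, this sketch matches calls on a small hand-written syntax tree (shaped like the ESTree format used by common JavaScript parsers) rather than on a call graph, and the set of jQuery method names is an assumption made for illustration.

```javascript
// jQuery request functions to look for; this list is an assumption.
const REQUEST_METHODS = new Set(["ajax", "get", "post"]);

// Helper that builds an ESTree-like node for `object.property(args...)`.
const call = (object, property, args) => ({
  type: "CallExpression",
  callee: {
    type: "MemberExpression",
    object: { type: "Identifier", name: object },
    property: { type: "Identifier", name: property },
  },
  arguments: args.map((name) => ({ type: "Identifier", name })),
});

// Hand-written tree for:
//   $.get(url, onSuccess); console.log(url); $.post(url2, data);
const program = {
  type: "Program",
  body: [
    call("$", "get", ["url", "onSuccess"]),
    call("console", "log", ["url"]),
    call("$", "post", ["url2", "data"]),
  ].map((expression) => ({ type: "ExpressionStatement", expression })),
};

// Recursively visit every node and collect calls of the form $.<method>(...).
function findRequestCalls(node, found = []) {
  if (node === null || typeof node !== "object") return found;
  if (Array.isArray(node)) {
    node.forEach((child) => findRequestCalls(child, found));
    return found;
  }
  if (
    node.type === "CallExpression" &&
    node.callee.type === "MemberExpression" &&
    node.callee.object.name === "$" &&
    REQUEST_METHODS.has(node.callee.property.name)
  ) {
    found.push(node);
  }
  Object.values(node).forEach((child) => findRequestCalls(child, found));
  return found;
}

const seeds = findRequestCalls(program).map((c) => "$." + c.callee.property.name);
console.log(seeds);
```

Here the two jQuery calls become seeds and the `console.log` call is ignored. If `seeds` were empty, there would be nothing to analyze for that script.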
Understanding request preparations and response handling
In the next step, we extract the statements that prepare the web API invocations. Starting from each request function call captured in the previous step, we compute where and how variables are defined. We recover all possible flows that lead to the request, track down variable definitions, and assemble the pieces. If the value of a variable cannot be determined statically (e.g., variable userID in line 22 is defined by the response handler of the first request in line 12), we use a special symbol to denote its nondeterministic, symbolic value. Since strings and string operations are pervasive in web applications, we model common string operators (e.g., encodeURI) and support symbolic values. This also allows us to track the dependencies among APIs by following the define-use chains of these symbolic variables.
Identifying the data structure in the response follows a similar procedure based on the data flow analysis.
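The following sketch shows one simplified way to resolve a concatenated URL while keeping unknown values symbolic. It is an illustration under stated assumptions, not our implementation: the `definitions` table stands in for the output of a data flow analysis, the names `symbolic`, `resolve`, and `render` are hypothetical, only concatenation is modeled (no `encodeURI` or other operators), and definitions are assumed to be acyclic.

```javascript
// Marker for a value that cannot be determined statically.
const symbolic = (name) => ({ symbolic: name });

// Hypothetical definitions: each variable maps to the parts that are
// concatenated to form its value. `{ ref }` points at another variable.
const definitions = {
  base: ["https://api.instagram.com/v1/users/"],
  token: ["YYY"],
  userID: [symbolic("request1.response")],
  url: [{ ref: "base" }, { ref: "userID" }, "/media/recent?access_token=", { ref: "token" }],
};

// Resolve a variable to a flat list of constant strings and symbolic markers.
// A variable with no known definition becomes symbolic itself.
function resolve(name, defs) {
  const parts = [];
  for (const part of defs[name] ?? [symbolic(name)]) {
    if (typeof part === "string" || part.symbolic) parts.push(part);
    else parts.push(...resolve(part.ref, defs));
  }
  return parts;
}

// Render the result, writing each symbolic value as a {placeholder}.
const render = (parts) =>
  parts.map((p) => (typeof p === "string" ? p : "{" + p.symbolic + "}")).join("");

const resolved = resolve("url", definitions);
console.log(render(resolved));

// Symbolic parts double as dependency edges between API invocations.
const dependsOn = resolved.filter((p) => typeof p !== "string").map((p) => p.symbolic);
console.log(dependsOn);
```

The rendered URL keeps a placeholder where `userID` appears, and `dependsOn` records that the second request consumes a value produced by the first one.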
Big Code Analysis Applied in Practice
Using the outlined approach, we can extract web API usage information from massive code bases. Our group is actively working with this data and building services on top of the insights obtained from real-world practice.
To preview what we do with this information on API Harmony, try the following for an Instagram endpoint:
- Follow this link and click the Select a language / library drop-down list.
- Select GET Recent: Get the most recent media published by a u... (147 occurrences found on GitHub) in the Select an API Endpoint drop-down list.
- Check out the lessons learned (conventions discovered from code instances on GitHub).
We have many ideas on how to make these lessons learned valuable to developers. Stay tuned for future posts! In the meantime, what would you like to see?