APIful Blog
We're obsessed with Web APIs

Understanding Real API Practices via Big Code Analysis

Jul 13, 2016

Nowadays, more and more software projects are distributed under friendly open source licenses and hence provide more flexibility to their users and other developers. Thanks to this social coding approach, massive real-world code bases built atop various trending web APIs are publicly available.

Existing source code that interacts with web APIs encodes a lot of knowledge and experience. Usually, to consume an API from an application, developers need to understand API specifications, figure out correct request settings, unmarshal responses, integrate web API requests in meaningful ways, and finally make sure that APIs behave appropriately via several rounds of careful debugging and testing.

Presumably, every developer who wants to consume web APIs has to go through similar, time-consuming procedures. For API Harmony, we asked ourselves whether we could leverage existing source code to make things easier for developers. Specifically, could we leverage massive real-world code bases to learn how to use web APIs? And how could we make this work?

Benefiting from Big Code Analysis

By applying big code analysis, both developers and API providers can benefit from real-world API usage information. The advantages include, but are not limited to, the following:

  • Developers may re-use real-world examples of invoking the same APIs as tutorials or code templates.

  • Developers may obtain feedback about inappropriate API practices (e.g., wrong endpoints, potentially missing parameters, or incorrect parameter assignments) or about new API version releases.

  • Developers may get recommendations on popular API usages. For example, they could learn popular combinations of request parameters. In addition, they could have rough ideas about the data structure of the responses. Although potentially incomplete, such information may be helpful.

  • If one API usually consumes the response of another, developers may learn such dependencies and see how they are assembled together in real-world examples.

  • When API documentation or API specifications are incomplete or unavailable, developers may still be able to learn how to use APIs from code examples. Under some conditions, it may even be possible to infer or generate API specifications from API usages.

  • Inconsistencies among API specifications, documentation, and code can be detected.

  • Combined with an advanced understanding of API specifications, this information can be used to notify and even motivate API providers to update imprecise specifications or documentation.

  • API providers can gain insights about hot APIs and the preferences of their users, which may be helpful for API design and optimization.

Although these benefits look encouraging, realizing them requires a systematic way to understand web API usage in massive code bases. The first step, and the key to everything else, is extracting API usage information from arbitrary programs. Therefore, in this post, we describe our approach to web API usage extraction. Details on the other components of our solution for providing insights from real API usage will be revealed in follow-up posts.

Usage Extraction via Program Analysis

Given that JavaScript programs are among the most popular artifacts interacting with web APIs, in this section we use a simplified version of an example found on GitHub to explain our approach.

This snippet invokes two endpoints of the Instagram API: the first retrieves the user ID for a given username, and the second loads recent media posted by that user.
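The original GitHub snippet is not reproduced here, so the following is a simplified reconstruction of the kind of code described, assuming jQuery and the Instagram v1 endpoints named in this post. The small `$.ajax` stub is our own stand-in so the sketch runs without a browser or network; in the real snippet, `$` is jQuery and the requests hit the live API. The variable names `usuario`, `token`, and `userID` match those referenced in the text.

```javascript
// Stand-in for jQuery's $.ajax so this sketch runs without a browser or
// network; the canned responses below are illustrative, not real API data.
var canned = {
  search: { data: [{ id: '12345', username: 'usuario' }] },
  recent: { data: [{ type: 'image', link: 'https://instagram.com/p/abc' }] }
};
var $ = {
  ajax: function (settings) {
    var reply = settings.url.indexOf('/users/search') !== -1 ? canned.search : canned.recent;
    settings.success(reply); // simulate a successful response
  }
};

var usuario = 'natgeo';      // username to look up (placeholder value)
var token = 'ACCESS_TOKEN';  // Instagram access token (placeholder value)
var userID;
var recentMedia;

// First endpoint: resolve the username to a numeric user ID.
$.ajax({
  type: 'GET',
  url: 'https://api.instagram.com/v1/users/search?q=' + usuario +
       '&access_token=' + token,
  dataType: 'jsonp',
  success: function (data) {
    userID = data.data[0].id; // response field access of interest
  }
});

// Second endpoint: load recent media for that user. Note the dependency:
// userID flows from the first response into this request URL.
$.ajax({
  type: 'GET',
  url: 'https://api.instagram.com/v1/users/' + userID +
       '/media/recent/?access_token=' + token,
  dataType: 'jsonp',
  success: function (data) {
    recentMedia = data.data;
  }
});
```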

For each Web API invocation, we are interested in identifying the following information:

  • Request parameters and their assignments. They indicate which API endpoint is invoked and how the request is configured. For example, the first API invocation is a GET request sent to https://api.instagram.com/v1/users/search?q=XXX&access_token=YYY. We also want to resolve the values of the variables usuario and token in the URL string concatenation so that we can obtain the complete request string and, with it, the values of the parameters q and access_token.

  • A list of response fields accessed in the response event handlers. They reveal the data structure of the response and what the unmarshalling logic looks like. For example, the success handler of the first request accesses data.data[0].id.

  • Dependencies among API invocations. They can indicate interesting data flows among APIs. For instance, the request URL of the second API invocation includes the variable userID, which is defined in the response handler of the first request.
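Put together, the three kinds of information above could be represented as one usage record per invocation. The schema below is a hypothetical illustration of such a record for the second request, not API Harmony's actual output format:

```javascript
// Hypothetical shape of one extracted usage record (illustrative only).
// <userID> and <token> mark values known only symbolically at analysis time.
var usageRecord = {
  method: 'GET',
  url: 'https://api.instagram.com/v1/users/<userID>/media/recent/?access_token=<token>',
  parameters: { access_token: '<token>' },
  // Fields the response handler touches, as dotted paths.
  responseFieldsAccessed: ['data', 'data.0', 'data.0.type'],
  // This request consumes a value produced by another invocation.
  dependsOn: [{ variable: 'userID', definedBy: 'GET /v1/users/search' }]
};
```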

What are the challenges?

As the above example shows, we need to understand the data flows (e.g., variable define-use chains and potential aliasing relations) and control flows (e.g., function calls and all possible execution paths) in the program. In addition, we want to analyze a wide variety of snippets found in massive code bases. These requirements make extracting web API usage information difficult.

Dynamic analysis may not be an option

We want to analyze code that is publicly available but does not necessarily have running services we can invoke. Invoking web APIs often requires API keys; to provision keys, a dynamic analysis would need to register and agree to terms of service, which are not always easy for a layperson to understand. Moreover, terms of service are rarely encoded in machine-readable form, so a program cannot decide whether we can comply with them. Finally, even if these problems were solved, dynamic analysis always faces the challenge of ensuring adequate code coverage.

Pattern-based text search won't work

A simple static approach is to use searches based on regular expressions, such as grep, to extract strings related to base URLs. However, in real-world applications the URL in a request is usually not a simple constant; instead, it is assembled during execution (e.g., the URL of the second request in the above example). Consequently, grep is of limited use, since resolving the URL and its parameters requires nontrivial data flow analysis.
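To illustrate the limitation, here is a toy regular-expression search over a source line where the URL is assembled at runtime (grep would behave the same way): the pattern only recovers the constant prefix, losing the path segment built from userID and the access_token parameter.

```javascript
// A source line in which the request URL is assembled at runtime.
var sourceLine =
  "url: 'https://api.instagram.com/v1/users/' + userID + '/media/recent/?access_token=' + token,";

// Pattern search finds string literals that look like URLs...
var matches = sourceLine.match(/https?:\/\/[^'"]+/g);

// ...but only the constant prefix survives; the rest of the URL and its
// parameters would require data flow analysis to resolve.
console.log(matches); // ['https://api.instagram.com/v1/users/']
```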

Our solution: static program analysis + string analysis

Analyzing JavaScript statically is known to be difficult due to the language's dynamic features. However, it is still feasible to extract fairly accurate web API usage information via static analysis. The following figure gives an overview of our web API usage extractor. Its input is a program mined via code search; its output is extracted web API usage data, including URLs and request payloads.

Call Graph Construction

The first step in the analysis is to build a directed graphical representation of the calling relations among functions in the program. With the call graph, we can figure out the execution order of statements and further understand the complicated data/control flows in the program.

However, static program analysis of JavaScript is notoriously hard to scale to whole-program analysis of framework-based web applications. We therefore leverage a field-based call graph constructor to obtain a fairly precise call graph.
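The idea behind field-based call graph construction can be sketched as follows: every function ever assigned to a property named f is treated as a possible target of any call x.f(), regardless of the receiver. This is a deliberate over-approximation that scales well. The toy model below illustrates the idea over pre-extracted program facts; it is not the actual constructor we use.

```javascript
// Property assignments as they might be collected from a program's AST:
// each records that some function was stored under some property name.
var propertyAssignments = [
  { property: 'success', fn: 'handleUserSearch' },
  { property: 'success', fn: 'handleRecentMedia' },
  { property: 'error',   fn: 'logFailure' }
];

// Field-based resolution: a call site obj.success(...) resolves to every
// function assigned to a property named 'success' anywhere in the program.
function resolveCallees(propertyName) {
  return propertyAssignments
    .filter(function (a) { return a.property === propertyName; })
    .map(function (a) { return a.fn; });
}

console.log(resolveCallees('success')); // ['handleUserSearch', 'handleRecentMedia']
```

The imprecision (both success handlers are possible targets) is the price paid for an analysis that stays tractable on framework-heavy code.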

Locating request invocations

As we are only interested in API invocations, we first identify statements that make such requests. We do this by looking for framework-specific patterns on the call graph (e.g., for jQuery, function calls to $.ajax, $.get, $.post, etc.). If a statement makes a request call, we record it and use it as a seed for the interprocedural data flow analysis. Consequently, we produce no analysis output for a script in which no matching invocation statement is found.
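For jQuery, seed identification can be as simple as matching callee names against a framework-specific list. The sketch below assumes call sites have already been extracted into a simple data model; the jQuery helper names are real, but everything else is a simplification:

```javascript
// Framework-specific request patterns for jQuery.
var JQUERY_REQUEST_CALLEES = ['$.ajax', '$.get', '$.post', '$.getJSON'];

// Call sites as a call graph constructor might report them (toy data).
var callSites = [
  { id: 'c1', callee: '$.ajax' },
  { id: 'c2', callee: 'console.log' },
  { id: 'c3', callee: '$.ajax' }
];

// Seeds for the interprocedural data flow analysis: only the call sites
// that match a known request pattern.
var seeds = callSites.filter(function (c) {
  return JQUERY_REQUEST_CALLEES.indexOf(c.callee) !== -1;
});

console.log(seeds.length); // 2
```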

Understanding request preparations and response handling

In the next step, we extract the statements that prepare the web API invocations. Starting from each request function call captured in the previous step, we compute where and how variables are defined. We recover all possible flows that lead to the request, track down variable definitions, and assemble the pieces together. If the value of a variable cannot be determined statically (e.g., the variable userID, which is defined by the response handler of the first request), we use a special symbol to denote its nondeterministic, symbolic value. Since strings and string operations are pervasive in web applications, we model common string operators (e.g., concat and encodeURI) with support for symbolic values. This also allows us to track dependencies among APIs by following the define-use chains of these symbolic variables.
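The string model can be sketched as follows: statically known parts stay literal, while values that cannot be resolved become named symbolic placeholders that survive concatenation. This is a simplified illustration of the idea, not our actual implementation:

```javascript
// A symbolic value: a name standing in for a statically unknown string.
function sym(name) { return { symbolic: true, name: name }; }

// Model of string concatenation that tolerates symbolic operands:
// literals pass through, symbolic values become <name> placeholders.
function concatAll(parts) {
  return parts.map(function (p) {
    return p && p.symbolic ? '<' + p.name + '>' : String(p);
  }).join('');
}

// Resolving the URL of the second request: userID and token are unknown
// statically, so they remain as placeholders in the result.
var url = concatAll([
  'https://api.instagram.com/v1/users/', sym('userID'),
  '/media/recent/?access_token=', sym('token')
]);
console.log(url);
// 'https://api.instagram.com/v1/users/<userID>/media/recent/?access_token=<token>'
```

The placeholders double as dependency markers: seeing <userID> in one request's URL and a definition of userID in another request's handler links the two invocations.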

Identifying the data structure in the response follows a similar procedure based on the data flow analysis.
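As a runnable stand-in for that static analysis, the property paths a handler touches can be illustrated with a logging Proxy; our extractor derives the same paths statically from the data flow rather than by executing the handler.

```javascript
// Wrap an object so every property access is recorded as a dotted path,
// relative to the handler's response parameter.
function traced(paths, prefix) {
  prefix = prefix || [];
  return new Proxy({}, {
    get: function (target, prop) {
      var path = prefix.concat(String(prop));
      paths.push(path.join('.'));
      return traced(paths, path); // allow chained accesses to keep logging
    }
  });
}

var paths = [];
var data = traced(paths); // plays the role of the response argument

// The handler body from the first request:
var userID = data.data[0].id;

// paths is now ['data', 'data.0', 'data.0.id'], revealing the response
// data structure the handler relies on.
```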

Big Code Analysis Applied in Practice

Using the outlined approach, we can extract web API usage information from massive code bases. Our group is actively working with this data and building services atop the insights obtained from real-world practices.

To preview what we do with this information, visit API Harmony for an Instagram endpoint:

  1. Follow this link and click Code Snippets.
  2. Pick JavaScript + JQuery in the Select a language / library drop-down list.
  3. Select GET Recent: Get the most recent media published by a u... (147 occurrences found on GitHub) in the Select an API Endpoint drop-down list.
  4. Check out the lessons learned: conventions discovered from code instances on GitHub.

We have many ideas on how to make these lessons learned valuable to developers. Stay tuned for future posts! In the meantime, what would you like to see?

Share via a Tweet or follow us for everything APIs!