Is It Time for a Data Sharing Clearinghouse for Internet Researchers?

Nick Feamster

Today's Senate hearing with Facebook's Mark Zuckerberg will start a long discussion on data collection and privacy from Internet companies. Although the spotlight is currently on Facebook, we shouldn't forget that the picture is broader: companies from device manufacturers to ISPs collect network traffic and use it for a variety of purposes.

The uses that we will hear about today are largely about the widespread collection of data about Internet users for targeted content delivery and advertising. Meanwhile, yesterday Facebook announced an initiative to share data with independent researchers to study social media's impact on elections. At the same time that Facebook is being raked over the coals for sharing its data with "researchers" (Cambridge Analytica), it has announced a program to share its data with (presumably more "legitimate") researchers.

Internet researchers depend on data. Sometimes, we can gather the data ourselves, using measurement tools deployed at the edge of the Internet (e.g., in-home networks, on phones). In other cases, we need data from the companies that operate parts of the Internet, such as an Internet service provider (ISP), an Internet registrar, or an application provider (e.g., Facebook).

  • If incentives align, data flows to the researcher. Interacting with a company can work very well when goals are aligned. I've worked well with companies to develop new spam filtering algorithms, to develop new botnet detection algorithms, and to highlight empirical results that have informed policy debates.
  • If incentives do not align, then the researcher probably won't get the data. When research is purely technical, incentives often align. When the technical work crosses over into policy (as it does in areas like net neutrality, and as we are seeing with Facebook), there can be (insurmountable) hurdles to data access.

How an Internet Researcher Gets Data Today

How do Internet researchers get data from companies today? An Internet operator I know aptly characterizes the status quo:

"Show Internet operators you can do something useful, and they'll give you data."

Researchers get access to Internet data from companies in two ways: (1) working for the company (as an employee), or (2) working with the company (as an "independent" researcher).

Option #1: Work for a Company.

Working for a company offers privileged access to data, which can be used to mint impressive papers (irreproducibility aside) simply because nobody else has the same data. I have taken this approach myself on a number of occasions, having worked for an ISP (AT&T), a DNS company (Verisign), and an Internet security service provider (Damballa).

How this approach works. In the 2000s, research labs at AT&T and Sprint had privileged access to data, which gave rise to a proliferation of papers on "Everything You Wanted to Know About the Internet Backbone But Were Afraid to Ask". Today, the story repeats itself, except that the players are Google and Facebook, and the topic du jour is data-center networks.

Shortcomings of this approach. Much research, from projects with a longer arc to certain policy-oriented questions, would never come to light if we relied only on company employees to do it. By the nature of their work, company employees lack independence: they lack both autonomy in selecting problems and the ability to take positions or publish results that run counter to the company's goals or priorities. This shortcoming may not matter if what the researcher wants to work on and what the company wants to accomplish are the same. For many technical problems, this is the case (although there is still the tendency for the technical community to develop tunnel vision around areas where there is an abundance of data, while neglecting other areas). But for many problems, ranging from those with a longer arc to deployment to those that may run counter to the company's priorities, we can't rely on industry to do the work.

Option #2: Work with a Company.

How this approach works. A researcher may instead work with a company, typically gaining privileged access to data for a particular project. Sometimes, we demonstrate the promise of a technique with some data that we can gather or bootstrap without any help, and then use that initial study to pique the interest of a company, which may then share data with us to further develop the idea.

Shortcomings of this approach. Research done in collaboration with a company often has shortcomings similar to those of research done within a company's walls. If the results of the research align with the company's perspectives and viewpoints, then data sharing is copacetic. Even these cooperative settings pose some risks to researchers, who may create the perception that they are not independent merely by their association with the company. With purely technical research, the risks are lower, though still non-zero: for example, because the work depends on privileged data access, the researcher may still face challenges in presenting the research in a way that could help others reproduce it in the future.

With technical work that can inform or speak to policy questions, there are two concerns. First, certain types of research or results may never come to light: if a company doesn't like the result that the data analysis may produce, then it may simply not share the data, or it may ask for "pre-publication review" of results based on that data (this practice is common for research that is conducted within companies as well). The second concern is more subtle. Even when the work is technically watertight, a researcher can still face questions, fair or unfair, about the soundness of the work due to the perceived motivations or agendas of the cooperating parties involved.

Current Data Sharing Approaches are Good, But They are Not Sufficient

The above methods for data sharing can work well for certain types of research. In my career, I have made hay playing by these rules, often working with a company after first demonstrating the viability of an idea with a smaller dataset that we gather ourselves and then "pitching" the idea to the company.

Yet, in my experience, these approaches have two shortcomings. The first relates to incentives. The second relates to privacy.

Problem #1: Incentives.

Certain types of work depend on access to Internet data, but the company that holds the data may not have a direct incentive to facilitate the research. Possible studies of Facebook's effect on elections certainly fall into this category: Facebook simply may not like the results of the research.

But, there are plenty of other lines of research that fall into the category where incentives may not align. Other examples range from measurements of Internet capacity and performance as they relate to broadband regulation (e.g., net neutrality) to evaluation of an online platform's content moderation algorithms and techniques. Lots of other work relating to consumer protection falls into this category as well. We have to rely on users and researchers measuring things at the edge of the network to figure out what's going on; from this vantage point, certain activities may naturally slip under the radar more easily.

The current Internet data sharing roadmap doesn't paint a rosy picture for research where incentives don't align. Even when incentives do align, there can be perceptions of "capture" — effectively shilling an intellectual or technical finding in exchange for data access.

It is in the interests of everyone, academics and their industry partners alike, to establish more formal modes of data exchange when either (1) there is a determination that the problem is important to study for the health of the Internet or for the benefit of consumers, or (2) there is the potential that the research will be perceived as not objective due to the nature of the data sharing agreement.

Problem #2: Privacy.

Sharing Internet data with researchers can introduce substantial privacy risks, and the need to share data with any researcher who works with a company should be evaluated carefully — ideally by an independent third party.

When helping develop the researcher exception to the FCC's broadband privacy rules, I submitted a comment that proposed the following criteria for sharing ISP data with researchers:

  1. Purpose of research. The data satisfies research that aims to promote security, stability, and reliability of networks. The research should have clear benefits for Internet innovation, operations, or security.
  2. Research goals do not violate privacy. The goals of the research do not include compromising consumer privacy.
  3. Privacy risks of data sharing are offset by benefits of the research. The risks of the data exchange are offset by the benefits of the research.
  4. Privacy risks of the data sharing are mitigated. Researchers should strive to use de-identified data wherever possible.
  5. The data adds value to the research. The research is enhanced by access to the data.

Yet, outlining the criteria is one thing. The thornier question (which we did not address!) is: Who gets to decide the answers?

Universities have institutional review boards that can help evaluate the merits of such a data sharing agreement. But, Cambridge Analytica might have the veneer of "research," and a company may have no internal incentive to independently evaluate the data sharing agreement on its merits. In light of recent events, we may be headed towards the conclusion that such data-sharing agreements should always be vetted by independent third-party review. If the research doesn't involve a university, however, the natural question is: Who is that third party?

Looking Ahead: Data Clearinghouses for Internet Data?

Certain types of Internet research, particularly those that involve thorny regulatory or policy questions, could benefit from an independent clearinghouse, where researchers could propose studies and experiments and have them evaluated and selected by an independent third party, based on their benefits and risks. Facebook is exploring this avenue in the limited setting of election integrity. This is an exciting step.

Moving forward, it will be interesting to see how Facebook's meta-experiment on data sharing plays out, and whether it, or some variant, can serve as a model for Internet data sharing for other types of work writ large. In purely technical areas, such a clearinghouse could allow a broader range of researchers to explore, evaluate, reproduce, and extend the types of work that for now remain largely irreproducible because data is under lock and key. For these questions, there could be significant benefit to the scientific community. In areas where the technical work or data analysis informs policy questions, the benefits to consumers could be even greater.

By Nick Feamster, Professor at Princeton University
Share your comments

Infostack framework Michael Elling  –  May 16, 2018 2:39 AM PDT

The FCC should be tasked with collecting data to ensure that pricing more or less aligns with marginal costs so that, from a societal perspective, supply clears demand efficiently. This is accomplished by mandating interconnection as far out and down in the stack as possible and then guiding, not regulating, interconnection agreements. Very quickly, horizontally scaled business models will develop (more along the lines of the OSI stack, not the IP stack) with settlements that flow both north-south and east-west. In today's internet there are no settlements flowing east-west, and there is a tendency towards monopoly north-south, such as AWS, to maintain an early lead or monopoly in the application layer. So the north-south flow is broken by the lack of east-west settlements.

A referential framework that captures geodensity, network boundaries and traffic flows east-west, along with a network function/protocol stack resembling the OSI model north-south, and finally application/market segment differences along a 3rd vector/axis is referred to as the informational stack, or Infostack: http://bit.ly/142RTFk

Data gathered in such a reference framework could be aggregated, keeping critical corporate and individual data private. Moreover, such data should be freely and easily accessible, making the job of the regulator as a monitor and adjudicator, should disputes arise, that much easier. Of course, the regulator would need to develop its own analytical group to maintain objectivity. But in the process, competitive forces (pricing, rates of investment, technology propagation and duration, capacity utilization, etc.) would be more transparent, serving society's goals (primarily market-driven universal service) better.
