<!DOCTYPE html>
<html lang='en' dir='auto'><head>
<meta charset='utf-8'>
<meta name='viewport' content='width=device-width, initial-scale=1'>
<meta name='description' content='Introduction This document presents the datasets generated for Eclipse DataEggs, discusses the implications it has regarding privacy, and describes what has been achieved to ensure data is safe.
All datasets are anonymised: fields that could be used to identify individuals or companies either directly or indirectly have been transformed using the Anonymise::Utility Perl module.
The intended audience of the datasets is composed of:
Research laboratories, mainly in the field of software engineering.'>
<meta name='theme-color' content='#ffcd00'>
<meta property='og:title' content='Datasets Privacy • Eclipse DataEggs'>
<meta property='og:description' content='Introduction This document presents the datasets generated for Eclipse DataEggs, discusses the implications it has regarding privacy, and describes what has been achieved to ensure data is safe.
All datasets are anonymised: fields that could be used to identify individuals or companies either directly or indirectly have been transformed using the Anonymise::Utility Perl module.
The intended audience of the datasets is composed of:
Research laboratories, mainly in the field of software engineering.'>
<meta property='og:url' content='https://download.eclipse.org/dataeggs/privacy/'>
<meta property='og:site_name' content='Eclipse DataEggs'>
<meta property='og:type' content='article'><meta property='article:section' content='page'><meta name='twitter:card' content='summary'>
<meta name="generator" content="Hugo 0.83.1" />
<title>Datasets Privacy • Eclipse DataEggs</title>
<link rel='canonical' href='https://download.eclipse.org/dataeggs/privacy/'>
<link rel='icon' href='/dataeggs/favicon.ico'>
<link rel='stylesheet' href='/dataeggs/assets/css/main.ab98e12b.css'><link rel='stylesheet' href='/dataeggs/css/custom.css'><style>
:root{--color-accent:#ffcd00;}
</style>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-3675452-15', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
</head>
<body class='page type-page has-sidebar has-emoji'>
<div class='site'><div id='sidebar' class='sidebar'>
<a class='screen-reader-text' href='#main-menu'>Skip to Main Menu</a>
<div class='container'><section class='widget widget-about sep-after'>
<header>
<div class='logo'>
<a href='/dataeggs/'>
<img src='/dataeggs/images/dataeggs-menu.png'>
</a>
</div>
<div class='desc'>
Open. Safe. Easy.
</div>
</header>
</section>
<section class='widget widget-search sep-after'>
<header>
<h4 class='title widget-title'>Search</h4>
</header>
<form action='/dataeggs/search' id='search-form' class='search-form'>
<label>
<span class='screen-reader-text'>Search</span>
<input id='search-term' class='search-term' type='search' name='q' placeholder='Search&hellip;'>
</label></form>
</section>
<section class='widget widget-sidebar_menu sep-after'><nav id='sidebar-menu' class='menu sidebar-menu' aria-label='Sidebar Menu'>
<div class='container'>
<ul><li class='item'>
<a href='/dataeggs/'>Home</a></li><li class='item current'>
<a aria-current='page' href='/dataeggs/privacy/'>Privacy</a></li><li class='item'>
<a href='/dataeggs/aeri_stacktraces/'>AERI</a></li><li class='item'>
<a href='/dataeggs/eclipse_mls/'>MLS</a></li></ul>
</div>
</nav>
</section><section class='widget widget-social_menu sep-after'><nav aria-label='Social Menu'>
<ul><li>
<a href='https://gitlab.eclipse.org/dataeggs/dataeggs' target='_blank' rel='noopener me'>
<span class='screen-reader-text'>Open Gitlab account in new tab</span><svg class='icon' xmlns='http://www.w3.org/2000/svg' viewbox='0 0 24 24' stroke-linecap='round' stroke-linejoin='round' stroke-width='2' aria-hidden='true'>
<title>GitLab icon</title> <path d="M22.65 14.39L12 22.13 1.35 14.39a.84.84 0 0 1-.3-.94l1.22-3.78 2.44-7.51A.42.42 0 0 1 4.82 2a.43.43 0 0 1 .58 0 .42.42 0 0 1 .11.18l2.44 7.49h8.1l2.44-7.51A.42.42 0 0 1 18.6 2a.43.43 0 0 1 .58 0 .42.42 0 0 1 .11.18l2.44 7.51L23 13.45a.84.84 0 0 1-.35.94z"/>
</svg>
</a>
</li><li>
<a href='mailto:boris@chrysalice.org' target='_blank' rel='noopener me'>
<span class='screen-reader-text'>Contact via Email</span><svg class='icon' xmlns='http://www.w3.org/2000/svg' viewbox='0 0 24 24' stroke-linecap='round' stroke-linejoin='round' stroke-width='2' aria-hidden='true'>
<path d="M4 4h16c1.1 0 2 .9 2 2v12c0 1.1-.9 2-2 2H4c-1.1 0-2-.9-2-2V6c0-1.1.9-2 2-2z"></path><polyline points="22,6 12,13 2,6"></polyline>
</svg>
</a>
</li></ul>
</nav>
</section></div>
<div class='sidebar-overlay'></div>
</div><div class='main'><a class='screen-reader-text' href='#content'>Skip to Content</a>
<button id='sidebar-toggler' class='sidebar-toggler' aria-controls='sidebar'>
<span class='screen-reader-text'>Toggle Sidebar</span>
<span class='open'><svg class='icon' xmlns='http://www.w3.org/2000/svg' viewbox='0 0 24 24' stroke-linecap='round' stroke-linejoin='round' stroke-width='2' aria-hidden='true'>
<line x1="3" y1="12" x2="21" y2="12" />
<line x1="3" y1="6" x2="21" y2="6" />
<line x1="3" y1="18" x2="21" y2="18" />
</svg>
</span>
<span class='close'><svg class='icon' xmlns='http://www.w3.org/2000/svg' viewbox='0 0 24 24' stroke-linecap='round' stroke-linejoin='round' stroke-width='2' aria-hidden='true'>
<line x1="18" y1="6" x2="6" y2="18" />
<line x1="6" y1="6" x2="18" y2="18" />
</svg>
</span>
</button><div class='header-widgets'>
<div class='container'>
<style>.widget-breadcrumbs li:after{content:'\2f '}</style>
<section class='widget widget-breadcrumbs sep-after'>
<nav id='breadcrumbs'>
<ol><li><a href='/dataeggs/'>Home</a></li><li><span>Privacy</span></li></ol>
</nav>
</section></div>
</div>
<header id='header' class='header site-header'>
<div class='container sep-after'>
<div class='header-info'><p class='site-title title'>Eclipse DataEggs</p><p class='desc site-desc'></p>
</div>
</div>
</header>
<main id='content'>
<article lang='en' class='entry'>
<header class='header entry-header'>
<div class='container sep-after'>
<div class='header-info'>
<h1 class='title'>Datasets Privacy</h1>
</div>
</div>
</header>
<div class='container entry-content'>
<h2 id="introduction">Introduction</h2>
<p>This document presents the datasets generated for Eclipse DataEggs, discusses their privacy implications, and describes what has been done to ensure the data is safe.</p>
<p>All datasets are anonymised: fields that could be used to identify individuals or companies either directly or indirectly have been transformed using the <a href="https://github.com/borisbaldassari/data-anonymiser">Anonymise::Utility Perl module</a>.</p>
<p>The intended audience of the datasets is composed of:</p>
<ul>
<li>Research laboratories, mainly in the field of software engineering.</li>
<li>Software engineering practitioners, who may find it useful to have real-world examples of software development projects.</li>
</ul>
<p>If you have questions or remarks about the datasets, please <a href="mailto:boris@chrysalice.org">feel free to contact us</a>. All cases related to privacy will be handled with the utmost diligence.</p>
<h2 id="description-of-the-datasets">Description of the datasets</h2>
<p>Three types of datasets are generated, each with its own schema and attributes. The first step in preserving privacy is to describe the datasets and their attributes, and to identify which fields could pose a threat.</p>
<h3 id="aeri-stacktraces">AERI stacktraces</h3>
<p>The <a href="../aeri_stacktraces/">AERI stacktraces dataset</a> contains information about exceptions encountered by users in the Eclipse IDE. It includes data about the exception itself, and the environment where it happened.</p>
<p>The <a href="../aeri_stacktraces#format-incidents">incidents dataset</a> offers the following attributes:</p>
<ul>
<li><strong>Message</strong> (String) A short text summarising the error.</li>
<li><strong>Code</strong> (Integer) The numeric status code logged with the error.</li>
<li><strong>Severity</strong> (Factors) An estimate by the user reporting the error about its perceived severity.</li>
<li><strong>Kind</strong> (Factors) The type of error recorded, as identified by the AERI system.</li>
<li><strong>Plugin ID</strong> (String) The ID of the Eclipse plugin that threw the exception.</li>
<li><strong>Plugin Version</strong> (String) The version of the Eclipse plugin that threw the exception.</li>
<li><strong>Status fingerprint</strong> (String) An identifier for the status of the incident. Used for duplicates detection.</li>
<li><strong>Incident fingerprint</strong> (String) An identifier for the incident. Used for duplicates detection.</li>
<li><strong>Incident fingerprint2</strong> (String) An identifier for the incident. Used for duplicates detection.</li>
<li><strong>Timestamp</strong> (Date ISO 8601) The time of creation of the incident.</li>
<li><strong>Saved On</strong> (Date ISO 8601) The time of last save of the problem.</li>
<li><strong>OSGi Architecture</strong> (Factors) The architecture of the host, as specified in the OSGi bundle definition.</li>
<li><strong>OSGi OS</strong> (Factors) The host operating system, as reported in OSGi.</li>
<li><strong>OSGi OS Version</strong> (Factors) The host operating system version, as reported in OSGi.</li>
<li><strong>OSGi Window Manager</strong> (Factors) The Window Manager used by the host, as reported in OSGi.</li>
<li><strong>Eclipse Build ID</strong> (String) The Build ID of the Eclipse instance running when the exception occurred.</li>
<li><strong>Eclipse Product</strong> (String) The Eclipse product impacted by the exception.</li>
<li><strong>Java runtime version</strong> (String) The Java runtime of the host.</li>
<li><strong>Comment Quality</strong> (Factors) An estimate of the user comment’s quality (thoughtfulness). User comments help people better understand the context of the exception.</li>
</ul>
<p>The <a href="../aeri_stacktraces#format-problems">problems dataset</a> offers the following attributes:</p>
<ul>
<li><strong>Summary</strong> (String) A short text summarising the error.</li>
<li><strong>Number of reporters</strong> (Integer) The number of people who reported this incident or problem.</li>
<li><strong>Number of incidents</strong> (Integer) The number of times this problem was identified in incidents.</li>
<li><strong>V1 Status</strong> (Factors) The status of the problem attached to the error report.</li>
<li><strong>Kind</strong> (Factors) The type of error recorded, as identified by the AERI system.</li>
<li><strong>Created On</strong> (Date ISO 8601) The time of first appearance of the problem in an incident.</li>
<li><strong>Modified On</strong> (Date ISO 8601) The time of last update of the problem in an incident.</li>
<li><strong>Saved On</strong> (Date ISO 8601) The time of last save of the problem.</li>
<li><strong>OSGi Architecture</strong> (Factors) The architecture of the host, as specified in the OSGi bundle definition.</li>
<li><strong>OSGi OS</strong> (Factors) The host operating system, as reported in OSGi.</li>
<li><strong>OSGi OS Version</strong> (Factors) The host operating system version, as reported in OSGi.</li>
<li><strong>OSGi Window Manager</strong> (Factors) The Window Manager used by the host, as reported in OSGi.</li>
<li><strong>Eclipse Build ID</strong> (String) The Build ID of the Eclipse instance running when the exception occurred.</li>
<li><strong>Eclipse Product</strong> (String) The Eclipse product impacted by the exception.</li>
<li><strong>Java runtime version</strong> (String) The Java runtime of the host.</li>
</ul>
<p>The <a href="https://download.eclipse.org/dataeggs/aeri_stacktraces//incidents_bundles_extract.csv.bz2">incidents bundle</a> offers the following attributes:</p>
<ul>
<li><strong>Bundle name</strong> (String) The name of the bundle impacted by the exception.</li>
<li><strong>Bundle version</strong> (String) The version of the bundle impacted by the exception.</li>
<li><strong>Value</strong> (Integer) The number of times the exception appeared for this bundle (name + version).</li>
</ul>
<h3 id="eclipse-mailing-lists">Eclipse Mailing lists</h3>
<p>The <a href="../eclipse_mls/mbox_analysis.html">Eclipse mailing lists dataset</a> offers the following attributes:</p>
<ul>
<li><strong>List</strong> (String) The mailing list and project of the post.</li>
<li><strong>messageId</strong> (String) A unique identifier for the post.</li>
<li><strong>Subject</strong> (String) The subject of the post as sent on the mailing list.</li>
<li><strong>Sent at</strong> (Date ISO 8601) The time of sending for the post.</li>
<li><span style="font-size:120%"></span> <strong>Sender name</strong> (String) The name of the sender of the post. Names are obfuscated, e.g. <code>HKmwHIC4dREThJRj</code>.</li>
<li><span style="font-size:120%"></span> <strong>Sender address</strong> (String) The email address of the sender of the post. Email address is obfuscated, e.g. <code>xzrEaN24LhYew151@HAYhBP6A1UVpXiHt</code>.</li>
</ul>
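<p>Because each sender is obfuscated to a stable string, per-sender analyses remain possible without knowing who the sender is. The following is a minimal sketch only: the CSV header names (<code>list</code>, <code>sender_addr</code>) and the sample rows are made up for illustration and may not match the actual file layout.</p>

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the header names are hypothetical, not the real schema.
SAMPLE = """list,sender_addr
dev-list,xzrEaN24LhYew151@HAYhBP6A1UVpXiHt
dev-list,Qp0cW8sLmT3uVb7e@HAYhBP6A1UVpXiHt
user-list,xzrEaN24LhYew151@HAYhBP6A1UVpXiHt
"""

# Count posts per obfuscated sender address.
reader = csv.DictReader(io.StringIO(SAMPLE))
posts_per_sender = Counter(row["sender_addr"] for row in reader)

for sender, count in posts_per_sender.most_common():
    print(sender, count)
```

<p>To run this against a real extract, replace the in-memory sample with the opened dataset file and adjust the column name to the one used in that file.</p>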
<h3 id="eclipse-projects-extracts">Eclipse projects extracts</h3>
<p>The <a href="../projects/eclipse_projects.html">Eclipse projects extracts</a> contain different sets of data depending on the sources available for each project. The full list of extracts follows, with the attributes that include privacy-related information highlighted.</p>
<ul>
<li>Git (Software Configuration Management)
<ul>
<li><strong>git_commits_evol.csv</strong> contains the daily number of commits and distinct authors.</li>
<li><span style="font-size:120%"></span> <strong>git_log.txt</strong> contains the output of the <code>git log</code> command, including the name and email of commit authors. Names are replaced by XXX&rsquo;s and email addresses are obfuscated, e.g. <code>xzrEaN24LhYew151@HAYhBP6A1UVpXiHt</code>.</li>
</ul>
</li>
<li>Bugzilla (Issue tracking)
<ul>
<li><strong>bugzilla_components.csv</strong> contains the number of issues submitted against each component.</li>
<li><strong>bugzilla_evol.csv</strong> contains the daily number of issues submitted and distinct authors.</li>
<li><span style="font-size:120%"></span> <strong>bugzilla_issues.csv</strong> contains the list of issues for the project, including the emails of the author and the assignee for each submitted issue. Emails are obfuscated, e.g. <code>xzrEaN24LhYew151@HAYhBP6A1UVpXiHt</code>.</li>
<li><span style="font-size:120%"></span> <strong>bugzilla_issues_open.csv</strong> contains the list of open issues for the project, including the emails of the author and the assignee for each submitted issue. Emails are obfuscated, e.g. <code>xzrEaN24LhYew151@HAYhBP6A1UVpXiHt</code>.</li>
</ul>
</li>
<li>Forums (User-oriented communication)
<ul>
<li><strong>eclipse_forums_posts.csv</strong> contains the full list of posts on the project&rsquo;s forum. It includes an Integer representation of the author of the post as returned by the API (no obfuscation needed).</li>
<li><strong>eclipse_forums_threads.csv</strong> contains the full list of threads on the project&rsquo;s forum. It includes an Integer representation of the first and last author of the thread, as returned by the API (no obfuscation needed).</li>
</ul>
</li>
<li>PMI (project metadata)
<ul>
<li><strong>eclipse_pmi_checks.csv</strong> contains a list of checks (values, usefulness, consistency) applied to the Project Management Infrastructure record for the project.</li>
</ul>
</li>
<li>SonarQube (code analysis)
<ul>
<li><strong>sq_issues_blocker.csv</strong> contains the list of SonarQube issues with severity set to blocker.</li>
<li><strong>sq_issues_critical.csv</strong> contains the list of SonarQube issues with severity set to critical.</li>
<li><strong>sq_issues_major.csv</strong> contains the list of SonarQube issues with severity set to major.</li>
<li><strong>sq_metrics.csv</strong> contains the list of metrics computed by SonarQube.</li>
</ul>
</li>
</ul>
<h2 id="anonymisation">Anonymisation</h2>
<p>The mechanism used to anonymise the data is the <a href="https://github.com/borisbaldassari/data-anonymiser">Anonymise::Utility Perl module</a>. It uses asymmetric encryption to generate a one-off mapping between clear IDs and obfuscated strings.</p>
<p><img src="../images/data_transformation.png" alt="Data transformation"></p>
<p>The private key is thrown away, preventing any recovery of the encrypted IDs. This technique has several advantages:</p>
<ul>
<li>Identical clear-text strings are translated to the same obfuscated string. This enables researchers and analysts to identify repeated occurrences of the same item without any information about its actual content.</li>
<li>The private key is thrown away immediately, making it impossible for an attacker to use it to decrypt the dataset. The algorithm used is the <a href="https://metacpan.org/pod/Crypt::PK::RSA">Perl implementation of RSA</a>, which is considered reasonably strong for our purpose.</li>
<li>The public key is re-generated for each session, making it impossible for an attacker to rebuild the mapping or use rainbow tables.</li>
</ul>
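<p>The properties listed above can be illustrated with a short sketch. This is not the algorithm of Anonymise::Utility: it swaps in a keyed hash (HMAC-SHA256) with a random per-session key in place of RSA, simply to show a mapping that is stable within a session and cannot be rebuilt once the key is gone. The function names, the 16-character output length, and the separate handling of the local part and domain of an email address are assumptions made for this example.</p>

```python
import hashlib
import hmac
import secrets


def make_pseudonymiser():
    """Return a function mapping clear-text IDs to obfuscated strings.

    Assumption: HMAC-SHA256 with a random per-session key stands in for the
    RSA-based scheme of Anonymise::Utility; it is not that module's algorithm.
    """
    # Fresh random key for this session only; it is never stored or returned.
    session_key = secrets.token_bytes(32)

    def pseudonymise(clear_id):
        digest = hmac.new(session_key, clear_id.encode("utf-8"), hashlib.sha256)
        # Shorten the hex digest to 16 characters for readability.
        return digest.hexdigest()[:16]

    return pseudonymise


def pseudonymise_email(pseudonymise, address):
    """Obfuscate the local part and the domain separately, keeping the '@'."""
    local, _, domain = address.partition("@")
    return pseudonymise(local) + "@" + pseudonymise(domain)


session_a = make_pseudonymiser()
session_b = make_pseudonymiser()

first = pseudonymise_email(session_a, "alice@example.org")
again = pseudonymise_email(session_a, "alice@example.org")
other = pseudonymise_email(session_b, "alice@example.org")

print(first == again)  # same session: identical input, identical output
print(first == other)  # new session, new key: the earlier mapping is not reproduced
```

<p>In this sketch the key lives only inside the closure, so once the session's function is discarded there is nothing left from which the mapping could be recomputed.</p>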
<p><strong>The resulting datasets contain no email addresses, names, user IDs or machine IDs.</strong></p>
<h2 id="privacy-compliance">Privacy compliance</h2>
<p>The management and publication of data in the European Union is governed by the <strong>General Data Protection Regulation</strong> (GDPR), which also addresses the export of data outside the EU and EEA. Since we are EU citizens &ndash; and considering also that the Crossminer project is funded by the H2020 EU research program &ndash; we must abide by this regulation. Beyond the legal implications of publishing open datasets, we want to make sure that everybody involved in the data, whether individuals or companies, is safe.</p>
<p>In the case of software engineering data, there is a <a href="https://github.com/dspinellis/awesome-msr">huge amount of public information</a> readily available without any restrictions. Most, if not all, tools used in the open-source world provide information about who did what and when &ndash; which is undoubtedly useful for collaboration and community. It is also mandatory for intellectual property processes: when one contributes a file to an open-source project, it is at the very least good practice to put one&rsquo;s name (and maybe email address) in the header of the file along with the licence used. When Intellectual Property is an important concern, as it is for the Eclipse Foundation, this is simply <em>required</em>, since we need to know who the work belongs to in the case of IP issues and lawsuits.</p>
<p>The publication of open data in this context, i.e. with the original data already publicly available from public tools, is a specific case under the GDPR, and reliable guidance on how it should be conducted is hard to find. As a result, we relied on similar studies and articles and proceeded on a best-effort basis to provide our users with datasets that are as useful and safe as possible.</p>
<p>Considering that:</p>
<ul>
<li><strong>Original data is already publicly available</strong> through the tools themselves (Git, Bugzilla, Mailing lists and forums) and their APIs.</li>
<li>We provide a <strong>complete description</strong> of the content of the datasets, <strong>identifying the risks</strong> and <strong>describing the mitigation steps</strong> we went through to ensure that the data is safe.</li>
<li>To the best of our knowledge, <strong>there is no way to decrypt or reverse-engineer the obfuscated information</strong>. The anonymisation method is strong enough that only someone who already knows the original data could re-identify it.</li>
</ul>
<p>Considering also that:</p>
<ul>
<li>The goal of this processing is to provide <strong>free and open resources to help scientific research</strong>, which is in the <strong>public interest</strong> as defined in <a href="https://gdpr-info.eu/art-6-gdpr/">Article 6.1 (e)</a>.</li>
<li>The Eclipse forge hosts open source and collaborative projects only, and all contributions are made under a <strong>required signed agreement</strong> known as the <a href="https://www.eclipse.org/legal/ECA.php">Eclipse Contributor Agreement</a>: people explicitly and knowingly give their consent to make their contribution public.</li>
</ul>
<p>We assume that both the <strong>data itself and its publication are safe</strong>, regarding both the users and the current regulation.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&amp;from=EN">GDPR official text (HTML)</a></li>
<li><a href="https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679&amp;from=EN">GDPR official text (PDF)</a></li>
<li><a href="https://blogs.openaire.eu/?p=3248">GDPR and the research process: What you need to know</a></li>
<li><a href="https://blog.infinigate.co.uk/gdpr-personal-data-public-domain">GDPR &amp; Personal Data in the Public Domain</a></li>
<li><a href="https://www.europeandataportal.eu/en/highlights/how-address-privacy-concerns-when-opening-data">How to address privacy concerns when opening data</a></li>
</ul>
</div>
</article>
</main>
<footer id='footer' class='footer'>
<div class='container sep-before'><div class="row">
<div class="column">
<a href="http://www.eclipse.org/" target="_blank"><img src="/dataeggs/images/logo-eclipse-foundation.png" alt="Eclipse Foundation logo"></a>
</div>
<div class="column">
<p></p>
<p id="copyright">Copyright © 2021 Eclipse Foundation, Inc.<br>All Rights Reserved.</p>
</div>
</div>
<div class="row">
<p><a href="http://www.eclipse.org/legal/privacy.php" target="_blank">Privacy Policy</a> /
<a href="http://eclipse.org/" target="_blank">Eclipse</a> /
<a href="http://www.eclipse.org/legal/termsofuse.php" target="_blank">Terms of Use</a> /
<a href="http://www.eclipse.org/legal/copyright.php" target="_blank">Copyright Agent</a> /
<a href="http://www.eclipse.org/legal/" target="_blank">Legal</a> /
<a href="http://www.eclipse.org/org/foundation/contact.php" target="_blank"> Contact Us</a></p>
</div>
</div>
</footer>
</div>
</div><script>window.__assets_js_src="/dataeggs/assets/js/"</script>
<script src='/dataeggs/assets/js/main.c3bcf2df.js'></script><script src='/dataeggs/js/custom.js'></script>
</body>
</html>