Business Ethics & Corporate Crime Research Universidade de São Paulo

Who is winning the CSAM reporting race, tech scanners or the public? – The weird known files data

Image Retrieved from Syfy Wire

Author: Carolina Christofoletti

Link in original: Click here

According to INHOPE’s 2020 Annual Report, 60% of the content reported to their hotlines was previously known content. The first question to ask here is: known CSAM content, or content known not to be CSAM? Taking a closer look at hotlines’ annual reports, one would quickly conclude that hashing non-CSAM material would be a sensible policy to avoid duplicating, triplicating, or quadruplicating hotline efforts on files that were already assessed as not CSAM, especially because the amount of reported material is far from trivial, and filters should be applied as a strategy for concentrating efforts.
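To make that deduplication idea concrete, here is a minimal sketch of how a hotline could skip files already assessed as not CSAM. The function names, the use of SHA-256, and the in-memory set are my own illustrative assumptions, not a description of any hotline’s actual tooling.

```python
import hashlib

# Hashes of files already reviewed and cleared as not CSAM (illustrative only).
assessed_not_csam = set()

def needs_human_review(file_bytes: bytes) -> bool:
    """Return False when this exact file was already assessed as not CSAM."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return digest not in assessed_not_csam

def mark_as_not_csam(file_bytes: bytes) -> None:
    """Record a cleared file so duplicate reports are not re-assessed."""
    assessed_not_csam.add(hashlib.sha256(file_bytes).hexdigest())

# Toy usage: the second report of the same file no longer needs review.
report = b"...file bytes..."
print(needs_human_review(report))  # True
mark_as_not_csam(report)
print(needs_human_review(report))  # False
```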

That 60% is, as such, probably not the data I am looking for. If this 60% represented known, hashed CSAM files, I could use it as representative data. Without further information, this 60% known-files figure will be used for illustration purposes only, so take that into consideration while reading.

Considering that who reports varies from hotline to hotline (e.g., U.S. hotlines receive a huge volume of industry reports because every business operating in the country falls under the mandatory reporting rule, while Brazil has no such rule), it would be interesting to break this data down into a more specific dataset: how many of the public reports correspond to known CSAM files, and where, if anywhere, are those cases concentrated?

Because the CSAM policy is known for those cases, I would go to the “social media platforms” filter. When we talk about known files, the first phrase that comes to mind is hash technology. On platforms such as Facebook, Google, YouTube and others, this technology is operational at a level where, in fact, content moderators would not even have to open the file to check it again. Mathematically, it is possible to show, with a high degree of certainty, that the “flagged” file is one previously inserted into the platform’s hash database.
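As an illustration of that matching step, here is a minimal sketch. Platforms in practice rely on perceptual hashes such as PhotoDNA, which are proprietary and robust to re-encoding; the cryptographic SHA-256 below, and the in-memory database, are simplifying assumptions made only to show how a match can be established without reopening the file.

```python
import hashlib

# Hypothetical database of hashes the platform has previously assessed and stored.
known_hash_db = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def is_known(path: str) -> bool:
    """True if the uploaded file matches a hash already in the database,
    meaning no moderator needs to reopen it to confirm what it is."""
    return sha256_of(path) in known_hash_db
```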

Among the social media platform data, I would go for the data that shows only known CSAM files reported by the public and found on social media platforms. Of course, the majority of the known CSAM files reported by the public is expected to be concentrated in places where no such technology is applied, and this mapping would be a great argument for those places to join industry efforts to hamper illegal uploads. But, because what I want to measure now is the efficiency with which those hash policies operate, my proposal is precisely to filter out this weird data: cases where known hashes are found by hotlines (which hold the matching technology), reported by the public, as being hosted on hashed platforms (which hold the same technology). Hosted, because reports are made through URLs, and if they were checked and confirmed, hotline personnel could open them.

Though where I am going with this may still seem a little obscure, I want to make the data picture clear before entering the technical argument: computationally, there are two ways in which hashing algorithms could be operating, with very different results in terms of how fast all of this is being done. But I will explore that in the next article.

Meanwhile, take note that removed files do not reach hotlines, and that hotlines also operate with hashes. My question here is: how many hashed files found by hotlines are hosted on hashed platforms?
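In data terms, the filter I have in mind would look something like the sketch below. The field names, the list of hashed platforms, and the toy records are hypothetical; they only stand in for whatever structure a hotline’s report database actually has.

```python
# Hypothetical list of platforms that run hash-matching themselves.
HASHED_PLATFORMS = {"facebook.com", "instagram.com", "youtube.com", "twitter.com"}

# Toy report records with invented field names.
reports = [
    {"hash_known_to_hotline": True,  "source": "public", "host": "instagram.com"},
    {"hash_known_to_hotline": True,  "source": "public", "host": "someforum.example"},
    {"hash_known_to_hotline": False, "source": "public", "host": "twitter.com"},
]

# The "weird" dataset: known hashes, reported by the public,
# hosted on platforms that hold the same matching technology.
weird = [
    r for r in reports
    if r["hash_known_to_hotline"]
    and r["source"] == "public"
    and r["host"] in HASHED_PLATFORMS
]

print(len(weird))  # -> 1 in this toy example
```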

But this might seem like nonsense. If the platforms themselves are powered by the same technology as the hotlines, nobody would ever report such a file: platforms are so fast at removing it (some of them remove the file before its first view!) that either nobody would find the URL, because it would no longer exist when found, or hotlines would never be able to open it, because the file would already be gone by the time the hotline got there. Really?

Initially, there are at least two hypotheses to explain known hashes being reported to hotlines:

a) Those known CSAM hashes found by the public are recent ones, so that even though hotlines had them in their databases, industry did not.

b) Those known CSAM hashes found by the public are recent ones that are already included in the platform’s database, but the scanning process is much slower than we think, so it sometimes takes longer to find a hash match through technology than to have the file reported by someone (a sketch of how these two cases might be separated follows below).
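Here is that sketch: a minimal, assumption-laden way of splitting known-hash public reports into hypotheses (a) and (b), assuming one could learn when the platform added each hash to its own list. The field names and dates are invented for illustration.

```python
from datetime import datetime

def classify(report: dict, platform_hash_added: dict) -> str:
    """Split a known-hash public report into hypothesis 'a' or 'b'.

    report: {'hash': str, 'reported_at': datetime}
    platform_hash_added: hash -> datetime when the platform added it to its list.
    """
    added = platform_hash_added.get(report["hash"])
    if added is None or added > report["reported_at"]:
        return "a"  # the hotline knew the hash, the platform did not (yet)
    return "b"      # the platform already had the hash, yet scanning lost the race

# Toy usage with invented dates.
platform_hash_added = {"abc123": datetime(2020, 3, 1)}
print(classify({"hash": "abc123", "reported_at": datetime(2020, 4, 2)}, platform_hash_added))  # 'b'
print(classify({"hash": "def456", "reported_at": datetime(2020, 4, 2)}, platform_hash_added))  # 'a'
```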

As a pertinent observation, 60% is an absurdly high number. In principle, those known files do not reach hotlines, precisely to avoid duplication. The files are removed, and a notice of removal is shared with the proper authority where mandatory reporting applies. If not, the files will be removed, whether or not they are known to the existing hash lists, and neither hotlines nor any other public authority will ever know that the platform in question hosted them, unless a mirror is found. Mirrors are another fascinating topic, one which deserves a separate article.

The Brazilian hotline receives only public reports. Though this is the ideal dataset for what I want to measure, the known-hashes information is still missing. SaferNet Brasil, the Brazilian hotline in question, does hold this data, so a follow-up to this analysis would be possible.

Considering only the hashed platforms, and taking for granted that SaferNet Brasil receives only public reports (Facebook, Instagram, YouTube, Twitter and others report to NCMEC), that hotlines deal only with illegal files, and that they do not remove anything related merely to violations of the platforms’ Terms of Service (which is the platforms’ own decision), have a look now at these numbers:

SaferNet Brasil 2020 (Indicators SaferNet Brasil, data collected on Sunday 16th May 2020, 10 a.m. Brasília time), top removals:

1. Instagram.com: 4080
2. Twitter.com: 1442
3. YouTube.com: 1391
4. —: 664 (pornography website)
5. —: 502 (top-level domain hopping)
6. —: 408 (top-level domain hopping)
7. facebook.com: 316
8. —: 252 (known URL from previous reports)
9. —: 242 (top-level domain hopping)
10. —: 195 (pornography website)

The police are also not reporting to the Brazilian hotline, since removal here is a court procedure and the notices are sent directly by law enforcement personnel.

So, the public found it, SaferNet Brasil assessed it, and the platforms removed it. Instagram is hashed; so are Twitter, Facebook and YouTube.

Remember that Brazil has 2008 legislation based on the United Nations model, and that files here still depend on the “genital focus with sexual intention” and “sexual activity with” criteria. I am not entering that realm now, but I can already say that, given the variety of material found, this classification is insufficient. So, things removed under Brazilian legislation are expected to be, at the very least, material on which international legislation agrees (baseline material).

What is happening in Brazil? I am afraid Brazilians are now specialists in finding new material, but at least some of the files that SaferNet removed are likely known, hashed ones.

With that, we open the road for further assessment: filter this known-hash data, take note of where it is located, and then go looking for the next data point, the time gap between hash insertion and hash detection, flagging the cases where the public arrives first. Separate that dataset.
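As a closing sketch of that next measurement, again on purely hypothetical fields: for each known-hash case, compute the gap between the moment the hash entered the database and the moment the file was found, and keep the cases where the public got there before the scanners.

```python
from datetime import datetime

# Toy known-hash cases with invented fields: when the hash entered the database,
# when the file was found, and whether a scanner or a member of the public found it.
records = [
    {"hash": "abc123", "inserted_at": datetime(2020, 3, 1), "found_at": datetime(2020, 3, 9), "found_by": "public"},
    {"hash": "def456", "inserted_at": datetime(2020, 3, 5), "found_at": datetime(2020, 3, 6), "found_by": "scanner"},
]

# Cases where the public arrived first, with the gap in days between
# hash insertion and the moment the file was actually found.
public_first = [
    (r["hash"], (r["found_at"] - r["inserted_at"]).days)
    for r in records
    if r["found_by"] == "public"
]

print(public_first)  # -> [('abc123', 8)]
```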

To be continued…