USP - Universidade de São Paulo

The disintegration of the persistence of (CSAM) hashes: Review it!

The disintegration of the persistence of Memory, Salvador Dalí, Oil on Canvas

Author: Carolina Christofoletti

In a previous article (which you can read here and here), I started talking about a very specific dataset, which I call hash-dependent files, that could be very insightful when deciding what to do with the huge number of files that reach platforms and that are, one way or another, identified through this very same technology: hash values.

Even though I am aware that hash values are not the only technology currently employed by the tech giants in the fight against CSAM, they interest me because they are crawling over a Trust & Safety gap. As I mentioned before, even though known CSAM hashes can, in fact, prevent known files from being posted on the platform, they cannot evade the other conclusion: any hash database crawls not only the posting section but the platform as a whole. That means, briefly, that the same technology that recognizes a criminal file being uploaded to the platform also recognizes the files that the platform has been hosting, probably for a long time.

Even if my argument today is about why platforms should, first of all, prevent known hashes from being uploaded (read more about it here), and about a way to measure that which aims, like everything, to come with a proposed solution, I must make some further considerations before following the argument.

First, even if platforms stop letting known hashes be uploaded, the retroactively crawled hashes will still exist. The problem here is that CSAM files usually reach platforms before they are hashed. Another point is “who manages this database” and “with what classification”, where I would argue that 1. Known violations of a platform's terms of service must be hashed, even if they do not constitute a legal violation. 2. Compared with the worldwide amount of CSAM circulating at present, and given that hashes are not being extracted everywhere, the technology helps, but it depends, at the end of the day, on the question “how complete is your database”, which does not mean that it is congruent (criminal files only).

A last one is a problem that I still keep thinking about, for it involves an engineering complexity: collages of any kind. If you are familiar with hash technology, you will know that any such modification changes the hash value completely. If you do not live in the bubble of CSAM policies, you will nevertheless recognize that there already exists a place where industry is dealing very well with that: the antivirus industry. And things may start to get more insightful if you add to the syllogism the very way in which files are stored. Having already said too much, we stay with the old hashes, bearing in mind, however, that there is a point to think about here.
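To make the point about modified files concrete, here is a minimal sketch. It assumes a cryptographic hash such as SHA-256 (the article does not say which hash function any platform uses), and the byte strings are invented placeholders, not real files. It only shows that a one-byte change is enough to produce an unrelated digest, so an exact-match lookup no longer finds the file.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical stand-ins for a file and a slightly edited copy of it.
original = b"example file contents" * 100
edited = original + b"\x00"  # a single extra byte

h1 = sha256_hex(original)
h2 = sha256_hex(edited)

print(h1)
print(h2)
print(differing_hex_digits(h1, h2), "of", len(h1), "hex digits differ")
```

Under that assumption, a database of known digests would treat the edited copy as a completely unknown file.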

Known hashes. How do we know how fast platforms are finding them after the hash value enters the platform? Keep in mind that this question is very different from the question of how fast platforms find uploaded files after recognizing their matches at the moment of upload. We shall not poison the derivation chain here: the fact that the platform knows where the file is changes absolutely everything.

You may argue with me: nice, but if we take for granted how servers, apps and everything else have been organized from the beginning (and the benefit of non-relational databases in this field), we will conclude that this does not change anything at all, so long as things are crawlable (like the Control+F function you use when looking for specific words in this text). But the crawlability of everything depends on how things are organized on the platform, and I honestly have no idea how relational or non-relational databases distribute things on any given platform.

What I know, however, is that things are measurable. And measurable in a way that yields two very different graphs. Either things work like Control+F and hash matches are identified immediately, so that the graph shows peaks, or things work like an actual hash check (like your antivirus, for example) and matches are spread over time according to the partitions where they are found, so that they are not uniformly distributed.

But analysing “time-tables” is something that forensic people, as well as antivirus companies, learnt a long time ago. Depending on where the malware is, everything changes. The same goes for the CSAM file.

We come back, then, to my previous article, where I argued that there is something wrong going on when hotlines sometimes get to hashed files before platforms do. True, I do not know this metric, because this data does not exist, at least publicly, anywhere. But how can we justify that? Hotlines, do not forget, point to the location, while hash checks are still looking for the location.

As data on the time gap between the day, hour, minute and second a hash value entered the database and the day, hour, minute and second its first, second, third and subsequent matches were found does not exist, this looks like weird, wrong data. It is not. It is a simple matter of incompleteness that, at the end of the day, makes platforms comfortable with using a hash value and… that is it! Let us focus now on fake news or whatever! (read about it here)
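The metric described above could be computed along the following lines. This is only a sketch under stated assumptions: that a platform records one timestamp for when a hash entered its database and one timestamp per match. The function name `match_gaps` and all timestamps are hypothetical; no such public dataset is known to me.

```python
from datetime import datetime, timedelta


def match_gaps(hash_added_at: datetime, match_times: list[datetime]) -> list[timedelta]:
    """Gap between the hash entering the database and each match, first match first."""
    return [found_at - hash_added_at for found_at in sorted(match_times)]


# Invented example: one hash and three matches found at different times.
added = datetime(2021, 3, 1, 12, 0, 0)
matches = [
    datetime(2021, 3, 4, 9, 30, 0),
    datetime(2021, 3, 1, 12, 0, 5),
    datetime(2021, 3, 2, 12, 0, 0),
]

gaps = match_gaps(added, matches)
for rank, gap in enumerate(gaps, start=1):
    print(f"match {rank}: found {gap.total_seconds():.0f} seconds after the hash was added")
```

Plotting such gaps for many hashes would show whether matches cluster right after the hash is added or spread out over time.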

And in fact, things to improve still exist… if only we accept the challenge of hearing what our neighbours are doing next door. For me, the two keys are called a) cybersecurity and b) counterterrorism. I must say it is kind of disappointing to see… how rarely a), b) and c), the CSAM industry, come together… yet their insights are shared.

The Hungarian hotline derived (read it here), in an organized action that deserves a real standing ovation, a CSAM club in cooperation with a Counter-Terrorism agency. But keep the hotlines in sight: the data analysis was done there. Oh yes, things derive, most probably, from a dataset that has great potential to be seen as rubbish: the multiplied efforts of… duplicates. Watch out! AviaTor shall come here as the revolution everyone (including researchers, as they also wait for the UK Online Safety Bill) was waiting for: a unique opportunity for a complete, big-picture global analysis. Law enforcement data is actually lost, but without industry cooperation, we fragment things too (read about it here and here).

And the hash value… where does it come from? To this day, it is used to check file integrity… in the cybersecurity industry. Watch out, please!
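For readers unfamiliar with that use, a file integrity check can be sketched as follows. It assumes SHA-256 and a published reference digest. The helper names `file_sha256` and `is_intact` and the temporary file are illustrative only.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected_hex: str) -> bool:
    """True if the file's digest equals the published reference digest."""
    return file_sha256(path) == expected_hex.lower()


# Demo with a throwaway file standing in for a downloaded artifact.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

published = hashlib.sha256(b"hello").hexdigest()
print(is_intact(path, published))  # True: the file is unchanged

with open(path, "ab") as f:
    f.write(b"!")  # simulate corruption or tampering
print(is_intact(path, published))  # False: the digest no longer matches

os.remove(path)
```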

To think about.