Business Ethics & Corporate Crime Research Universidade de São Paulo

Masked ball: Our platform is free of CSAM, we use known hash lists. Really?

Image Retrieved from: Public Domain Pictures

Author: Carolina Christofoletti

Link in original: Click here

From a compliance point of view, and as an advocate for the fight against Child Sexual Abuse Material (CSAM), every time I read something such as “Our platform is protected against CSAM. We use hash technology to detect it”, I scroll down to see how many pages there are in the document I am reading. Normally, it ends two or three paragraphs later. The hash policy is detailed a little further and then… that is it.

The argument is fallacious and, if you do not have a specialized compliance officer looking at this specific point of your policies, they will see this risk as controlled and move on to the next analysis. Subscribing to a hash list is more than sufficient for most compliance officers. Not for me, and I will tell you why. And, in case you are a businessperson reading this and want to analyse it in dollar terms, see it as a way of measuring your company’s legal risk.

Do not forget that, even if a file’s hash value was unknown at the time it was uploaded, the platform still had the duty to address this problem with proper due diligence (which requires prior pattern mapping). With those initial considerations done, let us move to the metrics.

From a computational point of view, there are two possible ways CSAM hash technology could work:

1)   Like a “Control+F” function, meaning that every time a new hash enters the database, the platform is scanned for copies of that very file, with results presented automatically.

2)   Like an antivirus, meaning that every time a new hash enters the database, the platform is scanned bit by bit for copies of that very file, where the speed with which files are found depends on how fast the algorithm reaches the place where the file is stored.

Without entering into the technical details (which depend, in short, on the platform’s computational architecture), the fact is that every time a new hash enters the database, the algorithm is expected to find, at some point, those very same files hosted somewhere on the platform.
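To make that retroactive implication concrete, here is a minimal sketch in Python of what both modes share: when a new hash joins the blocklist, the existing store is walked and every already-hosted copy of the file surfaces. All names here (STORAGE_ROOT, retroactive_scan) are illustrative assumptions, not any real platform’s API.

```python
import hashlib
from pathlib import Path

# Illustrative only: STORAGE_ROOT and these function names are assumptions.
STORAGE_ROOT = Path("/srv/platform/files")

def sha256_of(path: Path) -> str:
    """Hash a stored file in 1 MiB chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def retroactive_scan(new_hash: str) -> list[Path]:
    """When a new hash enters the blocklist, walk the existing store and
    return every file that matches it -- the copies already hosted."""
    return [p for p in STORAGE_ROOT.rglob("*")
            if p.is_file() and sha256_of(p) == new_hash]
```

In practice a platform would index hashes at upload time rather than re-reading every file, but the logic is the same: a new hash implies a look back at everything already hosted. How long that look back takes is precisely the variance this article returns to at the end.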

And from this point on, figures such as “number of known hash matches” and “number of removed files” are meaningless from a policy perspective. They are incomplete.

I will show you how to measure this in the next articles. Meanwhile, you may agree with me that there is a huge difference between the following cases:

a)    When new hashes enter the database, platforms find no matches. Matches only come later, when somebody tries to upload the file (and, at the present time, succeeds).

b)   When new hashes enter the database, platforms find many matches. That means that, were it not for the hashes, the files would hardly have been found. If this number is very high, platform safety is threatened and, apart from hashes, platforms must come up with an integrated, parallel solution to support hashing.

You may also agree with me that this will show how hash-dependent and how “safe” platforms are. The best data practice would be, then, to make the data say what matters.

Every time a platform lets a file with a known hash enter, even if only to remove it a second later, it is creating data chaos: the data that says how many times people tried (and managed) to upload that file after its hash was integrated into the database is mixed with the data that says how many times the file was uploaded before the platform had any chance of identifying it. First objection.
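To see how those two populations could be kept apart, here is a minimal sketch, assuming each match event records both the file’s upload time and the moment its hash entered the database (the field and function names are assumptions for illustration):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MatchEvent:
    file_hash: str
    uploaded_at: datetime     # when the file reached the platform
    hash_added_at: datetime   # when its hash entered the database

def split_matches(events: list[MatchEvent]) -> tuple[int, int]:
    """Separate the two populations mixed in the raw counter:
    files already hosted before their hash was known (failed policies)
    versus upload attempts made after the hash was already known."""
    pre_hash = sum(1 for e in events if e.uploaded_at < e.hash_added_at)
    post_hash = len(events) - pre_hash
    return pre_hash, post_hash
```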

A file that was uploaded 10,980 times before the very day its hash became part of the platform’s database is much more hash-dependent than one that was uploaded only a single time. Do not forget: at the time a hash match is found, the file still exists on the platform. It could be no other way. For terminology purposes, every time a hash finds a match on its own platform, that hash is to be called a CSAM hash from failed policies.

That means the file’s existence only came to light when the hash arrived. And this data is, at present, still hidden. We have no idea what its numerical value is, or what percentage of the overall problem such files represent.

How long does it take between the first upload of a file and the platform’s incorporation of its hash value? This data should also exist if the command there is: identify, index metadata (e.g. day and year of posting) and remove.
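Under the same illustrative assumptions as above, that interval is a simple subtraction once both timestamps are indexed; a sketch:

```python
from datetime import datetime, timedelta

def detection_latency(first_upload_at: datetime,
                      hash_added_at: datetime) -> timedelta:
    """Time the file circulated before the platform could recognise it.
    A large value means detection depended entirely on the hash arriving."""
    return hash_added_at - first_upload_at

# e.g. a file first posted on 2020-03-01 whose hash only arrived on 2021-01-15
print(detection_latency(datetime(2020, 3, 1), datetime(2021, 1, 15)).days)  # 320
```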

If platforms want to spread any message among criminals, it should not be “whoever uploads known files here will be reported to NCMEC” but “uploading any CSAM file here is technically impossible”.

Compliance policies must be read as a whole. If a platform writes in its CSAM policy that it uses known hashes for identification, it must be aware that the first evasion technique is called new material. From a policy perspective, announcing this technology could have a very negative impact on children themselves, as production of new material is in criminals’ interest. Never, ever should removal be made the state of the art of current CSAM policies: criminals learn there what was known or unknown, depending on how fast those technologies work. I reaffirm what I said before: the best policy would be, in my opinion, prohibition of upload.
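What prohibition of upload means in practice is checking the hash before the file is ever persisted, so a match is rejected rather than hosted and then removed. A minimal sketch, with KNOWN_CSAM_HASHES standing in for whatever hash list a platform subscribes to (all names are assumptions):

```python
import hashlib

# Stand-in for a subscribed hash list; the name is an assumption.
KNOWN_CSAM_HASHES: set[str] = set()

def handle_upload(payload: bytes) -> None:
    """Reject the upload before it is stored: the file never exists on
    the platform, so there is nothing to remove and nothing to count
    as a 'known hash match' after the fact."""
    if hashlib.sha256(payload).hexdigest() in KNOWN_CSAM_HASHES:
        raise PermissionError("upload rejected: known CSAM hash")
    store(payload)

def store(payload: bytes) -> None:
    ...  # platform-specific persistence, out of scope for this sketch
```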

The first conclusion is that, from the time a new hash is added to the platform, you would need to fragment your dataset in two: what was already there, and what people have been trying to upload since.

The second conclusion, then, is that the number of known hash matches is an incredibly informative piece of data for platforms, from a Trust & Safety point of view. At some point, after the whole “disk” has been scanned, the platform becomes “free” of that very file, and the data shown there will reveal how high the trust was, among criminals, in that very file, in the sense that they could go undetected. As removals start, that trust will slowly be undermined. Removals are, as I mentioned in another article, not the best solution.

But, mathematically, there is a variable that cannot be disregarded: how long a hash check against the platform’s already-existing data takes. I will address this in the next article.

Keep reading…