Methodological considerations about the universal CSAM hash database as a training set
Author: Carolina Christofoletti
As with steganography, the fact that we cannot see what a message encoded between the lines says does not mean that there is no message at all. My effort with you today is to read, once again, between the encoded lines of the so-called known Child Sexual Abuse Material (CSAM) hash metrics, turning metrics into policies.
On this occasion, I would like to keep expanding the “big picture” in which those hash metrics are located, so that it reaches not only my sample but the hash-powered industry as a whole.
This is not a standalone piece, but part of a sequence of articles that aims, with the finish line always on the horizon, to identify how new CSAM files “travel” imperceptibly through very different social media platforms, and to what degree this derives from a “map and remove” view of everything, in which decomposition is dismissed as just another methodological complication.
What matters, the argument goes, is that hashed CSAM files are removed and the database is updated. And the new files… well, the new files are a problem. That is what industry says. Mirrors always have their blind spots, and so do mirrored strategies. So I will decompose things for you, working where things stop: at the new CSAM files, but through the scars they leave in the “known CSAM hashed material”.
Suppose we want hash and removal metrics to do more than support the argument of platforms that claim to be trying harder than ever to deal with these illegal materials (even while research points as transparent as those I am about to show you remain untouched in the rhetorical paradise). Suppose we mainly want them to push algorithmically improved policies in the real world. Then we need to decompose those metrics. Here we go.
· Methodological Fallacy?
Have you ever asked yourself how artificial intelligence is trained for CSAM detection purposes on online platforms? If so, have you ever arrived at an answer that looked something like “with known CSAM (mixed) databases” or “with adult pornography”? If so, keep track of that, because this is where things get stuck.
At present, I see two problems with metrics that are treated as reliable because the files behind them are AI-identifiable: hash-dependent files are being ignored, and platforms seem to use a common strategy without taking into account what their own “settings deviations” are.
The first premise here is this: if a hash-based technology is used as a compliance standard for industry in the fight against CSAM (questionable or not, it is a fact), then necessarily materials reappear somewhere else, and the data (hash value) collected on one online platform can, by the very fact of having been collected there, validate the standard on another.
Makes sense, doesn’t it? One wants CSAM coming from Facebook to be identifiable by YouTube. True. But we also want things to be more agile, so that YouTube does not become dependent on a) this file appearing on Facebook in order to identify it on YouTube and b) Facebook’s ability to find it. Highlight this. We do not want new CSAM to be hash dependent (read about hash dependent files here).
Even though “reappearance” is a common fact, the environmental conditions in which it happens are not. Keep track and follow the argument.
· External-intelligence dependent data (or hash dependent files)
I have already mentioned elsewhere (read it here) that there is a point where known and new CSAM hashes converge. This very point (call it point A), even though it appears only once (at the first hash check), is a graphically translucent representation of where platform policies have failed. If we are working with AI-identifiable data, this is our dataset.
In retrospect, every time Facebook, for example, goes through a first hash check, we have a dataset that can be called old-new hashed CSAM material: CSAM files that survived out of sight on the platform until the moment they were hashed. Transparency report readers, keep track of that, because it also poisons the metrics (you can read more about how to read those reports here). In the hashed material case, the number of AI-found matches may vary with the size of the database and the number of viral files in that condition.
And maybe the reason why platforms do not distinguish between files removed after hashing and files removed after AI identification is that readers might be scared by what they would discover in that data. By logic, hashed files are always older than their AI-identified counterparts, since hashes tend to arrive faster than policy changes in AI metrics.
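To make the distinction above concrete, here is a minimal sketch in Python of how removals could be counted separately per detection channel, together with how long the files in each channel stayed online before removal. Everything in it is an assumption made for illustration: the `Removal` record, the channel labels `hash_match` and `ai_classifier`, and the sample dates are hypothetical. The records hold only opaque identifiers and dates, not real platform data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Removal:
    file_id: str    # opaque identifier, never the content itself
    channel: str    # hypothetical labels: "hash_match" or "ai_classifier"
    uploaded: date  # day the file appeared on the platform
    removed: date   # day the file was taken down


def removals_by_channel(removals):
    """Count removals separately for each detection channel."""
    return Counter(r.channel for r in removals)


def median_days_unseen(removals, channel):
    """Median number of days that files of one channel stayed online."""
    days = [(r.removed - r.uploaded).days for r in removals if r.channel == channel]
    return median(days) if days else None


# Made-up sample: two "old-new" files caught by a first hash check,
# one file caught by the classifier shortly after upload.
sample = [
    Removal("a", "hash_match", date(2021, 1, 1), date(2021, 3, 1)),
    Removal("b", "hash_match", date(2021, 1, 10), date(2021, 3, 1)),
    Removal("c", "ai_classifier", date(2021, 2, 27), date(2021, 3, 1)),
]

counts = removals_by_channel(sample)
hash_median = median_days_unseen(sample, "hash_match")
ai_median = median_days_unseen(sample, "ai_classifier")
```

Reported side by side, the two medians would show what a single aggregated "removed" figure hides: how long the hashed files survived out of sight.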
· External Dataset Decomposition
Mixed databases bias algorithmic training when they feed the only AI algorithm in place. There should be two: a “baseline” one, trained with known CSAM files, and a “deviation” one, trained with CSAM hashes from platform-specific failed policies (meaning what the AI and the hash system have missed). Policies should then derive from crossing these two sets of data.
And, in order to keep track of where the poisoning occurs, we must separate newly hashed CSAM material coming from Facebook alerts from newly hashed CSAM material coming from YouTube, and so on, even though Facebook and YouTube share the same CSAM hash database.
Even more interesting than AI-identified files are the CSAM files identified by the public and later hashed (you can read more about it here). We need them to be platform specific too. We may expect the found-by-the-public CSAM dataset to vary greatly among platforms. After all, as long as online platforms keep contributing different hashes to the CSAM hash databases, they are finding new, different things.
At the end of the day, it is the public who performs the error control on those algorithms, pointing out where they fail. For that reason, public reports belong (or should belong) in a separate dataset named “failing points”.
This does not mean that a homogeneous algorithm is impossible, but that one must keep very special track of error reports, which here can only be the public reports. In other words: that often ignored 1% of removed CSAM files that no one talks about, and that are in fact correction metrics.
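The decomposition described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HashEntry` record, the `found_by` labels (`known`, `ai`, `public_report`) and the platform names are hypothetical, and a real shared hash database may not expose this provenance at all.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HashEntry:
    hash_id: str   # opaque hash value only
    platform: str  # platform that contributed the hash (hypothetical names)
    found_by: str  # hypothetical labels: "known", "ai" or "public_report"


def decompose(entries):
    """Split a mixed hash database into a baseline list and per-platform sets."""
    baseline = []                                          # already known material
    per_platform = defaultdict(lambda: defaultdict(list))  # platform -> source -> hashes
    for entry in entries:
        if entry.found_by == "known":
            baseline.append(entry.hash_id)
        else:
            per_platform[entry.platform][entry.found_by].append(entry.hash_id)
    return baseline, per_platform


# Made-up sample entries.
entries = [
    HashEntry("h1", "platform_a", "known"),
    HashEntry("h2", "platform_a", "ai"),
    HashEntry("h3", "platform_a", "public_report"),
    HashEntry("h4", "platform_b", "public_report"),
]

baseline, per_platform = decompose(entries)

# The "failing points" dataset: what only the public caught, kept per platform.
failing_points = {p: sources["public_report"] for p, sources in per_platform.items()}
```

Keeping `failing_points` keyed by platform is the design choice that matters here: merged into one list, the entries could no longer show where each detection system failed.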
· Internal Dataset Decomposition
Point one, then, is that one must keep track of where this material was first found and whether Artificial Intelligence (AI) or a public report found it. The reason, put more simply, is this: if Facebook hosted material that was first found by YouTube’s AI policies (and found it only later, through a hash check or following a public report), then Facebook’s AI system missed a feature that YouTube’s had already identified.
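As a rough sketch of that bookkeeping, assume each hash carries a record of where and how it was first found, and each hash-based removal records the platform where it happened. The function name, the labels and the platform names below are all hypothetical. The code counts how often a platform removed, through a hash match, a file that another platform's AI had been the first to find.

```python
from collections import Counter


def cross_platform_misses(origins, hash_removals):
    """Count hash-match removals whose hash was first found by another platform's AI.

    origins:       dict mapping hash_id -> (platform, found_by)
    hash_removals: iterable of (platform, hash_id) removed through a hash match
    Returns a Counter keyed by (removing_platform, originating_platform).
    """
    misses = Counter()
    for platform, hash_id in hash_removals:
        origin = origins.get(hash_id)
        if origin is None:
            continue  # provenance unknown, nothing can be concluded
        origin_platform, found_by = origin
        if found_by == "ai" and origin_platform != platform:
            misses[(platform, origin_platform)] += 1
    return misses


# Made-up provenance and removal records.
origins = {
    "h1": ("platform_b", "ai"),
    "h2": ("platform_b", "public_report"),
    "h3": ("platform_a", "ai"),
}
hash_removals = [
    ("platform_a", "h1"),
    ("platform_a", "h2"),
    ("platform_a", "h3"),
    ("platform_a", "h1"),
]

misses = cross_platform_misses(origins, hash_removals)
```

In this sample, only the two removals of `h1` count as misses for `platform_a`: `h2` came from a public report, and `h3` was found by `platform_a` itself.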
But industry shares, most of the time, common APIs and, who knows, perhaps common AI technology too. True. And here you poison the blind spots, just as you poison a generalizing argument that lacks further comparisons. If everyone is using the same AI algorithm, the conclusions can only be relative. Of course, if all one has is hash policies, platforms with brand-new CSAM material will look “safer” than their hash-hosting colleagues. Once you discover that, the numbers lose their original value, as new material keeps existing… somewhere in a steganographically hidden cipher.
· Trends
Maybe new material has the same characteristics as old, known hashed CSAM material, and the previous dataset suffices. Maybe. But things change according to trends that are obscured in giant datasets, where they lose, at the end of the day, their power to trigger immediate intervention. If a very specific CSAM category has been found and hashed over the last 3 months, 3 months of images is still too little within the history of a hash dataset. By the time the AI system gets to know the trend, the trend is already gone. Hence the importance of observing variation in real time.
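One simple way to keep a trend from drowning in the full history is to compare each category's share of recently hashed entries with its share of the whole dataset. The sketch below assumes that every hash entry carries a hashing date and a category label, which is a hypothetical schema. The 90-day window and the factor of 2 are arbitrary illustrative thresholds, not recommendations.

```python
from collections import Counter
from datetime import date, timedelta


def category_shares(records, since=None):
    """Share of each category among records hashed on or after `since`."""
    counts = Counter(cat for day, cat in records if since is None or day >= since)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def emerging_categories(records, today, window_days=90, ratio=2.0):
    """Categories whose recent share is at least `ratio` times their overall share."""
    recent = category_shares(records, since=today - timedelta(days=window_days))
    overall = category_shares(records)
    return sorted(cat for cat, share in recent.items() if share >= ratio * overall[cat])


# Made-up records: (date hashed, category label).
records = (
    [(date(2019, 1, 1), "cat_x")] * 8
    + [(date(2021, 5, 1), "cat_y")] * 2
    + [(date(2021, 5, 2), "cat_x")]
)

trending = emerging_categories(records, today=date(2021, 6, 1))
```

In this sample, `cat_y` is only 2 of 11 entries overall but 2 of the 3 entries in the last 90 days, so it is flagged, while the historically dominant `cat_x` is not.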
I am sharing with you some initial insights on methodological research design, to see how we can guarantee that the conclusion we are looking for is not only unbiased but also methodologically unpoisoned.
To get us started… think about it! Keep the horizon in sight.