Deleting transforms and stale files


#1

Hello,
The disk space used by the file storage on our Phabricator installation has been growing a lot lately. Over the past few days, a large number of files appeared in bulk, which made us suspect some kind of flooding attack. After investigating, it turns out that all of those files are transforms of image files. I have a couple of questions about them.

  • Why did they appear in bulk? If the transforms were not needed when the original file was uploaded, we’d rather not generate useless transforms.
  • How can we delete them? We don’t really care if the transforms are missing, as long as the original files stay. We can always regenerate transforms manually if needed. It seems easy to delete the files from the database, but then the actual files in the disk storage would stay there, and we would not reclaim any disk space.
  • Follow-up question: is it possible to check that all the files on the disk are actually indexed in Phabricator? The growth of the storage folder made me a bit suspicious and I would like to make sure that no data is actually stale. Is there a way to monitor that, or to run some garbage collection? As far as I can see, the bin/files utilities only allow one to check the integrity of disk files using the database, not the other way around.

Thank you very much in advance for your input!
Itms


#2

After taking a closer look at the files taking up the most space, I found that the image transforms were actually not that heavy.

However, a huge amount of data in the disk-based storage was stale, i.e. there were files that were not indexed in the Phabricator database.
I deleted them with this simple script: for each file in our disk storage (with a path like /path/to/storage/35/b5/8168a5be520271f6e7601edfeb93), I ran

SELECT * FROM `file` WHERE `storageHandle` = '35/b5/8168a5be520271f6e7601edfeb93'

in the prefix_file database. The vast majority of files returned zero rows, and I deleted those files.

That script deleted 31735 files amounting to 88GB of data. (!!)
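
The check described above could be sketched in Python roughly as follows. This is an illustration, not the script that was actually run: it assumes the known storageHandle values have already been exported from the file table into a set, and the function names (storage_handle, find_orphans, summarize) are made up for the example. It only lists candidates and deletes nothing.

```python
import os


def storage_handle(storage_root, file_path):
    """Return file_path relative to storage_root, with '/' separators,
    e.g. '35/b5/8168a5be520271f6e7601edfeb93'."""
    rel = os.path.relpath(file_path, storage_root)
    return rel.replace(os.sep, "/")


def find_orphans(storage_root, known_handles):
    """Yield (handle, path, size_in_bytes) for every file under storage_root
    whose handle is not in known_handles.

    known_handles is assumed to be a set of storageHandle strings exported
    from the database beforehand.
    """
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            handle = storage_handle(storage_root, path)
            if handle not in known_handles:
                yield handle, path, os.path.getsize(path)


def summarize(orphans):
    """Return (number_of_files, total_bytes) for a list of orphan tuples."""
    orphans = list(orphans)
    return len(orphans), sum(size for _handle, _path, size in orphans)
```

Reviewing the output of find_orphans before removing anything seems prudent: a file uploaded between the database export and the disk scan would wrongly show up as unindexed.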

I opened a few of the deleted files: they contained parts of big text files from our repository that are frequently updated (our translation files), and I also saw parts of other big files from our repo.
Our repository is an observed SVN repository. I believe those files were created over time when the repository was updated and when commits touched the files, but Phabricator should have deleted files from the storage when the corresponding row was removed from the database, unless the database entries were never created.

Should I open a bug report? Thank you in advance!


#3

It might be a bug, but it might also be something on your end (e.g. file permissions, manual maintenance…).
We can’t address bugs without reproduction cases, so unless you can show how a file ends up on disk without being in the DB, a bug report will not be explored. See https://phurl.io/u/bug for more details.


#4

The file permissions might be a clue: I noticed that some of the files had the phabricator user group as group owner, while others had the http server user group. Also, the permissions were not always the same (some had 644, others 664). I never set up anything myself in that regard, apart from making the storage directory writable. Is there something I should look into specifically?
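
One way to survey this is to tally group owner and permission bits over the whole storage directory. The following is a minimal Python sketch under a few assumptions: the function name tally_ownership is hypothetical, the storage root is passed in by the caller, and groups are reported as numeric IDs rather than names.

```python
import os
import stat
from collections import Counter


def tally_ownership(storage_root):
    """Count files under storage_root per (group id, permission bits).

    Permission bits are returned as an octal string such as '644' or '664'.
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            mode = format(stat.S_IMODE(st.st_mode), "03o")
            counts[(st.st_gid, mode)] += 1
    return counts
```

Each key of the returned counter is a (gid, mode) pair, so a storage directory written by two different users or with two different umasks should show up as more than one key.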

Apart from that, I can try to reproduce the problem by creating a test instance and observing our repository on it. Can I access the file storage and the database contents on a test instance? I see a Backups page on Phacility, but I’m not sure how to trigger a backup.


#5

I’d guess this is probably https://secure.phabricator.com/T4752.


#6

Yes, this is indeed the source of the problem. I would not have thought to look there, but the daemon logs do contain some Permission denied errors occurring during garbage collection.

I hope you manage to find a fix for this. In the meantime, we will do the garbage collection manually using the script above.
Thanks for the answer!