Long running import from Subversion

I’ve been trying to import a large SVN repo (approx 400k commits) for the last 3 weeks. It seems to be taking an age, and I can see lots of access activity on the SVN side, so I know Phabricator is working its way through. I’ve got a few questions on the import process to see if anyone else has had a similar issue or experience:

  • It seems to be tags that are causing the slowdown on import, specifically tags that cover > 1000 files. I guess Phabricator goes off and looks at the contents of those files in the tag. Is there a way to tell it to skip those, as we’ll never review a tagging event?

  • We’ve waited 3 weeks to import stuff we’ll never be interested in reviewing, as we’ll just be reviewing code changes from now onwards. Is there a way just to ignore the historical stuff or tell Phabricator to stop importing? I’ve seen some documentation on forcing a repo to be marked as done, but will that kill the thousands of tasks remaining in the queue/task tables?

I’d check out the help docs in ./bin/repository --help and ./bin/worker --help. Using the repository management workflow you should be able to mark the repository as imported (as you noticed already). Then, if the task queue doesn’t clear out, you can use ./bin/worker cancel (--id | --class | --min-failure-count) to cancel tasks based on one of those criteria.

Thanks @josh, I’ll check those out. I just wondered what the downsides of cancelling all the import tasks would be. Given it’s all historic data we won’t review I hoped it’d be safe to do so. We’ve a few more of a similar size to do, so don’t fancy the several weeks of import just to get started.

For info, we ended up cancelling a bunch of tasks that Phabricator couldn’t import and then forcing the repo to be imported using ./bin/repository mark-imported rXYZ. From then on all new commits were visible and it easily kept up with the ongoing activity after that.
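For anyone wanting to repeat this, here is a rough sketch of that sequence using only the two commands mentioned in this thread. The `run` and `stop_import` helpers and the class name `ExampleImportTaskClass` are made up for illustration; substitute the real task class from your own queue, and check ./bin/worker --help for the exact options on your version. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Run with DRY_RUN=0 from the Phabricator install directory to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = repository monogram (rXYZ is the placeholder used in this thread)
# $2 = worker task class to cancel (hypothetical name, replace with your own)
stop_import() {
  repo="$1"
  task_class="$2"
  # Cancel the queued import tasks of one class first...
  run ./bin/worker cancel --class "$task_class"
  # ...then force the repository into the "imported" state.
  run ./bin/repository mark-imported "$repo"
}

stop_import rXYZ ExampleImportTaskClass
```

Cancelling before marking the repo as imported follows the order we used; the earlier reply suggests marking it imported first and only cancelling if the queue doesn’t clear, which should work equally well.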

We found that when tasks were failing, the temporary files in the /tmp directory were huge (some over 500 MB in size); they were filling up /tmp and causing the task to fail. It’d then just go round the same loop again and again, retrying the task. As they were old commits, we just cancelled them, since we had no interest in reviewing them.
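If you want to check whether you’re hitting the same thing, a small helper like the one below lists oversized files under /tmp. This is only a sketch: `large_files` is a made-up name, the 500 MiB default just mirrors the sizes we saw, and it reports every large file, not only ones created by Phabricator.

```shell
# List files larger than a threshold under a directory, biggest first.
# $1 = directory to scan (default /tmp)
# $2 = threshold in MiB (default 500)
large_files() {
  dir="${1:-/tmp}"
  min_mib="${2:-500}"
  min_bytes=$((min_mib * 1024 * 1024))
  # -size +Nc matches files strictly larger than N bytes (POSIX find syntax).
  # du -k prints disk usage in KiB, so a numeric reverse sort puts the biggest first.
  find "$dir" -xdev -type f -size +"${min_bytes}c" -exec du -k {} + 2>/dev/null | sort -rn
}
```

Running `large_files /tmp 500` while an import task is retrying should show whether the temporary files are what is eating the disk.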

The lengthy commits were tagging commits. Is there an easy way to tell Phabricator to ignore tag folders if multiple exist under the same SVN repo root (e.g. a regex in the repo path to import that allows all paths except tags)?

There’s something called “Import Only” in the repository settings, but I don’t remember what it does for SVN.
Or maybe you’re supposed to add /trunk/ to the URI?

We’re set up in “Observe” mode, so we’re tracking a separate SVN install, which is why we don’t really need the previous baggage to be imported. Adding /trunk/ should work, but we have a couple of teams who have multiple /trunk/ sub-folders under the main repo root. Bad practice, so we plan to change that, but we’ll have to work out an alternative in the meantime.

Our Phabricator uses “Import Only” to create multiple observed SVN repositories in Diffusion, each imported from one trunk or branch. It works well. We only observe active SVN branches or trunks.
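As far as this thread goes, “Import Only” is given one trunk or branch path per repository, not a regex. So one way to approximate "everything except tags" is to build the list of candidate paths outside Phabricator and create one observed repository per path. A minimal sketch, assuming you already have the repository-relative paths one per line (for example from svn ls); `drop_tag_paths` and the sample paths are made up for illustration:

```shell
# Read paths on stdin, one per line, and drop any path that has
# "tag" or "tags" as a whole directory component.
drop_tag_paths() {
  grep -Ev '(^|/)tags?(/|$)'
}

# Sample layout with several projects under one SVN root.
printf '%s\n' \
  'projA/trunk/' \
  'projA/tags/1.0/' \
  'projB/trunk/' \
  'projB/branches/stage/' \
  'projB/tags/' | drop_tag_paths
```

The pattern only matches whole path components, so a directory such as staging-tags/ is kept. Whatever survives the filter is a candidate for its own “Import Only” path.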