Debian packages as Graph Database
I have been playing around with Neo4j as a graph database, and while searching for a big dataset I decided to look at Debian packages (source and binary) from stable, testing, sid, and experimental, and represent all of that in one big graph database.
While this is far from ready, the following entities and relations are represented:
- source packages, unversioned and versioned
- binary packages, unversioned and versioned
- maintainers
- all dependencies, including alternatives and versioned dependencies
- relations like maintains, builds, etc
- suites (stable, testing, sid, experimental)
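To give an idea of what "alternatives and versioned dependencies" means for the graph, here is a minimal sketch of how a Depends-style field could be split up. It is not the script I actually use: the names `Dep` and `parse_depends` are made up for this illustration, and architecture restrictions, build profiles, and the deprecated `<` and `>` relations are simply ignored.

```python
import re
from typing import List, NamedTuple, Optional


class Dep(NamedTuple):
    """One possible target of a dependency (hypothetical helper type)."""
    name: str
    relation: Optional[str]  # one of <<, <=, =, >=, >> or None
    version: Optional[str]


# Package name, optional ":arch" qualifier, optional "(relation version)".
# Anything after that (e.g. "[amd64]") is not looked at in this sketch.
_DEP_RE = re.compile(
    r"^\s*(?P<name>[a-z0-9][a-z0-9+.\-]*)(?::[a-z0-9\-]+)?\s*"
    r"(?:\(\s*(?P<rel><<|<=|=|>=|>>)\s*(?P<ver>[^)\s]+)\s*\))?"
)


def parse_depends(field: str) -> List[List[Dep]]:
    """Split a dependency field into groups of alternatives.

    Commas separate dependencies that must all be satisfied; within
    one dependency, "|" separates the alternatives.
    """
    groups = []
    for clause in field.split(","):
        if not clause.strip():
            continue
        alternatives = []
        for alt in clause.split("|"):
            m = _DEP_RE.match(alt)
            if m is None:
                raise ValueError(f"cannot parse dependency: {alt!r}")
            alternatives.append(Dep(m.group("name"), m.group("rel"), m.group("ver")))
        groups.append(alternatives)
    return groups
```

Each inner list with more than one entry is a set of alternatives, which is why a plain "depends" edge between two packages is not enough to model them.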
The graph currently has 220618 nodes and 782323 edges. My first attempt at importing this into the database was to generate one long Cypher statement and throw it at cypher-shell. Well, that was not the best idea. After 24 hours I stopped the process and rewrote the generation script to produce CSV files instead. Using neo4j-import, the same amount of data was imported in 5 seconds (!!!).
What I would like to get in the future is the whole package history as well, and maybe also include all the bugs in the database … if only I had easily accessible and parseable information about these items (Debian Q&A maybe?). If you have any suggestions, please let me know.
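For illustration, the CSV route can be sketched roughly as follows. This is not my actual generation script: the label `BinaryPackage`, the relation type `DEPENDS`, and the function names are assumptions made up for the example. The header fields `:ID`, `:LABEL`, `:START_ID`, `:END_ID`, and `:TYPE` are the ones the Neo4j import tool looks for.

```python
import csv
import io


def write_nodes(out, package_names, label="BinaryPackage"):
    """Write one node per package; `label` is a hypothetical node label."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name:ID", ":LABEL"])
    for name in package_names:
        writer.writerow([name, label])


def write_relationships(out, edges):
    """Write (start, end, type) triples as a relationship file."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for start, end, rel_type in edges:
        writer.writerow([start, end, rel_type])


# Tiny in-memory demo; a real run would write to files on disk.
nodes_csv = io.StringIO()
rels_csv = io.StringIO()
write_nodes(nodes_csv, ["hello", "libc6"])
write_relationships(rels_csv, [("hello", "libc6", "DEPENDS")])
print(nodes_csv.getvalue())
print(rels_csv.getvalue())
```

The resulting files are then handed to the import tool in one go (with its `--nodes` and `--relationships` options; the exact invocation depends on the Neo4j version), which builds the store in bulk rather than running each statement through the query engine.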
More to come, stay tuned.
Nice! As for the additional details you’d like to include, perhaps you can consult the Ultimate Debian Database.
Hi Aaron,
Thanks for your suggestion. I stumbled over UDD recently and have already rewritten the scripts to use data from it 😉 and I want to include more and more parts of it over time. What is a pain are the not very nice output formats of psql: the unaligned output format is the best, but it seems there is no quoting or other way to escape multi-line fields.
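One possible way around the multi-line problem, assuming the data can be exported as CSV rather than in the unaligned format (PostgreSQL's `COPY ... TO STDOUT WITH (FORMAT csv, HEADER)` does this, and newer psql versions have a `--csv` switch), is to let a CSV parser deal with the quoting. A small sketch in Python; the column names and the sample row are invented for the example:

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into dicts; quoted fields may span several lines."""
    # newline="" keeps embedded line breaks exactly as they were exported.
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))


# Invented sample: the description field contains a line break.
sample = (
    "package,description\n"
    'hello,"The classic greeting\nand a good example"\n'
)

for row in read_rows(sample):
    print(row["package"], "->", repr(row["description"]))
```

Because the multi-line field is quoted, it comes back as a single value instead of breaking the row apart.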