Analysing Debian packages with Neo4j – Part 2 UDD and Graph DB Schema
In the first part of this series of articles on analyzing Debian packages with Neo4j we gave a short introduction to Debian and the life time and structure of Debian packages.
The current second part first describes the Ultimate Debian Database UDD and how to map the information presented here from the UDD into a Graph Database by developing the database scheme, that is the set of nodes and relations, together with their attributes, from the inherent properties of Debian packages.
The next part will describe how to get the data from the UDD into Neo4j, give some sample queries, and discuss further work.
The Ultimate Debian Database UDD
The Ulimate Debian Database UDD
gathers a lot of data about various aspects of Debian in the same SQL database. It allows users to easily access and combine all these data.
Data currently being imported include: Packages and Sources files, from Debian and Ubuntu, Bugs from the Debian BTS, Popularity contest, History of uploads, History of migrations to testing, Lintian, Orphaned packages, Carnivore, Debtags, Ubuntu bugs (from Launchpad), Packages in NEW queue, DDTP translations.
Debian Wiki
Collection all these information and obviously having been grown over time, the database exhibits a highly de-normalized structure with ample duplication of the same information. As a consequence, reading the SQL code fetching data from the UDD and presenting them in a coherent interface tends to be highly convoluted.
This lets us to the project of putting (parts) of the UDD into a graph database, removing all the duplication on the way and representing the connections between the entities in a natural graph way.
Developing the database schema
Recall from the first column that there are source packages and binary packages in Debian, and that the same binary package can be built in different versions from different source packages. Thus we decided to have both source and binary packages as separate entities, that is nodes of the graph, and the two being connected via a binary relation builds.
Considering dependencies between Debian packages we recall the fact that there are versioned and unversioned dependencies. We thus decide to have again different entities for versioned source and binary packages, and unversioned source and binary packages.
The above considerations leads to the following sets of nodes and relations:
vsp -[:is_instance_of]-> sp
vbp -[:is_instance_of]-> bp
sp -[:builds]-> bp
vbp -[:next]-> vbp
vsp -[:next]-> vsp
where vsp stands for versioned source package, sp for (unversioned) source package, and analog for binary packages. The versioned variants carry besides the name attribute also a version attribute in the node.
The relations are is_instance_of between versioned and unversioned packages, builds between versioned source and versioned binary packages, and next that defines an order on the versions.
An example of a simple graph for the binary package luasseq which has was originally built from the source package luasseq but was then taken over into the TeX Live packages and built from a different source.
Next we want to register suites, that is associating which package has been included in which release of Debian. Thus we add a new node type suite and a new relation contains which connects suites and versioned binary packages vbp:
Nodes of type suite only contain one attribute name. We could add release dates etc, but refrained from it for now. Adding the suites to the above diagram we obtain the following:
Next we add maintainers. The new node type mnt has two attributes: name and email. Here it would be nice to add alternative email addresses as well as alternative spellings of the name, something that is quite common. We add a relation maintains to versioned source and binary packages only since, as we have seen, the maintainership can change over the history of a package:
mnt -[:maintains]-> vsp
This leads us to the following graph:
This concludes the first (easy) part with basic node types and relations. We now turn to the more complicated part to represent dependencies between packages in the graph.
Representing dependencies
For simple dependencies (versioned or unversioned, but no alternatives) we represent the dependency relation with two attributes reltype and relversion specifying the relation type (<<, <=, ==, >=, >>) and the version as string. For unversioned relations we use the reltype=none and relversion=1:
vbp -[:depends reltype: TYPE, relversion: VERS]-> bp
Adding all the dependencies to the above graph, we obtain the following graph:
Our last step is dealing with alternative dependencies. Recall from the first blog that a relation between two Debian packages can have alternative targets like in
We treat these kind of dependencies by introducing a new node type altdep and a new relation is_satisfied_by between altdep nodes and versioned or unversioned binary packages (vbp, bp).
The following slice of our graph shows binary packages pmx which has alternative dependencies as above:
Summary of nodes, relations, and attributes
Let us summarize the node types, relations types, and respective attributes we have deduced from the requirements and data in the Debian packages:
Nodes and attributes
- mnt: name, email
- bp, sp, suite, altdeps: name
- vbp, vsp: name, version
Relations and attributes
- breaks, build_conflicts, build_conflicts_indep, build_depends, build_depends_indep, conflicts, depends, enhances, is_satisfied_by, pre_depends, provides, recommends, replaces, suggests
Attributes: reltype, relversion - builds, contains, is_instance_of, maintains, next: no attributes
Next is …
Now that we have the graph database schema set up, we need to pull data from the UDD and put them into the Graph database. This will be discussed in the next entry of this series.