These are the words of a madman, not necessarily true or possible.

Useful SPARQL concepts

Endpoint

Individual service able to reply to SPARQL queries (e.g. tracker-store, or https://query.wikidata.org/)

GRAPH

An individual collection of RDF triples within a dataset

https://www.w3.org/TR/sparql11-query/#rdfDataset

DESCRIBE/CONSTRUCT

Query syntax to generate RDF data out of a dataset

https://www.w3.org/TR/sparql11-query/#describe

https://www.w3.org/TR/sparql11-query/#construct

LOAD

Update syntax to incorporate external resources (e.g. RDF data) into a graph in the dataset

https://www.w3.org/TR/sparql11-update/#load

SERVICE

Syntax to distribute queries across SPARQL endpoints and merge the results

https://www.w3.org/TR/2013/REC-sparql11-federated-query-20130321/#introduction

Concepts to explore

DESCRIBE/CONSTRUCT

DESCRIBE/CONSTRUCT at large scale are reasonably easy now that Tracker supports unrestricted queries

GRAPH

Tracker has very rudimentary support for graphs:

  • No two graphs may have the same triple (cardinality is global)
  • Unique indices are global too
  • The FROM/FROM NAMED/GRAPH syntax is not entirely correct

At the heart of all this is the approach used to store graph data in the database: every property has an additional *:graph column, but data from all graphs is actually merged into the same tables, under the same restrictions.

Graphs may generally be considered isolated units, so a more 1:1 approach would consist of storing each graph in an individual database, which the engine could later merge together (e.g. through https://www.sqlite.org/unionvtab.html). The additional graph management syntax from SPARQL 1.1 (CLEAR/CREATE/DROP/COPY/MOVE/ADD) might quickly fall into place with this.

Caveats/pitfalls

  • Resource IDs do need to be global; tracker reset must take this into account.

  • SQLite limitations apply; point 11 in https://sqlite.org/limits.html (the maximum number of attached databases) is relevant to this approach.
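The split could look roughly like the sketch below, using plain ATTACH plus a merging view instead of unionvtab (which stock SQLite builds may lack). The schema and file names are made up for illustration, not Tracker's actual layout:

```python
import os
import sqlite3
import tempfile

# One database file per graph, all sharing the same schema.
tmp = tempfile.mkdtemp()
paths = {graph: os.path.join(tmp, graph + ".db") for graph in ("music", "photos")}

for graph, path in paths.items():
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    db.execute("INSERT INTO triples VALUES (?, 'nie:title', ?)",
               ("urn:" + graph + ":1", graph))
    db.commit()
    db.close()

# The engine attaches the per-graph databases and merges them behind a
# view; DROP GRAPH then becomes "detach and delete one file".
main = sqlite3.connect(":memory:")
for graph, path in paths.items():
    main.execute("ATTACH DATABASE ? AS " + graph, (path,))
main.execute("""CREATE TEMP VIEW all_triples AS
    SELECT 'music' AS graph, s, p, o FROM music.triples
    UNION ALL
    SELECT 'photos' AS graph, s, p, o FROM photos.triples""")
rows = main.execute("SELECT graph, s FROM all_triples ORDER BY graph").fetchall()
print(rows)  # [('music', 'urn:music:1'), ('photos', 'urn:photos:1')]
```

With unionvtab the merging view would come for free; the attached-database limit above bounds how many graphs one connection can see at once.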

LOAD

We have most of the pieces to implement LOAD, as we already have a tracker-store DBus method that pretty much does this; it then basically becomes a language feature. However, it might benefit from graphs as described above.
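As a language feature, LOAD could be lowered onto the existing update path. A rough sketch, assuming the fetched resource has already been parsed into N-Triples lines (ntriples_to_insert() is a hypothetical helper, not actual Tracker API):

```python
# Hypothetical lowering of LOAD <uri> [INTO GRAPH <g>] onto INSERT DATA.
def ntriples_to_insert(ntriples, graph=None):
    """Wrap N-Triples lines into an equivalent INSERT DATA update."""
    triples = "\n    ".join(
        line.strip() for line in ntriples.splitlines() if line.strip())
    if graph is not None:
        return "INSERT DATA { GRAPH <%s> {\n    %s\n} }" % (graph, triples)
    return "INSERT DATA {\n    %s\n}" % triples

data = '<urn:a> <urn:p> "x" .\n<urn:b> <urn:p> "y" .'
update = ntriples_to_insert(data, "urn:backup")
print(update)
```

The real implementation would additionally need fetching, content-type negotiation, and the SILENT error-swallowing behavior from the spec.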

SERVICE

SERVICE might be possible to implement through a virtual table (https://sqlite.org/vtab.html). Tracker roughly provides this functionality through tracker_sparql_connection_remote_new(), although that connects to a specific endpoint instead of blending it into the query.

Caveats/pitfalls

  • Virtual tables have a fixed set of columns, set at construction; this might require some JIT/dynamic management of tables in TEMP/MEMORY
  • Partially resolving the local query in order to produce the most optimized remote query (e.g. providing values/ranges) seems hard. Just not doing that and letting SQLite handle it all through the virtual table sounds feasible, but slow.
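The "let SQLite handle it" route essentially amounts to materializing the remote results into a TEMP table and joining locally, roughly as in this sketch (fetch_remote() stands in for a real SPARQL-over-HTTP/DBus call; the rows are made up):

```python
import sqlite3

def fetch_remote():
    # Pretend this executed { ?u a nmm:Photo ; nie:url ?url } remotely.
    return [("urn:photo:1", "file:///a.jpg"),
            ("urn:photo:2", "file:///b.jpg")]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE local (url TEXT, data TEXT)")
db.execute("INSERT INTO local VALUES ('file:///a.jpg', 'local data')")

# Columns are fixed when the TEMP table is created, mirroring the
# "fixed set of columns set at construction" limitation of vtabs.
db.execute("CREATE TEMP TABLE service_0 (u TEXT, url TEXT)")
db.executemany("INSERT INTO service_0 VALUES (?, ?)", fetch_remote())

rows = db.execute("""SELECT s.u, l.data FROM service_0 s
                     JOIN local l ON l.url = s.url""").fetchall()
print(rows)  # [('urn:photo:1', 'local data')]
```

The slowness concern is visible here: the whole remote pattern is fetched unconditionally, even though the local join would only need one row.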

Piecing it together

Backups

An application might be able to do:

  DESCRIBE ?u
  WHERE {
    ?u a nmm:Photo ;
       nfo:belongsToContainer/nie:url 'file:///run/media...'
  }
and serialize the results into a file, which might then be loaded through:

  LOAD SILENT <file:///...>

This essentially supersedes tracker_sparql_connection_load().

Sandboxing (Option 1)

Built upon graphs as individual databases, which can be selectively exposed into the sandbox FS.

Pros

  • Allows direct readonly access within the sandbox
  • Single tracker-store, outside the sandbox
  • Minimal changes to the surrounding SPARQL support

Cons

  • All updates still have to happen through DBus
  • Beware of limits on the number of attached databases

???

  • Miners stay in the host
  • Data isolation comes from the miners; e.g. music and photos would get distinct graphs, and applications would request access to those.
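The "direct readonly access" part maps onto SQLite's mode=ro URI parameter, as in this sketch (path and schema are illustrative):

```python
import os
import sqlite3
import tempfile

# Host side: a normal, writable graph database.
path = os.path.join(tempfile.mkdtemp(), "photos.db")
host = sqlite3.connect(path)
host.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
host.execute("INSERT INTO triples VALUES ('urn:1', 'nie:title', 'pic')")
host.commit()
host.close()

# Sandbox side: only a readonly handle is handed out, so reads succeed
# while writes fail at the SQLite level.
sandbox = sqlite3.connect("file:" + path + "?mode=ro", uri=True)
row = sandbox.execute("SELECT o FROM triples").fetchone()
print(row)  # ('pic',)

write_failed = False
try:
    sandbox.execute("INSERT INTO triples VALUES ('urn:2', 'a', 'b')")
except sqlite3.OperationalError:
    write_failed = True  # "attempt to write a readonly database"
print(write_failed)  # True
```

In practice the sandbox FS exposure (e.g. a readonly bind mount) would enforce this below SQLite as well.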

Sandboxing (Option 1.5)

On top of the previous option, we could make a TrackerSparqlConnection that has a private writable store (like tracker_sparql_connection_local_new()), but can get readonly access to the global store.

Pros

  • Allows direct readonly access within the sandbox
  • Updates happen to the local private store, within the sandbox. The host data cannot be changed.
  • Minimal changes to the surrounding SPARQL support
  • tracker-extract might move within the sandbox

Cons

  • Every graph must still follow the same ontology
  • If host data is deleted (eg. tracker reset), the private database cannot be expected to be coherent.
  • Beware of limits on the number of attached databases

???

  • Data isolation comes from the miners; e.g. music and photos would get distinct graphs, and applications would request access to those.
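The private-writable/global-readonly split could be built on ATTACH, as in this sketch (paths and schema are illustrative, not Tracker's; the main connection is opened with uri=True so the mode=ro URI parameter is honored in ATTACH):

```python
import os
import sqlite3
import tempfile

# Host store, written outside the sandbox by the miners.
tmp = tempfile.mkdtemp()
host_path = os.path.join(tmp, "host.db")
host = sqlite3.connect(host_path)
host.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
host.execute("INSERT INTO triples VALUES ('urn:song:1', 'nie:url', 'file:///x.ogg')")
host.commit()
host.close()

# Private writable store inside the sandbox, with the host attached readonly.
priv = sqlite3.connect(os.path.join(tmp, "private.db"), uri=True)
priv.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
priv.execute("ATTACH DATABASE 'file:" + host_path + "?mode=ro' AS host")

# Reads may combine private and host data...
host_rows = priv.execute("SELECT s FROM host.triples").fetchall()
print(host_rows)  # [('urn:song:1',)]

# ...but writes only land on the private side; the host cannot be changed.
host_write_failed = False
try:
    priv.execute("INSERT INTO host.triples VALUES ('urn:x', 'a', 'b')")
except sqlite3.OperationalError:
    host_write_failed = True
priv.execute("INSERT INTO triples VALUES ('urn:local', 'a', 'b')")
print(host_write_failed)  # True
```

This also illustrates the coherency caveat above: if host.db is deleted or reset underneath, the private store keeps referring to resources that no longer exist.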

Sandboxing (Option 2)

Built upon SERVICE. Tracker clients get a local store; queries across endpoints are done through SERVICE, e.g.:

  SELECT ?a ?url ?d {
    SERVICE <dbus://org.freedesktop.Tracker.Miner.FS> {
      ?u a nmm:Photo ;
         nie:url ?url
    } .
    ?a foo:url ?url ;
       foo:data ?d
  }

Optionally, clients might export themselves over DBus as a SPARQL endpoint that can be queried from the outside; e.g. a hypothetical global search might do:

  SELECT ?url {
    {
      SERVICE <dbus://org.gnome.Music> {
        ?song nie:url ?url ;
              fts:match "term"
      }
    } UNION {
      SERVICE <dbus://org.gnome.Photos> {
        ?photo nie:url ?url ;
               fts:match "term"
      }
    }
  }

Data becomes fully distributed (SPARQL's vision).

Pros

  • Full freedom wrt ontologies; the sandboxed application might have custom ontologies and data, meshed together with the Tracker miners' Nepomuk data
  • Updates are all kept within the sandbox; remote endpoints being readonly follows naturally from the SPARQL syntax.

Cons

  • Settles on DBus for IPC with any other endpoint; direct access is not as straightforward.
  • Heavier SPARQL changes involved
  • Although graphs might still be used to split data, access control might be left up to the DBus layer
  • Needs some care to avoid breaking out into other endpoints from an authorized one, e.g.:

    SELECT * {
      SERVICE <dbus://org.freedesktop.Tracker.Miner.FS> {
        SERVICE <dbus://org.gnome.Photos> {
        }
      }
    }
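That care could take the form of a guard that walks the query's pattern tree before forwarding it, refusing any SERVICE nested below the first one. A sketch, where the tree shape and check_no_nested_service() are made up for illustration:

```python
# Hypothetical guard: reject SERVICE clauses that occur inside an
# already-entered SERVICE, so a client cannot pivot through an
# authorized endpoint to reach another.
def check_no_nested_service(pattern, inside_service=False):
    if pattern["type"] == "service":
        if inside_service:
            raise PermissionError("nested SERVICE not allowed")
        inside_service = True
    for child in pattern.get("children", ()):
        check_no_nested_service(child, inside_service)

# The escape attempt from the example above, as a pattern tree.
query = {"type": "select", "children": [
    {"type": "service",
     "endpoint": "dbus://org.freedesktop.Tracker.Miner.FS",
     "children": [
         {"type": "service",
          "endpoint": "dbus://org.gnome.Photos",
          "children": []}]}]}

rejected = False
try:
    check_no_nested_service(query)
except PermissionError:
    rejected = True
print(rejected)  # True
```

A real implementation would apply this server-side, on the parsed query the endpoint receives, rather than trusting the client.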

???

  • Although tracker-extract data might live within the sandbox, that would effectively lock the client into the Nepomuk ontology.

Discussion

Attic/Tracker/FuturePlans (last edited 2023-08-14 12:50:30 by CarlosGarnacho)