Querying Distributed RDF Data Sources with SPARQL

Bibliographical Metadata
Subject: Querying Distributed RDF Data Sources
Year: 2008
Authors: Bastian Quilitz, Ulf Leser
Venue: ESWC
Content Metadata
Problem: SPARQL Query Federation
Approach: Decompose a query into sub-queries, each of which can be answered by an individual service.
Implementation: DARQ
Evaluation: Evaluate the performance of the DARQ query engine.

Abstract

DARQ provides transparent query access to multiple SPARQL services, i.e., it gives the user the impression of querying a single RDF graph even though the data is actually distributed across the web. A service description language enables the query engine to decompose a query into sub-queries, each of which can be answered by an individual service. DARQ also uses query rewriting and cost-based query optimization to speed up query execution.
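The decomposition step described in the abstract can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each service description simply lists the predicates an endpoint can answer; the endpoint URLs, the `SERVICE_DESCRIPTIONS` table, and the `decompose` function are hypothetical and are not part of DARQ's actual API.

```python
from collections import defaultdict

# Hypothetical service descriptions: endpoint URL -> predicates it can answer.
SERVICE_DESCRIPTIONS = {
    "http://example.org/sparql/people": {"foaf:name", "foaf:mbox"},
    "http://example.org/sparql/films": {"dbo:director", "rdfs:label"},
}


def decompose(triple_patterns, descriptions):
    """Group triple patterns into one sub-query per capable endpoint."""
    sub_queries = defaultdict(list)
    for s, p, o in triple_patterns:
        # Source selection: keep every endpoint that declares the predicate.
        capable = [ep for ep, preds in descriptions.items() if p in preds]
        if not capable:
            raise ValueError(f"no service can answer predicate {p!r}")
        for ep in capable:
            sub_queries[ep].append((s, p, o))
    return dict(sub_queries)


# A two-pattern query whose patterns live on different endpoints.
query = [
    ("?film", "dbo:director", "?person"),
    ("?person", "foaf:name", "?name"),
]
plan = decompose(query, SERVICE_DESCRIPTIONS)
```

Here `plan` maps each endpoint to the triple patterns it should receive. A real engine would additionally have to combine the results of patterns that several endpoints can answer, which this sketch does not attempt.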

Conclusion

DARQ offers a single interface for querying multiple, distributed SPARQL endpoints and makes query federation transparent to the client. One key feature of DARQ is that it relies solely on the SPARQL standard and is therefore compatible with any SPARQL endpoint implementing this standard. Using service descriptions provides a powerful way to dynamically add endpoints to and remove endpoints from the query engine in a manner that is completely transparent to the user. To reduce execution costs we introduced basic query optimization for SPARQL queries. Our experiments show that the optimization algorithm can drastically improve query performance and allows SPARQL queries over distributed sources to be answered in reasonable time. Because the algorithm only relies on a very small amount of statistical information we expect that further improvements are possible using techniques. An important issue when dealing with data from multiple data sources is the difference in the vocabularies used and in the representation of information. In further work, we plan to work on mapping and translation rules between the vocabularies used by different SPARQL endpoints. We will also investigate generalizing the query patterns that can be handled, as well as blank nodes and identity relationships across graphs.

Future work

In further work, we plan to work on mapping and translation rules between the vocabularies used by different SPARQL endpoints. We will also investigate generalizing the query patterns that can be handled, as well as blank nodes and identity relationships across graphs.

Approach

Positive Aspects: Query rewriting and cost-based query optimization to speed up query execution.

Implementations

Download-page: http://darq.sf.net/

Information Representation: RDF

Data Catalogue: Service Description

Runs on OS: Linux, SunOS 5.10

Vendor: Open Source

Uses Framework: ARQ

Has Documentation URL: http://darq.sf.net/

Programming Language: Java

Version: 1.0

Platform: Jena

Toolbox: No data available now.

GUI: No

Research Problem

Subproblem of: Querying Distributed RDF Data Sources

RelatedProblem: transparent query federation

Evaluation

Experiment Setup: We split all data over two Sun-Fire-880 machines (8x sparcv9 CPU, 1050 MHz, 16 GB RAM) running SunOS 5.10. The SPARQL endpoints were provided using Virtuoso Server 5.0.37 with an allowed memory usage of 8 GB. Note that, although we used only two physical servers, there were five logical SPARQL endpoints. DARQ was running on Sun Java 1.6.0 on a Linux system with Intel Core Duo CPUs, 2.13 GHz and 4 GB RAM. The machines were connected over a standard 100 Mbit network connection.

Evaluation Method: Evaluate the performance of the DARQ query engine.

Hypothesis: -

Description: In this section we evaluate the performance of the DARQ query engine. The prototype was implemented in Java as an extension to ARQ. We used a subset of DBpedia. DBpedia contains RDF information extracted from Wikipedia. The dataset is offered in different parts.

Dimensions: Performance

Benchmark used: Subset of DBpedia.

Results: The experiments show that our optimizations significantly improve query evaluation performance. For query Q1 the execution times of optimized and unoptimized execution are almost the same. This is because the query plans are the same in both cases, and using bind joins for all sub-queries in order of appearance is exactly the right strategy. For queries Q2 and Q4 the unoptimized queries took longer than 10 min to answer and timed out, whereas the execution time of the optimized queries is quite reasonable. The optimized execution of Q1 and Q2 takes almost the same time because Q2 is rewritten into Q1.
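The bind join mentioned in the results can be sketched as follows. This is a generic illustration of the technique, not DARQ's implementation: the endpoints are simulated as in-memory lists of triples, and `match`, `evaluate`, `bind_join`, and the sample data are all hypothetical names introduced here.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None on failure."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must agree with any value it is already bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def evaluate(pattern, triples, binding):
    """Stand-in for sending one sub-query, with `binding` applied, to an endpoint."""
    for triple in triples:
        extended = match(pattern, triple, binding)
        if extended is not None:
            yield extended


def bind_join(left_bindings, pattern, triples):
    """For each left-hand binding, query the right-hand source with it bound."""
    for binding in left_bindings:
        yield from evaluate(pattern, triples, binding)


# Two simulated endpoints (assumed sample data).
films = [
    ("ex:film1", "dbo:director", "ex:alice"),
    ("ex:film2", "dbo:director", "ex:bob"),
]
people = [("ex:alice", "foaf:name", "Alice")]

left = list(evaluate(("?film", "dbo:director", "?person"), films, {}))
rows = list(bind_join(left, ("?person", "foaf:name", "?name"), people))
```

Because every left-hand binding triggers a request to the right-hand source, the order in which sub-queries are joined determines how many requests are sent, which is why the choice of join order matters for execution time.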
