FedX: Optimization Techniques for Federated Query Processing on Linked Data

Bibliographical Metadata
Subject: Querying Distributed RDF Data Sources
Keywords: Not available
Year: 2011
Authors: Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt
Venue: ISWC
Content Metadata
Problem: SPARQL Query Federation
Approach: Join processing and optimization approaches
Implementation: FedX
Evaluation: The practicability and efficiency of FedX

Abstract

Motivated by the ongoing success of Linked Data and the growing amount of semantic data sources available on the Web, new challenges to query processing are emerging. Especially in distributed settings that require joining data provided by multiple sources, sophisticated optimization techniques are necessary for efficient query processing. We propose novel join processing and grouping techniques to minimize the number of remote requests, and develop an effective solution for source selection in the absence of preprocessed metadata. We present FedX, a practical framework that enables efficient SPARQL query processing on heterogeneous, virtually integrated Linked Data sources. In experiments, we demonstrate the practicability and efficiency of our framework on a set of real-world queries and data sources from the Linked Open Data cloud. With FedX we achieve a significant improvement in query performance over state-of-the-art federated query engines.
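The abstract's two ideas, source selection without preprocessed metadata and grouping to reduce remote requests, can be illustrated with a minimal sketch. This page does not describe the probing mechanism, so the `ask` callable below is an assumption (one natural choice would be a SPARQL ASK query per triple pattern), and the endpoint URLs, predicates, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A triple pattern as (subject, predicate, object); variables start with "?".
TriplePattern = Tuple[str, str, str]


def select_sources(
    patterns: List[TriplePattern],
    endpoints: List[str],
    ask: Callable[[str, TriplePattern], bool],
) -> Dict[TriplePattern, List[str]]:
    """Probe each endpoint once per pattern and keep those that can answer it."""
    return {p: [e for e in endpoints if ask(e, p)] for p in patterns}


def exclusive_groups(
    sources: Dict[TriplePattern, List[str]],
) -> Dict[str, List[TriplePattern]]:
    """Group patterns whose only relevant source is the same endpoint."""
    groups: Dict[str, List[TriplePattern]] = defaultdict(list)
    for pattern, relevant in sources.items():
        if len(relevant) == 1:
            groups[relevant[0]].append(pattern)
    return dict(groups)


# Hypothetical stand-in for remote endpoints: the predicates each one holds.
toy_data = {
    "http://example.org/drugbank": {"db:name", "db:target"},
    "http://example.org/kegg": {"kegg:pathway"},
}


def toy_ask(endpoint: str, pattern: TriplePattern) -> bool:
    # Assumption: an endpoint is relevant if it knows the pattern's predicate.
    return pattern[1] in toy_data[endpoint]


patterns = [
    ("?drug", "db:name", "?name"),
    ("?drug", "db:target", "?target"),
    ("?target", "kegg:pathway", "?pathway"),
]
sources = select_sources(patterns, list(toy_data), toy_ask)
groups = exclusive_groups(sources)
```

In this toy federation the two `db:` patterns share a single relevant endpoint, so they can be sent together as one subquery, giving two remote requests instead of three.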

Conclusion

In this paper, we proposed novel optimization techniques for efficient SPARQL query processing in the federated setting. As revealed by our benchmarks, bound joins combined with our grouping and source selection approaches are effective in terms of performance. By minimizing the number of intermediate requests, we are able to improve query performance significantly compared to state-of-the-art systems. We presented FedX, a practical solution that allows for querying multiple distributed Linked Data sources as if the data resides in a virtually integrated RDF graph. Compatible with the SPARQL 1.0 query language, our framework allows clients to integrate available SPARQL endpoints on-demand into a federation without any local preprocessing. While we focused on optimization techniques for conjunctive queries, namely basic graph patterns (BGPs), there is additional potential in developing novel, operator-specific optimization techniques for distributed settings (in particular for OPTIONAL queries), which we are planning to address in future work. As our experiments confirm, the optimization of BGPs alone (combined with common equivalent rewritings) already yields significant performance gains. Important features for federated query processing are the federation extensions proposed for the upcoming SPARQL 1.1 language definition. These allow data sources to be specified directly within the query using the SERVICE operator, and mappings to be attached to the query as data using the BINDINGS operator. When implementing the SPARQL 1.1 federation extensions for our next release, FedX can exploit these language features to further improve performance. In fact, the SPARQL 1.1 SERVICE keyword is a trivial extension, which enhances our source selection approach with possibilities for manual specification of new sources and gives the query designer more control. Statistics can influence performance tremendously in a distributed setting.
Currently, FedX does not use any local statistics since we follow the design goal of on-demand federation setup. We aim at providing a federation framework in which data sources can be integrated ad hoc and used immediately for query processing. In a future release, (remote) statistics (e.g., using VoID) can be incorporated for source selection and to further improve our join order algorithm.
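To illustrate why shipping mappings with the query reduces intermediate requests, here is a minimal sketch of a batched (bound) join. It is not FedX's implementation: it uses the `VALUES` clause, the form the draft BINDINGS operator took in the final SPARQL 1.1 recommendation, and the function names, example IRIs, and batch size of 15 are all hypothetical.

```python
from typing import Iterable, Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Split the intermediate bindings into fixed-size chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bound_join_query(pattern: str, var: str, iris: Iterable[str]) -> str:
    """Build one query that evaluates `pattern` for a whole batch of bindings."""
    values = " ".join(f"<{iri}>" for iri in iris)
    return f"SELECT * WHERE {{ VALUES ?{var} {{ {values} }} {pattern} }}"


# Hypothetical intermediate result: 35 bindings for ?drug from an earlier join step.
bindings = [f"http://example.org/drug/{i}" for i in range(35)]
pattern = "?drug <http://example.org/name> ?name ."

# A nested-loop join would send one request per binding (35 requests);
# batching sends one request per chunk instead.
queries = [bound_join_query(pattern, "drug", b) for b in batches(bindings, 15)]
```

With 35 bindings and a batch size of 15, the sketch produces 3 remote queries instead of 35. The saving grows with the size of the intermediate result.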

Future work

While we focused on optimization techniques for conjunctive queries, namely basic graph patterns (BGPs), there is additional potential in developing novel, operator-specific optimization techniques for distributed settings (in particular for OPTIONAL queries), which we are planning to address in future work. In a future release, (remote) statistics (e.g., using VoID) can be incorporated for source selection and to further improve our join order algorithm.

Implementations

Download-page: http://www.uidops.com/FedX

Access API: No data available now.

Information Representation: RDF

Data Catalogue: -

Runs on OS: Windows 2008 Server 64bit

Vendor: No data available now.

Uses Framework: No data available now.

Has Documentation URL: http://www.uidops.com/FedX

Programming Language: Java

Version: 1.0

Platform: Sesame

Toolbox: No data available now.

GUI: Yes

Research Problem

Subproblem of: Federated RDF query processing

RelatedProblem: Find optimization techniques that allow for efficient SPARQL query processing on federated Linked Data

Motivation: Linked Data sources lack a global schema, so queries that need data from multiple sources have to be answered by querying the distributed datasets directly.

Evaluation

Experiment Setup: All experiments are carried out on an HP ProLiant DL360 G6 with a 2GHz 4-core CPU with 128KB L1 cache, 1024KB L2 cache, 4096KB L3 cache, 32GB 1333MHz RAM, and a 160GB SCSI hard drive. In all scenarios we assigned 20GB RAM to the process executing the query. In the SPARQL federation we additionally assign 1GB RAM to each individual SPARQL endpoint process.

Evaluation Method: Compare the results to state-of-the-art federated query processing engines.

Hypothesis: -

Description: We evaluate FedX and analyze the performance of our optimization techniques. With the goal of assessing the practicability of our system, we run various benchmarks and compare the results to state-of-the-art federated query processing engines. In our benchmark, we compare the performance of FedX with the competitive systems DARQ and AliBaba since these are comparable to FedX in terms of functionality and the implemented query processing approach. Unfortunately, we were not able to obtain a prototype of the system presented in [2] for comparison.

Dimensions: Performance

Benchmark used: FedBench project page: http://code.google.com/p/fbench/ ; http://www.openrdf.org/ ; http://www4.wiwiss.fu-berlin.de/drugbank/ ; http://kegg.bio2rdf.org/sparql

Results: With our optimization techniques, we are able to reduce the number of requests significantly, e.g., from 170,579 (DARQ) and 93,248 (AliBaba) to just 23 (FedX) for query CD3.
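As a quick worked check of the scale of this reduction, the request counts reported above for query CD3 can be turned into ratios. The snippet uses only the three figures from the result statement; the variable names are illustrative.

```python
# Remote request counts for query CD3, as reported in the results above.
requests = {"DARQ": 170_579, "AliBaba": 93_248, "FedX": 23}

# How many times more requests each competing engine issued than FedX.
reduction = {
    name: count / requests["FedX"]
    for name, count in requests.items()
    if name != "FedX"
}

for name, factor in reduction.items():
    print(f"{name}: about {factor:,.0f}x the requests of FedX")
```

For this query, DARQ issued roughly 7,416 times and AliBaba roughly 4,054 times as many requests as FedX.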
