FedX: Optimization Techniques for Federated Query Processing on Linked Data

Bibliographical Metadata
Subject: Querying Distributed RDF Data Sources
Year: 2011
Authors: Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt
Venue: ISWC
Content Metadata
Implementation: FedX

Abstract

Motivated by the ongoing success of Linked Data and the growing amount of semantic data sources available on the Web, new challenges to query processing are emerging. Especially in distributed settings that require joining data provided by multiple sources, sophisticated optimization techniques are necessary for efficient query processing. We propose novel join processing and grouping techniques to minimize the number of remote requests, and develop an effective solution for source selection in the absence of preprocessed metadata. We present FedX, a practical framework that enables efficient SPARQL query processing on heterogeneous, virtually integrated Linked Data sources. In experiments, we demonstrate the practicability and efficiency of our framework on a set of real-world queries and data sources from the Linked Open Data cloud. With FedX we achieve a significant improvement in query performance over state-of-the-art federated query engines.
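The abstract mentions source selection in the absence of preprocessed metadata but does not describe the mechanism. One plausible reading is that each federation member is probed at query time to see whether it can contribute to a triple pattern, for example with a SPARQL ASK query. The sketch below illustrates that idea only; the in-memory endpoints, the `ask` stand-in, and the `select_sources` helper are hypothetical names introduced here, not part of FedX.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A triple pattern; None marks a variable position.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def ask(triples: Set[Triple], pattern: Pattern) -> bool:
    """Stand-in for a SPARQL ASK request against one endpoint (assumption).

    Returns True if at least one triple matches the pattern.
    """
    return any(
        all(p is None or p == t for p, t in zip(pattern, triple))
        for triple in triples
    )


def select_sources(
    endpoints: Dict[str, Set[Triple]], patterns: List[Pattern]
) -> Dict[Pattern, List[str]]:
    """Map each triple pattern to the endpoints that can answer it."""
    return {
        pattern: [name for name, data in endpoints.items() if ask(data, pattern)]
        for pattern in patterns
    }


if __name__ == "__main__":
    # Toy federation with two hypothetical endpoints.
    endpoints = {
        "drugs": {("ex:aspirin", "ex:name", "Aspirin")},
        "people": {("ex:alice", "ex:name", "Alice"), ("ex:alice", "ex:knows", "ex:bob")},
    }
    patterns = [(None, "ex:name", None), (None, "ex:knows", None)]
    for pattern, sources in select_sources(endpoints, patterns).items():
        print(pattern, "->", sources)
```

Because the probing happens at query time, no statistics or indexes need to be built before an endpoint joins the federation, which matches the on-demand setup the paper emphasizes.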

Conclusion

In this paper, we proposed novel optimization techniques for efficient SPARQL query processing in the federated setting. As revealed by our benchmarks, bound joins combined with our grouping and source selection approaches are effective in terms of performance. By minimizing the number of intermediate requests, we are able to improve query performance significantly compared to state-of-the-art systems. We presented FedX, a practical solution that allows for querying multiple distributed Linked Data sources as if the data resides in a virtually integrated RDF graph. Compatible with the SPARQL 1.0 query language, our framework allows clients to integrate available SPARQL endpoints on-demand into a federation without any local preprocessing. While we focused on optimization techniques for conjunctive queries, namely basic graph patterns (BGPs), there is additional potential in developing novel, operator-specific optimization techniques for distributed settings (in particular for OPTIONAL queries), which we are planning to address in future work. As our experiments confirm, the optimization of BGPs alone (combined with common equivalent rewritings) already yields significant performance gains. Important features for federated query processing are the federation extensions proposed for the upcoming SPARQL 1.1 language definition. These allow data sources to be specified directly within the query using the SERVICE operator, and moreover allow mappings to be attached to the query as data using the BINDINGS operator. When implementing the SPARQL 1.1 federation extensions for our next release, FedX can exploit these language features to further improve performance. In fact, the SPARQL 1.1 SERVICE keyword is a trivial extension, which enhances our source selection approach with possibilities for manual specification of new sources and gives the query designer more control.
Statistics can influence performance tremendously in a distributed setting. Currently, FedX does not use any local statistics since we follow the design goal of on-demand federation setup. We aim at providing a federation framework in which data sources can be integrated ad hoc and used immediately for query processing. In a future release, (remote) statistics (e.g., using VoID) can be incorporated for source selection and to further improve our join order algorithm.
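The conclusion credits bound joins with reducing the number of intermediate requests, but this page does not show how such a join is built. A minimal sketch of the general idea, assuming that intermediate bindings are batched into a single query whose UNION branches each use a renamed variable, could look as follows. The function names, the batch size, and the exact query shape are illustrative assumptions, not the FedX implementation.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bound_join_query(subjects: List[str], predicate: str) -> str:
    """Build one SELECT query that evaluates `<s> <predicate> ?o` per subject.

    Each UNION branch gets its own variable (?o_0, ?o_1, ...) so that a
    result row can be traced back to the binding that produced it.
    """
    branches = [
        f"{{ <{s}> <{predicate}> ?o_{i} }}" for i, s in enumerate(subjects)
    ]
    return "SELECT * WHERE { " + " UNION ".join(branches) + " }"


def plan_requests(subjects: List[str], predicate: str, batch_size: int) -> List[str]:
    """Return the remote queries needed to join all subjects with the pattern."""
    return [bound_join_query(batch, predicate) for batch in chunked(subjects, batch_size)]


if __name__ == "__main__":
    # 32 hypothetical intermediate bindings for the join variable.
    subjects = [f"http://example.org/s{i}" for i in range(32)]
    requests = plan_requests(subjects, "http://example.org/p", batch_size=15)
    print(f"nested-loop join: {len(subjects)} requests")
    print(f"batched bound join: {len(requests)} requests")
```

With one request per binding, the number of remote calls grows linearly with the intermediate result size; batching divides that count by the batch size at the cost of larger individual queries.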

Future work

While we focused on optimization techniques for conjunctive queries, namely basic graph patterns (BGPs), there is additional potential in developing novel, operator-specific optimization techniques for distributed settings (in particular for OPTIONAL queries), which we are planning to address in future work. In a future release, (remote) statistics (e.g., using VoID) can be incorporated for source selection and to further improve our join order algorithm.

Implementations

GUI: No
