SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions

From Openresearch
Jump to: navigation, search
SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions
SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions
Bibliographical Metadata
Subject: Querying Distributed RDF Data Sources
Year: 2011
Authors: Olaf Gorlitz, Steffen Staab
Venue COLD
Content Metadata
Problem: SPARQL Query Federation
Approach: query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions
Implementation: SPLENDID
Evaluation: query execution performance evaluation

Abstract

In order to leverage the full potential of the Semantic Web it is necessary to transparently query distributed RDF data sources in the same way as it has been possible with federated databases for ages. However, there are significant differences between the Web of (linked) Data and the traditional database approaches. Hence, it is not straightforward to adapt successful database techniques for RDF federation. Reasons are the missing cooperation between SPARQL endpoints and the need for detailed data statistics for estimating the costs of query execution plans. We have implemented SPLENDID, a query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions.

Conclusion

SPLENDID allows for transparent query federation over distributed SPARQL endpoints. In order to achieve a good query execution performance, data source selection and query optimization is based on basic statistical information which is obtained from VOID descriptions. The utilization of open semantic web standards, like VOID and SPARQL endpoints, allows for flexible integration of various distributed and linked RDF data sources. We have described in detail the implementation of the data source selection and the join order optimization. The evaluation shows that our approach can achieve good query performance and is competitive compared to other state-of-the-art federation implementations. In our analysis of the source selection we came to the conclusion that at least predicate and type statistics should be included in VOID description for RDF datasets. The use of 3rd party sameAs links, however, can significantly increase the number of requests and thus, hamper the efficiency of query execution plans. The comparison of the two employed physical join implementations has shown that the network overhead plays an important role. Both hash join and bind join can significantly reduce the query processing time for certain types of queries. With SPLENDID we also like to advocate the adoption of VOID statistics for Linked Data. As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.

Future work

As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.

Approach

Positive Aspects: {{{PositiveAspects}}}

Negative Aspects: {{{NegativeAspects}}}

Limitations: {{{Limitations}}}

Challenges: {{{Challenges}}}

Proposes Algorithm: {{{ProposesAlgorithm}}}

Methodology: {{{Methodology}}}

Requirements: {{{Requirements}}}

Limitations: {{{Limitations}}}

Implementations

Download-page: https://github.com/semagrow/fork-splendid-server

Access API: No data available now.

Information Representation: RDF

Data Catalogue: VoID

Runs on OS: OS independent

Vendor: Open source

Uses Framework: -

Has Documentation URL: No data available now.

Programming Language: Java

Version: 1.0

Platform: Sesame

Toolbox: No data available now.

GUI: No

Research Problem

Subproblem of: Querying Distributed RDF Data Sources

RelatedProblem: retrieve and join the result tuples

Motivation: {{{Motivation}}}

Evaluation

Experiment Setup: Due to the unpredictable availability and latency of the original SPARQL endpoints of the benchmark dataset we used local copies of them which were hosted on five 64bit Intel(R) Xeon(TM) CPU 3.60GHz server instances running Sesame 2.4.2 with each instance providing the SPARQL endpoint for one life science and for one cross domain dataset. The evaluation was performed on a separate server instance with 64bit Intel(R) Xeon(TM) CPU 3.60GHz and a 100Mbit network connection.

Evaluation Method : The goal of the evaluation is to show that SPLENDID is able to achieve good query execution performance for real world federation scenarios.

Hypothesis: -

Description: we investigated how the information from the VOID descriptions effect the accuracy of the source selection. For each query, we look at the number of sources selected and the resulting number of requests to the SPARQL endpoints. We tested three different source selection approaches, based on 1) predicate index only (no type information), 2) predicate and type index, and 3) predicate and type index and grouping of sameAs patterns as described in Section 4.2.

Dimensions: Performance

Benchmark used: FedBench

Results: AliBaba and DARQ fail to return results for six out of the 14 queries for different reasons. AliBaba generates malformed sub queries for CD3, CD5, LS6, and LS7. DARQ can not handle the unbound predicate in CD1 and LS2. For CD3 and CD5 DARQ opens too many connections to GeoNames. All other unsuccessful queries take longer than the time limit of five minutes. Overall, FedX has the best query evaluation performance. The reason is its novel and efficient query execution based on block transmission of result tuples and parallelization of joins. However, there is only a significant difference between FedX and SPLENDID for CD6, CD7, LS3, LS5-7. For the other queries SPLENDID is close to FedX and for CD3 and CD4 even slightly faster, which indicates that SPLENDID, indeed, generates better query execution plans.

Access APINo data available now. +
Event in seriesCOLD +
Has BenchmarkFedBench +
Has Challenges{{{Challenges}}} +
Has DataCatalougeVoID +
Has Descriptionwe investigated how the information from t
we investigated how the information from the VOID

descriptions effect the accuracy of the source selection. For each query, we look at the number of sources selected and the resulting number of requests to the SPARQL endpoints. We tested three different source selection approaches, based on 1) predicate index only (no type information), 2) predicate and type index, and 3) predicate and type

index and grouping of sameAs patterns as described in Section 4.2.
meAs patterns as described in Section 4.2. +
Has DimensionsPerformance +
Has DocumentationURLhttp://No data available now. +
Has Downloadpagehttps://github.com/semagrow/fork-splendid-server +
Has EvaluationQuery execution performance evaluation +
Has EvaluationMethodThe goal of the evaluation is to show that SPLENDID is able to achieve good query execution performance for real world federation scenarios. +
Has ExperimentSetupDue to the unpredictable availability and
Due to the unpredictable availability and latency of the original SPARQL endpoints

of the benchmark dataset we used local copies of them which were hosted on five 64bit Intel(R) Xeon(TM) CPU 3.60GHz server instances running Sesame 2.4.2 with each instance providing the SPARQL endpoint for one life science and for one cross domain dataset. The evaluation was performed on a separate server instance with 64bit Intel(R)

Xeon(TM) CPU 3.60GHz and a 100Mbit network connection.
3.60GHz and a 100Mbit network connection. +
Has GUINo +
Has Hypothesis- +
Has ImplementationSPLENDID +
Has InfoRepresentationRDF +
Has Limitations{{{Limitations}}} +
Has NegativeAspects{{{NegativeAspects}}} +
Has PositiveAspects{{{PositiveAspects}}} +
Has Requirements{{{Requirements}}} +
Has ResultsAliBaba and DARQ fail to return results fo
AliBaba and DARQ fail to return results for six out of the 14 queries for

different reasons. AliBaba generates malformed sub queries for CD3, CD5, LS6, and LS7. DARQ can not handle the unbound predicate in CD1 and LS2. For CD3 and CD5 DARQ opens too many connections to GeoNames. All other unsuccessful queries take longer than the time limit of five minutes. Overall, FedX has the best query evaluation performance. The reason is its novel and efficient query execution based on block transmission of result tuples and parallelization of joins. However, there is only a significant difference between FedX and SPLENDID for CD6, CD7, LS3, LS5-7. For the other queries SPLENDID is close to FedX and for CD3 and CD4 even slightly faster, which

indicates that SPLENDID, indeed, generates better query execution plans.
d, generates better query execution plans. +
Has SubproblemQuerying Distributed RDF Data Sources +
Has Version1.0 +
Has abstractIn order to leverage the full potential of
In order to leverage the full potential of the Semantic Web it is necessary to transparently query distributed RDF data sources in the same way as it has been possible with federated databases for ages. However, there are significant differences between the Web of (linked) Data and the traditional database approaches. Hence, it is not straightforward to adapt successful database techniques for RDF federation. Reasons are the missing cooperation between SPARQL endpoints and the need for detailed data statistics for estimating the costs of query execution plans. We have implemented SPLENDID, a query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions.
ical data obtained from voiD descriptions. +
Has approachquery optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions +
Has authorsOlaf Gorlitz + and Steffen Staab +
Has conclusionSPLENDID allows for transparent query fede
SPLENDID allows for transparent query federation over distributed SPARQL endpoints. In order to achieve a good query execution performance, data source selection and query optimization is based on basic statistical information which is obtained from VOID descriptions. The utilization of open semantic web standards, like VOID and SPARQL endpoints, allows for flexible integration of various distributed and linked RDF data sources. We have described in detail the implementation of the data source

selection and the join order optimization. The evaluation shows that our approach can achieve good query performance and is competitive compared to other state-of-the-art federation implementations. In our analysis of the source selection we came to the conclusion that at least predicate and type statistics should be included in VOID description for RDF datasets. The use of 3rd party sameAs links, however, can significantly increase the number of requests and thus, hamper the efficiency of query execution plans. The comparison of the two employed physical join implementations has shown that the network overhead plays an important role. Both hash join and bind join can significantly reduce the query processing time for certain types of queries. With SPLENDID we also like to advocate the adoption of VOID statistics for Linked Data. As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query

execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.
allow for more efficient query execution. +
Has future workAs next steps, we plan to investigate whet
As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.
allow for more efficient query execution. +
Has motivation{{{Motivation}}} +
Has platformSesame +
Has problemSPARQL Query Federation +
Has relatedProblemRetrieve and join the result tuples +
Has subjectQuerying Distributed RDF Data Sources +
Has vendorOpen source +
Has year2011 +
ImplementedIn ProgLangJava +
Proposes Algorithm{{{ProposesAlgorithm}}} +
RunsOn OSOS independent +
TitleSPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions +
Uses Framework- +
Uses Methodology{{{Methodology}}} +
Uses ToolboxNo data available now. +