Difference between revisions of "ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints"

From Openresearch
Jump to: navigation, search
Line 1: Line 1:
 
{{Paper
 
{{Paper
|title=ANAPSIDs
+
|title=ANAPSID
 
|Title=ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints
 
|Title=ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints
 
|Field=Federated Question Answering
 
|Field=Federated Question Answering

Revision as of 16:05, 21 March 2018

ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints
ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints
Bibliographical Metadata
Authors: Maribel Acosta, Maria-Esther Vidal, Tomas Lampo, Julio Castillo
Content Metadata
Problem: SPARQL Query FederationQuery ExecutionSource Selection
Approach: Querying Distributed RDF Data Sources

Abstract

Following the design rules of Linked Data, the number of available SPARQL endpoints that support remote query processing is quickly growing; read more.,

Conclusion

{{{Conclusion}}}

Future work

{{{Future work}}}

Approach

Positive Aspects: {{{PositiveAspects}}}

Negative Aspects: {{{NegativeAspects}}}

Limitations: {{{Limitations}}}

Challenges: {{{Challenges}}}

Proposes Algorithm: {{{ProposesAlgorithm}}}

Methodology: {{{Methodology}}}

Requirements: {{{Requirements}}}

Limitations: {{{Limitations}}}

Implementations

Download-page: {{{Download-page}}}

Access API: {{{API}}}

Information Representation: {{{InfoRepresentation}}}

Data Catalogue: {{{Catalogue}}}

Runs on OS: {{{OS}}}

Property "RunsOn OS" (as page type) with input value "{{{OS}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Vendor: {{{vendor}}}

Uses Framework: {{{Framework}}}

Has Documentation URL: {{{DocumentationURL}}}

Programming Language: {{{ProgLang}}}

Property "ImplementedIn ProgLang" (as page type) with input value "{{{ProgLang}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Version: {{{Version}}}

Platform: {{{Platform}}}

Toolbox: {{{Toolbox}}}

GUI: {{{GUI}}}

Research Problem

Subproblem of: {{{Subproblem}}}

Property "Has Subproblem" (as page type) with input value "{{{Subproblem}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

RelatedProblem: {{{RelatedProblem}}}

Property "Has relatedProblem" (as page type) with input value "{{{RelatedProblem}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Motivation: {{{Motivation}}}

Evaluation

Experiment Setup: {{{ExperimentSetup}}}

Evaluation Method : {{{EvaluationMethod}}}

Hypothesis: {{{Hypothesis}}}

Description: {{{Description}}}

Dimensions: {{{Dimensions}}}

Benchmark used: {{{Benchmark}}}

Property "Has Benchmark" (as page type) with input value "{{{Benchmark}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Results: {{{Results}}}

Abstract

Following the design rules of Linked Data, the number of available SPARQL endpoints that support remote query processing is quickly growing; however, because of the lack of adaptivity, query executions may frequently be unsuccessful. First, fixed plans identified following the traditional optimize-then execute paradigm may time out as a consequence of endpoint availability. Second, because blocking operators are usually implemented, endpoint query engines are not able to incrementally produce results and may become blocked if data sources stop sending data. We present ANAPSID, an adaptive query engine for SPARQL endpoints that adapts query execution schedules to data availability and run-time conditions. ANAPSID provides physical SPARQL operators that detect when a source becomes blocked or data traÆc is bursty, and opportunistically, the operators produce results as quickly as data arrives from the sources. Additionally, ANAPSID operators implement main memory replacement policies to move previously computed matches to secondary memory avoiding duplicates. We compared ANAPSID performance with respect to RDF stores and endpoints and observed that ANAPSID speeds up execution time, in some cases, in more than one order of magnitude.

Technology

Suports SPARQL 1.1


ANAPSID Query Processing Engine

The ANAPSID query engine provides a set of operators able to gather data from different endpoints. Opportunistically, these operators produce results by joining tuples previously received even when endpoints become blocked. Additionally, the physical operators implement main memory replacement policies to move previously computed matches to secondary memory, ensuring no duplicate generation. Each join operator maintains a data structure called Resource Join Tuple (RJT), that records for each instantiation of the joint variable(s), the tuples that have already matched. Suppose that or the instantiation of the variable ?X with the resource r, the tuples {T1, ..., Tn} have matched, then the RJT will be the pair (r, {T1, ..., Tn}), where the first argument, head of the RJT, corresponds to the resource and the second, tail of the RJT, is the list of tuples.

Conclusion

We have defined ANAPSID, an adaptive query processing engine for RDF Linked Data accessible through SPARQL endpoints. ANAPSID provides a set of physical operators and an execution engine able to adapt the query execution to the availability of the endpoints and to hide delays from users. Reported experimental results suggest that our proposed techniques reduce execution times and are able to produce answers when other engines fail. Also, depending on the selectivity of the join operator and the data transfer delays, ANAPSID operators may overcome state-of-the-art Symmetric Hash Join operators. In the future, we plan to extend ANAPSID with more powerful and lightweight operators like Eddy and MJoin, which are able to route received responses through different operators and adapt the execution to unpredictable delays by changing the order in which each data item is routed.

Access API- +
Event in seriesISWC +
Has BenchmarkLinkedCT and DBPedia (english articles) + and LinkedSensorData-blizzards +
Has ChallengesQuery Decomposition, Query Optimization, and Query adaptation. +
Has DataCatalougePredicate list and endpoint status +
Has DescriptionWe report on runtime performance, which co
We report on runtime performance, which corresponds to the user time produced by the _ 􀀀_ command of the Unix operation system. Experiments in RDF-3X were run in both cold and warm caches; to run cold cache, we executed the same query five times by dropping the cache just before running the first iteration of the query. Each query executed by ANAPSID and SPARQL endpoints was run ten times, and we report on the average time.
times, and we report on the average time. +
Has DimensionsPerformance +
Has DocumentationURLhttps://github.com/anapsid/anapsid +
Has Downloadpagehttps://github.com/anapsid/anapsid +
Has EvaluationExecution time evaluation +
Has EvaluationMethodreport on the execution time of plans comprised of ANAPSID operators versus queries posed against SPARQL endpoints, and state-of-the-art RDF engines +
Has ExperimentSetupWe empirically analyze the performance of
We empirically analyze the performance of the proposed query processing techniques and report on the execution time of plans comprised of ANAPSID operators versus queries posed against SPARQL endpoints, and state-of-the-art RDF engines.

Three sets of queries were considered (Table of Figure 5(b)); each sub-query was executed as a query against its corresponding endpoint. Benchmark 1 is a set of 10 queries against LinkedSensorData-blizzards; each query can be grouped into 4 or 5 sub-queries. Benchmark 2 is a set of 10 queries over linkedCT with 3 or 4 subqueries. Benchmark 3 is a set of 10 queries with 4 or 5 sub-queries executed against linkedCT and DBPedia endpoints.

Experiments were executed on a Linux CentOS machine with an Intel Pentium Core2 Duo 3.0 GHz and 8GB RAM.
tel Pentium Core2 Duo 3.0 GHz and 8GB RAM. +
Has GUIYes +
Has Hypothesis- +
Has ImplementationANAPSID +
Has InfoRepresentationRDF +
Has Limitations- +
Has NegativeAspects- +
Has PositiveAspects- decompose the query into simple sub-plan
- decompose the query into simple sub-plans that can be executed by the remote endpoints.

- propose a set of physical operators that gather data generated by the endpoints, and quickly produce responses.

- an execution engine able to adapt the query execution to the availability of the endpoints and to hide delays from users.
e endpoints and to hide delays from users. +
Has Requirements{{{Requirements}}} +
Has ResultsWe observe that SHJ and ANAPSID operators
We observe that SHJ and ANAPSID operators are able to produce the first tuple faster than ARQ or Hash join, even in an ideal scenario with no delays; further, ARQ performance is clearly aff_ected by data transfer distribution and its execution time can be almost two orders of magnitude greater than the time of SHJ or ANAPSID. We notice that SHJ and ANAPSID are competitive, this is because the number of intermediate results is very small, and the benefits of the RJTs cannot be exploited. This suggests that the performance of ANAPSID operators depends on the selectivity of the join operator and the data transfer delays.
oin operator and the data transfer delays. +
Has SubproblemQuery processing on Linked Data +
Has Version1 +
Has abstractFollowing the design rules of Linked Data,
Following the design rules of Linked Data, the number of available SPARQL endpoints that support remote query processing is quickly growing; however, because of the lack of adaptivity, query executions may frequently be unsuccessful. First, fixed plans identified following the traditional optimize-then execute paradigm, may timeout as a consequence of endpoint availability. Second, because blocking operators are usually implemented, endpoint query engines are not able to incrementally produce results, and may become blocked if data sources stop sending data. We present ANAPSID, an adaptive query engine for SPARQL endpoints that adapts query execution schedulers to data availability and run-time conditions. ANAPSID provides physical SPARQL operators that detect when a source becomes blocked or data traÆc is bursty, and opportunistically, the operators produce results as quickly as data arrives from the sources. Additionally, ANAPSID operators implement main memory replacement policies to move previously computed matches to secondary memory avoiding duplicates. We compared ANAPSID performance with respect to RDF stores and endpoints, and observed that ANAPSID speeds up execution time, in some cases, in more than one order of magnitude.
ases, in more than one order of magnitude. +
Has approachQuerying Distributed RDF Data Sources, +
Has authorsMaribel Acosta +, Maria-Esther Vidal +, Tomas Lampo + and Julio Castillo +
Has conclusionWe have defined ANAPSID, an adaptive query
We have defined ANAPSID, an adaptive query processing engine for RDF Linked Data accessible through SPARQL endpoints. ANAPSID provides a set of physical operators and an execution engine able to adapt the query execution to the availability of the endpoints and to hide delays from users. Reported experimental results suggest that our proposed techniques reduce execution times and are able to produce answers when other engines fail. Also, depending on the selectivity of the join operator and the data transfer delays, ANAPSID operators may overcome state-of-the-art Symmetric Hash Join operators. In the future, we plan to extend ANAPSID with more powerful and lightweight operators like Eddy and MJoin, which are able to route received responses through different operators and adapt the execution to unpredictable delays by changing the order in which each data item is routed.
e order in which each data item is routed. +
Has future workIn the future we plan to extend ANAPSID wi
In the future we plan to extend ANAPSID with more powerful and lightweight operators like Eddy and MJoin, which are able to route received responses through different operators, and adapt the execution to unpredictable delays by changing the order in which each data item is routed.
e order in which each data item is routed. +
Has keywordsAdaptive Query Processing, ANAPSID, Linked Data +
Has motivationdistrinution of RDF datastores +
Has platformANAPSID +
Has problemSPARQL Query Federation +
Has relatedProblemDecompose queries into sub-queries that can be executed by the selected endpoints +
Has subjectAdaptive Query Processing +
Has vendor- +
Has year2011 +
ImplementedIn ProgLangPython 2.6.5 +
Proposes Algorithm- +
RunsOn OSLinux CentOS +
TitleANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints +
Uses FrameworkTwisted Network framework +
Uses MethodologyLightweight wrappers translate SPARQL sub-
Lightweight wrappers translate SPARQL sub-queries into calls to endpoints as well as convert endpoint answers into ANAPSID internal structures. Mediators maintain information about endpoint capabilities, statistics that describe their content and performance, and the ontology used to describe the data accessible through the endpoint.

The Local As View (LAV) approach is used to describe endpoints in terms of the ontology used in the endpoint dataset. Further, mediators implement query rewriting techniques, decompose queries into sub-queries against the endpoints, and gather data retrieved from the contacted endpoints. Currently, only SPARQL queries comprised of joins are considered; however, the rewriting techniques have been extended to consider all SPARQL operators,

but this piece of work is out of the scope of this paper. Finally, mediators hide delays, and produce answers as quickly as data arrives.
roduce answers as quickly as data arrives. +
Uses Toolbox- +