PROPOSAL: Handling Asynchronous Operation Requests

Replies: 10 - Last Post: July 16, 2009 18:57
by: MarcHadley
showing 1 - 11 of 11
 
Posted: June 09, 2009 19:55 by Craig McClanahan
BACKGROUND

In many cases, requested operations (such as starting a VM) can take a significant (but variable) amount of time, and it is not appropriate for the server to delay responding to the request until the operation completes. There are several issues with the way that the current API deals with these scenarios:

  • Server returns 201 (for create operations) or 204 (for control operations) even though the operation has not actually been completed.

  • In some cases, the client can poll for completion by doing a GET on the underlying resource and monitoring a status field, but such status fields are not universally available.

  • Polling requests return the entire representation of the underlying resource, which can be computationally expensive on the server and wasteful if all the client wants to know is whether an operation has completed or not.

  • There is no general, unambiguous way to know whether an asynchronous operation has completed (and possibly failed) without in-depth knowledge of individual status field values.


PROPOSAL SUMMARY

This is a proposal to modify the Sun Cloud API to address the shortcomings identified above. The basic concept is to support a unified approach for initiating *all* asynchronous operations, and for checking them for completion, across the entire service. The proposal includes the following elements:

  • A new Status resource that can be used for checking
    completion status of an asynchronous operation.

  • A requirement that the server, when it accepts a request
    to begin an asynchronous operation, MUST return an
    HTTP status 202 ("Accepted"), along with the initial
    version of the Status resource.

  • The returned Status resource MUST include a "uri" field
    that can be used to GET an updated representation of
    the completion status.

  • Other possible mechanisms for completion notification
    are described in the PROPOSAL DETAILS section below.


The operations that should implement this new approach are listed in the AFFECTED OPERATIONS section.

PROPOSAL DETAILS

(1) New Resource Model:

A new "Status" resource model will be added (with media type "application/vnd.com.sun.cloud.Status+json"), with the following fields:

  • "uri" - URI upon which the client may perform GET
    operations to poll for completion. Each accepted
    asynchronous operation will receive a unique status
    URI, so that multiple operations may be initiated and
    tracked at once.

  • "status" - Integer code describing the completion
    status (0=success, nonzero=error code), returned
    only when "progress" returns 100.

  • "message" - Message suitable for reporting completion
    status to a human user, returned only when "progress"
    returns 100.

  • "progress" - Integer percent completed indicator, which
    MUST return 100 *only* when the operation has been
    completed (either successfully or unsuccessfully).
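
As an illustration of the field rules above, a Status representation might be modelled along the following lines. This is only a sketch in Python, not part of the proposal: the class name, the to_json helper, and the example URI are assumptions, and only the four field names and the media type come from the text above.

```python
import json
from dataclasses import dataclass
from typing import Optional

MEDIA_TYPE = "application/vnd.com.sun.cloud.Status+json"


@dataclass
class Status:
    """Hypothetical in-memory form of the proposed Status resource."""
    uri: str
    progress: int = 0
    status: Optional[int] = None
    message: Optional[str] = None

    def __post_init__(self):
        if not 0 <= self.progress <= 100:
            raise ValueError("progress must be between 0 and 100")
        done = self.progress == 100
        if done and self.status is None:
            raise ValueError("a completed Status needs a status code")
        if not done and (self.status is not None or self.message is not None):
            raise ValueError("status and message are only returned at progress 100")

    def to_json(self) -> str:
        # "status" and "message" are omitted until the operation completes.
        body = {"uri": self.uri, "progress": self.progress}
        if self.progress == 100:
            body["status"] = self.status
            if self.message is not None:
                body["message"] = self.message
        return json.dumps(body, sort_keys=True)


# Example: the initial representation for a newly accepted operation.
print(Status(uri="/status/42").to_json())
```

Rejecting a "status" value before "progress" reaches 100 is one reading of "returned only when progress returns 100"; a more lenient implementation could simply ignore the early value.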

(2) Server Responsibilities on Accepting Asynchronous Operation Requests

A new section will be added to the "Common Behaviors" page, documenting the server's required behavior when it accepts an asynchronous operation request:

  • Initial response MUST have HTTP status 202 ("Accepted"),
    with a media type of "application/vnd.com.sun.cloud.Status+json"
    and entity body containing the initial Status resource for this
    operation. In the status resource, the "uri" and "progress" fields
    MUST be populated, and the "progress" field MUST contain
    a value of 0 indicating that the operation is beginning.

  • The URI value returned in the initial response MUST respond
    to GET requests by returning an updated version of the Status
    resource. Typically, the "progress" field will be increased towards
    100, but MUST NOT be set to 100 until the operation completes.

  • When the operation has completed (either successfully or
    unsuccessfully), a "final" representation of the Status resource
    MUST be returned, with a "progress" field set to 100, and a
    "status" field set to 0 (for successful completion) or a non-zero
    value for unsuccessful completion.

  • Once the operation completes, the status URI for this operation
    MUST remain valid for some TBD time period (so that a client that
    misses a response can keep polling and still obtain the "final" Status).
    After that time, the server MAY start returning 404s for that URI.

  • The server MUST implement timeout or other mechanisms to
    ensure that an operation will be completed (either successfully
    or unsuccessfully) in some "reasonable" amount of time.
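
The server-side rules above could be sketched roughly as follows. This is an illustrative in-memory model only: the OperationTracker class, its method names, and the "/status/" URI prefix are all assumptions, and the timeout requirement and the TBD retention period are left out.

```python
import itertools


class OperationTracker:
    """Hypothetical in-memory registry of asynchronous operations."""

    def __init__(self, base="/status/"):
        self._base = base
        self._ids = itertools.count(1)
        self._ops = {}

    def accept(self):
        """Register a new operation; return (HTTP status, initial Status body)."""
        uri = f"{self._base}{next(self._ids)}"
        self._ops[uri] = {"uri": uri, "progress": 0}
        return 202, dict(self._ops[uri])

    def advance(self, uri, progress):
        """Record intermediate progress; 100 is reserved for completion."""
        if not 0 <= progress < 100:
            raise ValueError("use complete() to report progress 100")
        self._ops[uri]["progress"] = progress

    def complete(self, uri, status=0, message=""):
        """Store the final Status: progress 100 plus a status code."""
        self._ops[uri].update(progress=100, status=status, message=message)

    def expire(self, uri):
        """Forget an operation once the retention period has passed."""
        self._ops.pop(uri, None)

    def get(self, uri):
        """Answer a GET on a status URI: 200 with the Status, or 404."""
        if uri not in self._ops:
            return 404, None
        return 200, dict(self._ops[uri])
```

A real server would persist this state and call expire() from a timer; the sketch only shows the 202 / poll / final / 404 lifecycle.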

(3) Client Notification Options

Upon receipt of the 202 response indicating that the asynchronous operation has begun, the client MAY (but is not required to) poll the returned URI to monitor for completion. It is also possible for the client to monitor the entire representation of the underlying resource; however, clients that are going to poll SHOULD poll the status URI instead, due to lower resource consumption on the server.
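
A client-side polling loop for the status URI might look like the sketch below. The function name, the fixed polling interval, and the poll limit are assumptions; fetch_status stands in for whatever HTTP client call performs the GET and decodes the Status body into a dict.

```python
import time


def poll_until_done(fetch_status, status_uri, interval=2.0, max_polls=30,
                    sleep=time.sleep):
    """Poll status_uri until progress reaches 100; return the final Status.

    fetch_status is a caller-supplied (hypothetical) function that performs
    the GET and returns the decoded Status representation as a dict.
    """
    for _ in range(max_polls):
        current = fetch_status(status_uri)
        if current.get("progress") == 100:
            return current
        sleep(interval)
    raise TimeoutError(f"{status_uri} did not complete after {max_polls} polls")
```

The sleep parameter is injectable only so that the loop can be exercised without real delays.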

In addition, server implementors MAY support a "webhook" mechanism for outbound (service back to client) notification of progress and completion without a requirement for continual polling. Such a mechanism would operate as follows:

  • The inbound representation of the operation request
    MAY contain a "webhook" field, whose value is a URI
    where the client expects a callback. If this field is not
    present, no webhook callback will be performed.

  • When the operation has completed (either successfully
    or unsuccessfully), the server will perform a POST
    request to the webhook URI, with a content type of
    "application/vnd.com.sun.cloud.Status+json" and an
    entity body containing the final Status resource for this operation.

  • Client can match a completion report back to the original
    request by comparing the "uri" field value to the one returned
    in the initial Status response, or by providing unique
    webhook URIs for each asynchronous request.

  • The server callback MUST be attempted only once, and there
    will be no error reporting visible to the client if the callback fails.
    Clients SHOULD fall back to polling the status URI (after
    a period of time) to reliably detect operation completion.

  • TBD: Authentication? Certificates for https callbacks?
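
On the client side, matching a webhook callback to the original request by its "uri" field could be sketched as below. The PendingOperations class and the free-form label are assumptions made for illustration; authentication of the callback is not addressed.

```python
class PendingOperations:
    """Hypothetical client-side bookkeeping for webhook callbacks."""

    def __init__(self):
        self._pending = {}

    def register(self, initial_status, label):
        """Remember an accepted operation, keyed by its status URI."""
        self._pending[initial_status["uri"]] = label

    def on_callback(self, final_status):
        """Handle the body POSTed to the webhook; return (label, succeeded).

        Returns None when the "uri" matches no pending operation, for
        example when the operation was already resolved by polling.
        """
        label = self._pending.pop(final_status["uri"], None)
        if label is None:
            return None
        return label, final_status["status"] == 0

    def unresolved(self):
        """Status URIs with no callback yet: candidates for fallback polling."""
        return sorted(self._pending)
```

Because the callback is attempted only once, anything still listed by unresolved() after a suitable wait would need to be polled.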

AFFECTED OPERATIONS

The server MUST implement the asynchronous operation behavior described in this proposal for the following requests:

* Requests to Cluster Resources:
- Create VM
- Create VNet
- Control Cluster
- Delete Cluster

* Requests to VM Resources:
- Delete VM
- Control VM

The server MAY implement the asynchronous behavior described in this proposal for the following requests, or it MAY perform these operations synchronously (and therefore return a 204 for success or 4xx/5xx for failure):

* Requests to Snapshot Resources:
- Create Clone
- Roll Back Filesystem to Snapshot

* Requests to VDC Resources:
- Create Volume

* Requests to VM Resources:
- Attach VM to Public Address or VNet
- Detach VM from Public Address or VNet

* Requests to VNet Resources:
- Delete VNet

* Requests to Volume Resources:
- Delete Volume
 
Posted: June 15, 2009 23:12 by Tim Bray
This all seems sane; I have a couple of specific suggestions.


  • There should be a distinguished value (I suggest -1) for the progress field, meaning "I'm working on it but I don't really know how long it will take". I have spent enough time looking at "progress bars" that are out-and-out lies that I really don't want to force implementors to try to report information that they may not in principle have.
  • I suggest losing the web-hooks. Architecturally I love the notion, but let's grow this API incrementally. I am quite confident that implementors can build Status resources and clients can use them. Let's get that shaken down first and acquire some operational experience.
  • I would radically simplify the MUST/MAY around asynchronous requests. I suggest that for every request that is a PUT or a POST or a DELETE, the server MAY follow the asynchronous return path. Much simpler, gives implementors more flexibility, and I don't think it complicates clients' lives any more.
 
Posted: June 16, 2009 00:22 by Craig McClanahan
Regarding your points in order:

    * I think a "progress" value of zero accomplishes what you are looking for, because it essentially indicates progress has started.
    * I am OK with dumping web hooks for this time around.
    * Allowing the server any MAY latitude at all (on whether to return a status message or a different representation) actually does complicate life for a client library that wants to present an O-O view of the API. The simplest thing from the client perspective would be to require the server to *always* return a status resource, with progress set to 100 if it turns out to have been synchronous.
 
Posted: June 16, 2009 18:40 by Tim Bray
We're in sync on web-hooks.

I agree with the simplification on non-GET: specify that it always has to be a 202 with a Status body. It means we lose the nice RESTian 201, but that seems in good sync with the reality: whether these operations are synchronous or not just isn't predictable; so the best model is asynchrony by default.

I still have heartburn about using "progress":0 this way. It seems there are two different semantics at work here. It could mean "I just got started, put up that progress bar and check back regularly" or it could mean "I sent a request to an exterior service with unpredictable latency, don't make any promises". Here are some alternatives:

  • Introduce a distinguished value like -1
  • Make the "progress" field optional. Have the completion be signaled by the "status" field switching from 0 to some HTTP status code.
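
To make the first alternative concrete, a client could interpret the "progress" field as in the sketch below. It assumes the suggested convention that -1 means "working, duration unknown", which is only a proposal in this thread; the function name and the display strings are likewise made up for illustration.

```python
def describe_progress(status):
    """Turn a Status dict into a short display string.

    Assumes progress == -1 means "working, no estimate available";
    that value is a suggestion in this thread, not an agreed rule.
    """
    progress = status["progress"]
    if progress == 100:
        outcome = "succeeded" if status.get("status") == 0 else "failed"
        return f"finished ({outcome})"
    if progress == -1:
        return "in progress (no estimate)"
    return f"in progress ({progress}%)"
```

With this convention a client can show a spinner for -1 and a real progress bar for 0 through 99, so "progress":0 keeps a single meaning.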
 
Posted: June 22, 2009 18:30 by Tim Bray
Now that I'm implementing this, I'm actually starting to understand it. I have two specific issues.

First, the proposal says "Each accepted asynchronous operation will receive a unique status URI, so that multiple operations may be initiated and tracked at once." For one thing, this is hard to implement; for another, I'm not sure it really works. Imagine three different clients operating in parallel; two issue "connect" operations against some VM and one issues a "disconnect" operation. It's going to be a lot of work to sequence and identify these, and furthermore, this seems to encourage a bad practice. I've been guilty of more or less ignoring concurrency issues (because multiple operations in parallel on the same VDC component seemed like a vanishingly small use case) but if we're going to support these, the right way would be to require ETags support and thus rule out lost updates. In any case, it's easy to understand a status resource along the lines of "Does this VM exist yet?" or "Is this network attached yet?" but I'm not sure the unique-status-resource-per-op is anywhere near an 80/20 point in terms of cost-benefit.

The status resource needs another field for the (very common) case where you're creating something. Back in the synchronous days, this was no problem; you responded to the post with a 201 and a Location header for whatever it was you'd created plus the (redundant, strictly speaking) URI in the body of the returned representation. Now, when I create something and get a status resource to poll, I need a way to find out, when it records the operation is completed, what the URI of the new resource is. A naive suggestion is simply to have two fields, "status_uri" and "target_uri", the first being the typical "uri" field and the second to be used for things like new-resource URIs.
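
A client using the two suggested fields might extract the new resource's URI as sketched below. The field names "status_uri" and "target_uri" come from the suggestion above; the function name and the error handling are assumptions.

```python
def created_resource_uri(final_status):
    """Return the URI of the newly created resource from a final Status.

    Assumes the suggested "target_uri" field and the proposal's
    conventions: progress 100 means done, status 0 means success.
    """
    if final_status.get("progress") != 100:
        raise ValueError("operation has not completed yet")
    if final_status.get("status") != 0:
        raise RuntimeError(final_status.get("message", "operation failed"))
    return final_status["target_uri"]
```

This plays the role that the Location header played in the synchronous 201 response.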
 
Posted: June 22, 2009 20:14 by Craig McClanahan
As Kenai forums do not have very good interspersed reply capabilities yet (although the outlook is promising on the RFE that I filed), I'm pulling out just bullet point topics and responding to them.

Regarding multiple operations, from the client perspective I think this is a MUST HAVE. Consider a scenario where I've got 20 webserver VMs deployed, but only 10 of them started at the moment to handle the current load. Suddenly, my service gets slashdotted (or, I guess, nowadays it would be twitterdotted), and I need five more webserver clients started *right now* to handle the load. I'm definitely going to want them started simultaneously (which the back end had *better* be able to support), not one at a time.

Regarding multiple operations on the *same* VM instance, I agree this should not be allowed -- perhaps return a 409 conflict. However, the likelihood of this situation actually occurring is dramatically lessened if the server would only return "controller" operation URIs for state changes that are valid from the current state, AND you "expire" an operation URI once the operation has been completed (so it can't accidentally get tried again later). Side note -- we might actually want a "cancel" operation to abort something that is in flight.

Regarding lost updates, I see only the following scenarios: (a) client didn't get the initial status response, so retries the POST but must be ready for a 409 if the server actually started this operation, and (b) client got the initial status response and fails to GET a status update successfully; retrying the GET works (per the HTTP spec) with no side effects. I guess I don't see a problem here?

I agree that we need both URIs, and am fine with "status_uri" and "target_uri" as the field names.

Regarding "hard to implement", I'm being an advocate for the client here, so it's a bit difficult to be sympathetic :-). However, you're pretty much certain to need some persistent storage to track in flight operations (both to assign them unique URIs and to detect conflicts) -- but a simple table with a row per in flight operation seems pretty straightforward to me. Base the status_uri value on the auto-generated primary key of this table, and make the row go away when the operation is completed (successfully or unsuccessfully). This approach even survives a restart of the bridge server, as long as your database doesn't get munched.
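
The table-per-operation idea could be sketched with SQLite as below. The table layout, the function names, and the "/status/<id>" URI scheme are assumptions for illustration. The UNIQUE constraint on the target is one possible way to detect a second operation on the same resource and answer 409.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) table of in-flight operations."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS operations ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # basis for the status URI
        " target TEXT NOT NULL UNIQUE,"           # one in-flight op per resource
        " progress INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def begin(db, target):
    """Start an operation; return (202, status_uri) or (409, None) on conflict."""
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO operations (target) VALUES (?)", (target,))
    except sqlite3.IntegrityError:
        return 409, None
    return 202, f"/status/{cur.lastrowid}"


def finish(db, status_uri):
    """Drop the row once the operation completes, successfully or not."""
    op_id = int(status_uri.rsplit("/", 1)[1])
    with db:
        db.execute("DELETE FROM operations WHERE id = ?", (op_id,))
```

Deleting the row immediately matches the description above; keeping completed rows for a while would be needed if the status URI has to stay valid after completion. AUTOINCREMENT is used so that an expired status URI is never handed out again.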








 
Posted: July 16, 2009 18:32 by Craig McClanahan
Received a reply from Marc Hadley via email, who was having problems posting it so I'm forwarding it here:
==================================================================================
The topic reminded me of a proposal for async SOAP request-response I wrote back when WS-Addressing was all the rage:

http://lists.w3.org/Archives/Public/public-ws-async-tf/2005Feb/0005.html

I think this is pretty similar to what you are suggesting, the main difference being the use of an initial 303 to redirect the client to the status resource and the inclusion of a Retry-After header to give the client a rough estimate of how long to wait before the first poll. I was wondering if you considered this approach instead of the initial 202 and accompanying entity body?

Marc.
 
Posted: July 16, 2009 18:42 by Craig McClanahan
Hi Marc,

A couple of thoughts on the approach you outlined in the message at the specified link.

* We don't have to worry about modifying SOAP (thank goodness :-) ), since
we're defining the initial API here.

* This seems to make the client do more work (receive a 303, parse the Location
header, do an extra GET) to get the initial status, and apparently every other
one as well.

* In some implementations of the cloud API, particular operations might actually
be synchronous ... meaning you'd get the initial status representation back
with progress already set to 100. Nothing extra for the client to do in that case.

* You don't specify it in your example, but does the status message include the
URI of itself? I would want that in order to be consistent with the rest of the
Cloud API, where you never have to worry about URIs that are not visible
in the representations.

* Retry-After is interesting, and could be incorporated into the proposed approach
if we liked it, but I'm not sure how valuable it really is.

Craig
 
Posted: July 16, 2009 18:57 by MarcHadley
(I figured out why I couldn't post, I had to bookmark the project first)

Taking your points in order:

- Amen to that

- There is an extra GET but only after the specified Retry-After. I think the cost of parsing the Location from the HTTP header is probably lower than extracting it from the response body since you'll be parsing the headers in either case. Also, some HTTP libraries can follow the redirect automatically so it looks like one request to a client.

- If the operation is synchronous then you could just return 200 OK or 204 No Content.

- Yes, the status, if not complete, includes Location and Retry-After headers.

- I think the utility of Retry-After depends on how well you can predict the duration of the operation. It could certainly help to cut down the frequency of polls by an eager client.
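
A polling client that honours Retry-After might compute its delay as in the sketch below. HTTP allows the header to carry either a number of seconds or an HTTP-date; the function name and the fallback default of 5 seconds are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=5.0):
    """Convert a Retry-After header value into a delay in seconds.

    Accepts delta-seconds or an HTTP-date. The (hypothetical) default is
    used when the header is absent or cannot be parsed.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "poll right away".
    return max(0.0, (when - now).total_seconds())
```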

Marc.
 
Posted: June 10, 2009 14:03 by jchalupa
Nice proposal! Thanks for putting it together.

Just very minor comments:

  • IMO, it should be OK to return the Status object with a non-zero progress value as a response to the initial request.
  • Alternatively, an incomplete Status could still include the status field with a special "in progress" code and the "In Progress..." message.
  • Some of the affected operations in the MUST category (e.g. Create VNet) might actually be performed synchronously and could be moved into the MAY category. However, both categories will need a more detailed review.
 
Posted: July 03, 2009 11:55 by Sam Johnston
This is the direction I've been pushing for OCCI too... the actuators seem elegant at first but once you start thinking about things like abandoning requests in process, monitoring progress and asynchronous events in general a "request" resource makes a lot more sense. The example I use is a backup which may not start until midnight and may take 12 hours from then to complete. Anyway this is an extract from my post "Is HTTP the HTTP of cloud computing?" (http://samj.net/2009/05/is-http-http-of-cloud-computing.html) back in May - I haven't fully codified it yet but it will likely look something like what you guys are doing (perhaps dropping stuff like the target_uri field in favour of a single Location: header with content negotiation):

RESTful State Machines

Something else which has not sat well with me until I spent the weekend ingesting the RESTful Web Services book (by Leonard Richardson and Sam Ruby) was the "actuator" concept we picked up from the Sun Cloud APIs. This breaks away from RESTful principles by exposing an RPC-style API for triggering state changes (e.g. start, stop, restart). Granted it's an improvement on the alternative (GETting a resource and PUTting it back with an updated state) as Tim Bray explains in RESTful Casuistry (to which Roy Fielding and Bill de hÓra also responded), but it still "feels funky". Sure it doesn't make any sense to try to "force" a monitored status to some other value (for example setting a "state" attribute to "running"), especially when we can't be sure that's the state we'll get to (maybe there will be an error or the transition will be dependent on some outcome over which we have no control). Similarly it doesn't make much sense to treat states as nouns, for example adding a "running" state to a collection of states (even if a resource can be "running" and "backing up" concurrently). But is using URLs as "buttons" representing verbs/transitions the best answer?

What makes more sense [to me] is to request a transition and check back for updates (e.g. by polling or HTTP server push). If it's RESTful to POST comments to an article (which in addition to its own contents acts as a collection of zero or more comments) then POSTing a request to change state to a [sub]resource also makes sense. As a bonus these can be parametrised (for example a "resize" request can be accompanied with a "size" parameter and a "stop" request sent with clarification as to whether an "ACPI Off" or "Pull Cord" is required). Transitions that take a while, like "format" on a storage resource, can simply return HTTP 202 Accepted so we've got support for asynchronous actions as well - indeed some requests (e.g. "backup") may not even be started immediately. We may also want to consider using something like Post Once Exactly (POE) to ensure that requests like "restart" aren't executed repeatedly and that we can cancel requests that the system hasn't had a chance to deal with yet.

Exactly how this should look in terms of URL layout I'm not sure (perhaps http://example.com/<resource>/requests) but being able to enumerate the possible actions as well as acceptable parameters (e.g. an enum for variations on "stop" or a range for "resize") would be particularly useful for clients.
