Craig McClanahan
|
Posted: June 09, 2009 19:55 by Craig McClanahan
|
|
BACKGROUND In many cases, requested operations (such as starting a VM) can take a significant (but variable) amount of time, and it is not appropriate for the server to delay responding to the request until the operation completes. There are several issues with the way that the current API deals with these scenarios:
PROPOSAL SUMMARY This is a proposal to modify the Sun Cloud API to address the shortcomings identified above. The basic concept is to support a unified approach for initiating, and checking for completion, of *all* asynchronous operations across the entire service. The proposal includes the following elements:
The operations that should actually implement this new approach are listed in the AFFECTED OPERATIONS section. PROPOSAL DETAILS (1) New Resource Model: A new "Status" resource model will be added (with media type "application/vnd.com.sun.cloud.Status+json"), with the following fields:
(2) Server Responsibilities on Accepting Asynchronous Operation Requests A new section will be added to the "Common Behaviors" page, documenting the server's required behavior when it accepts an asynchronous operation request:
(3) Client Notification Options Upon receipt of the 202 response indicating that the asynchronous operation has begun, the client MAY (but is not required to) poll the returned URI to monitor for completion. It is also possible for the client to monitor the entire representation of the underlying resource; however, a client that polls SHOULD poll the status URI instead, because doing so consumes fewer resources on the server. In addition, server implementors MAY support a "webhook" mechanism for outbound (service back to client) notification of progress and completion without a requirement for continual polling. Such a mechanism would operate as follows:
AFFECTED OPERATIONS
The server MUST implement the asynchronous operation behavior described in this proposal for the following requests:
* Requests to Cluster Resources:
  - Create VM
  - Create VNet
  - Control Cluster
  - Delete Cluster
* Requests to VM Resources:
  - Delete VM
  - Control VM
The server MAY implement the asynchronous behavior described in this proposal for the following requests, or it MAY perform these operations synchronously (and therefore return a 204 for success or 4xx/5xx for failure):
* Requests to Snapshot Resources:
  - Create Clone
  - Roll Back Filesystem to Snapshot
* Requests to VDC Resources:
  - Create Volume
* Requests to VM Resources:
  - Attach VM to Public Address or VNet
  - Detach VM from Public Address or VNet
* Requests to VNet Resources:
  - Delete VNet
* Requests to Volume Resources:
  - Delete Volume
|
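The polling option described under (3) could be sketched on the client side roughly as follows. This is only an illustration under stated assumptions: the "progress" field (0 to 100) is taken from the discussion in this thread, while `poll_status`, the injected `fetch` callable, the fixed polling interval, and the `/status/1` URI are hypothetical and not part of the proposal.

```python
import time
from typing import Callable, Dict, Iterable


def poll_status(fetch: Callable[[str], Dict], status_uri: str,
                interval: float = 1.0, max_polls: int = 60,
                sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll the status URI until progress reaches 100.

    `fetch` is a hypothetical callable that GETs a URI and returns the
    decoded Status representation as a dict.
    """
    for _ in range(max_polls):
        status = fetch(status_uri)
        if status.get("progress", 0) >= 100:
            return status          # operation finished
        sleep(interval)            # wait before the next poll
    raise TimeoutError(f"operation at {status_uri} did not complete")


def make_fake_fetch(progress_values: Iterable[int]) -> Callable[[str], Dict]:
    """Stand-in for a real HTTP GET, so the sketch runs without a server."""
    values = iter(progress_values)

    def fetch(uri: str) -> Dict:
        return {"uri": uri, "progress": next(values)}

    return fetch


if __name__ == "__main__":
    final = poll_status(make_fake_fetch([0, 40, 100]), "/status/1",
                        sleep=lambda seconds: None)
    print(final)
```

Injecting `fetch` and `sleep` keeps the loop independent of any particular HTTP library; a real client would pass a function that issues the GET and decodes the JSON body.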
PROPOSAL: Handling Asynchronous Operation Requests
Replies: 10 - Last Post: July 16, 2009 18:57
by: MarcHadley
Tim Bray
|
Posted: June 15, 2009 23:12 by Tim Bray
|
This all seems sane; I have a couple of specific suggestions.
|
Craig McClanahan
|
Posted: June 16, 2009 00:22 by Craig McClanahan
|
|
Regarding your points in order:
* I think a "progress" value of zero accomplishes what you are looking for, because it essentially indicates progress has started.
* I am OK with dumping web hooks for this time around.
* Allowing the server any MAY latitude at all (on whether to return a status message or a different representation) actually does complicate life for a client library that wants to present an O-O view of the API. Simplest thing from the client perspective would be to require the server to *always* return a status resource, with progress set to 100 if it turns out to have been synchronous. |
Tim Bray
|
Posted: June 16, 2009 18:40 by Tim Bray
|
|
We're in sync on web-hooks. I agree with the simplification on non-GET - specify that it always has to be a 202 with a Status body. It means we lose the nice RESTian 201, but that seems in good sync with the reality: whether these operations are synchronous or not just isn't predictable; so the best model is asynchrony by default. I still have heartburn about using "progress":0 this way. It seems there are two different semantics at work here. It could mean "I just got started, put up that progress bar and check back regularly" or it could mean "I sent a request to an exterior service with unpredictable latency, don't make any promises". Here are some alternatives:
|
Tim Bray
|
Posted: June 22, 2009 18:30 by Tim Bray
|
|
Now that I'm implementing this, I'm actually starting to understand it. I have two specific issues. First, the proposal says "Each accepted asynchronous operation will receive a unique status URI, so that multiple operations may be initiated and tracked at once." This is hard to implement, and I'm not sure it really works. Imagine three different clients operating in parallel; two issue "connect" operations against some VM and one issues a "disconnect" operation. It's going to be a lot of work to sequence and identify these, and furthermore, this seems to encourage a bad practice. I've been guilty of more or less ignoring concurrency issues (because multiple operations in parallel on the same VDC component seemed like a vanishingly small use case) but if we're going to support these, the right way would be to require ETags support and thus rule out lost updates. In any case, it's easy to understand a status resource along the lines of "Does this VM exist yet?" or "Is this network attached yet?" but I'm not sure the unique-status-resource-per-op is anywhere near an 80/20 point in terms of cost-benefit. Second, the status resource needs another field for the (very common) case where you're creating something. Back in the synchronous days, this was no problem; you responded to the POST with a 201 and a Location header for whatever it was you'd created plus the (redundant, strictly speaking) URI in the body of the returned representation. Now, when I create something and get a status resource to poll, I need a way to find out, once it records that the operation is completed, what the URI of the new resource is. A naive suggestion is simply to have two fields, "status_uri" and "target_uri", the first being the typical "uri" field and the second to be used for things like new-resource URIs. |
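A status representation carrying both fields might be handled by a client along these lines. This is a sketch only: "status_uri", "target_uri" and "progress" are the field names floated in this thread, while the `Status` class, `parse_status`, `created_resource_uri` and the example URIs are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Status:
    status_uri: str                    # where to poll for updates
    progress: int                      # 0..100, as discussed in this thread
    target_uri: Optional[str] = None   # URI of the created resource, if any

    @property
    def done(self) -> bool:
        return self.progress >= 100


def parse_status(body: str) -> Status:
    """Decode a JSON status body into a Status object."""
    data = json.loads(body)
    return Status(status_uri=data["status_uri"],
                  progress=int(data.get("progress", 0)),
                  target_uri=data.get("target_uri"))


def created_resource_uri(status: Status) -> Optional[str]:
    """Return the new resource's URI, but only once the operation is done."""
    return status.target_uri if status.done else None


if __name__ == "__main__":
    body = '{"status_uri": "/status/7", "progress": 100, "target_uri": "/vms/42"}'
    print(created_resource_uri(parse_status(body)))
```

With this shape, a create operation plays the role of the old 201 plus Location header: the client polls "status_uri" and reads "target_uri" once progress reaches 100.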
Craig McClanahan
|
Posted: June 22, 2009 20:14 by Craig McClanahan
|
|
As Kenai forums do not have very good interspersed reply capabilities yet (although the outlook is promising on the RFE that I filed), I'm pulling out just bullet point topics and responding to them. Regarding multiple operations, from the client perspective I think this is a MUST HAVE. Consider a scenario where I've got 20 webserver VMs deployed, but only 10 of them started at the moment to handle the current load. Suddenly, my service gets slashdotted (or, I guess, nowadays it would be twitterdotted), and I need five more webserver clients started *right now* to handle the load. I'm definitely going to want them started simultaneously (which the back end had *better* be able to support), not one at a time. Regarding multiple operations on the *same* VM instance, I agree this should not be allowed -- perhaps return a 409 conflict. However, the likelihood of this situation actually occurring is dramatically lessened if the server would only return "controller" operation URIs for state changes that are valid from the current state, AND you "expire" an operation URI once the operation has been completed (so it can't accidentally get tried again later). Side note -- we might actually want a "cancel" operation to abort something that is in flight. Regarding lost updates, I see only the following scenarios: (a) client didn't get the initial status response, so retries the POST but must be ready for a 409 if the server actually started this operation, and (b) client got the initial status response and fails to GET a status update successfully; retrying the GET works (per the HTTP spec) with no side effects. I guess I don't see a problem here? I agree that we need both URIs, and am fine with "status_uri" and "target_uri" as the field names. Regarding "hard to implement", I'm being an advocate for the client here, so it's a bit difficult to be sympathetic.
However, you're pretty much certain to need some persistent storage to track in flight operations (both to assign them unique URIs and to detect conflicts) -- but a simple table with a row per in flight operation seems pretty straightforward to me. Base the status_uri value on the auto-generated primary key of this table, and make the row go away when the operation is completed (successfully or unsuccessfully). This approach even survives a restart of the bridge server, as long as your database doesn't get munched. |
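The table-per-in-flight-operation idea could look something like the sketch below, using SQLite from the Python standard library. The table name, column names, the `/status/<id>` URI layout and the `Conflict` exception (standing in for an HTTP 409) are all assumptions made for illustration; the UNIQUE constraint is one possible way to reject a second operation on the same resource.

```python
import sqlite3


class Conflict(Exception):
    """Would map to an HTTP 409 response in a real server."""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    # One row per in-flight operation. UNIQUE(resource_uri) rejects a second
    # operation on the same resource; AUTOINCREMENT keeps ids from being
    # reused after rows are deleted, so status URIs stay unique.
    db.execute("""
        CREATE TABLE IF NOT EXISTS in_flight (
            id           INTEGER PRIMARY KEY AUTOINCREMENT,
            resource_uri TEXT NOT NULL UNIQUE,
            operation    TEXT NOT NULL,
            progress     INTEGER NOT NULL DEFAULT 0
        )""")
    return db


def begin_operation(db: sqlite3.Connection, resource_uri: str,
                    operation: str) -> str:
    """Record a new operation and return its status URI."""
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO in_flight (resource_uri, operation) VALUES (?, ?)",
                (resource_uri, operation))
    except sqlite3.IntegrityError:
        raise Conflict(resource_uri) from None
    return f"/status/{cur.lastrowid}"


def finish_operation(db: sqlite3.Connection, status_uri: str) -> None:
    """Drop the row once the operation completes, successfully or not."""
    op_id = int(status_uri.rsplit("/", 1)[1])
    with db:
        db.execute("DELETE FROM in_flight WHERE id = ?", (op_id,))


if __name__ == "__main__":
    db = open_store()
    print(begin_operation(db, "/vms/1", "start"))
    print(begin_operation(db, "/vms/2", "start"))
```

Passing a file path instead of ":memory:" to `open_store` gives the restart-survival property mentioned above, since the rows then persist across server restarts.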
Craig McClanahan
|
Posted: July 16, 2009 18:32 by Craig McClanahan
|
|
Received a reply from Marc Hadley via email, who was having problems posting it so I'm forwarding it here:
==================================================================================
The topic reminded me of a proposal for async SOAP request-response I wrote back when WS-Addressing was all the rage: http://lists.w3.org/Archives/Public/public-ws-async-tf/2005Feb/0005.html I think this is pretty similar to what you are suggesting, the main difference being the use of an initial 303 to redirect the client to the status resource and the inclusion of a Retry-After header to give the client a rough estimate of how long to wait before the first poll. I was wondering if you considered this approach instead of the initial 202 and accompanying entity body?
Marc. |
Craig McClanahan
|
Posted: July 16, 2009 18:42 by Craig McClanahan
|
|
Hi Marc, A couple of thoughts on the approach you outlined in the message at the specified link.
* We don't have to worry about modifying SOAP (thank goodness) since we're defining the initial API here.
* This seems to make the client do more work (receive a 303, parse the Location header, do an extra GET) to get the initial status, and apparently every other one as well.
* In some implementations of the cloud API, particular operations might actually be synchronous ... meaning you'd get the initial status representation back with progress already set to 100. Nothing extra for the client to do in that case.
* You don't specify it in your example, but does the status message include the URI of itself? I would want that in order to be consistent with the rest of the Cloud API, where you never have to worry about URIs that are not visible in the representations.
* Retry-After is interesting, and could be incorporated into the proposed approach if we liked it, but I'm not sure how valuable it really is.
Craig |
MarcHadley
|
Posted: July 16, 2009 18:57 by MarcHadley
|
|
(I figured out why I couldn't post, I had to bookmark the project first) Taking your points in order:
- Amen to that
- There is an extra GET but only after the specified Retry-After. I think the cost of parsing the Location from the HTTP header is probably lower than extracting it from the response body since you'll be parsing the headers in either case. Also, some HTTP libraries can follow the redirect automatically so it looks like one request to a client.
- If the operation is synchronous then you could just return 200 OK or 204 No Content.
- Yes, the status, if not complete, includes Location and Retry-After headers.
- I think the utility of Retry-After depends on how well you can predict the duration of the operation. It could certainly help to cut down the frequency of polls by an eager client.
Marc. |
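For the Retry-After variant discussed above, a client could turn the header into a wait time roughly as follows. HTTP allows Retry-After to be either a number of seconds or an HTTP-date; the `retry_delay` helper and its 5-second fallback are assumptions for illustration, not something either proposal specifies.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(header_value: Optional[str],
                now: Optional[datetime] = None,
                default: float = 5.0) -> float:
    """Return how many seconds to wait before the next poll.

    Falls back to `default` when the header is missing or unparseable.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():                      # delta-seconds form, e.g. "30"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:                  # treat a naive date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


if __name__ == "__main__":
    print(retry_delay("30"))
    print(retry_delay(None))
```

A polling loop would sleep for `retry_delay(...)` seconds after each incomplete status response instead of using a fixed interval, which is where the saving in poll frequency would come from.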
jchalupa
|
Posted: June 10, 2009 14:03 by jchalupa
|
|
Nice proposal! Thanks for putting it together. Just very minor comments:
|
Sam Johnston
|
Posted: July 03, 2009 11:55 by Sam Johnston
|
|
This is the direction I've been pushing for OCCI too... the actuators seem elegant at first but once you start thinking about things like abandoning requests in process, monitoring progress and asynchronous events in general a "request" resource makes a lot more sense. The example I use is a backup which may not start until midnight and may take 12 hours from then to complete. Anyway this is an extract from my post "Is HTTP the HTTP of cloud computing?" (http://samj.net/2009/05/is-http-http-of-cloud-computing.html) back in May - I haven't fully codified it yet but it will likely look something like what you guys are doing (perhaps dropping stuff like the target_uri field in favour of a single Location: header with content negotiation): RESTful State Machines Something else which has not sat well with me until I spent the weekend ingesting RESTful Web Services book (by Leonard Richardson and Sam Ruby) was the "actuator" concept we picked up from the Sun Cloud APIs. This breaks away from RESTful principles by exposing an RPC-style API for triggering state changes (e.g. start, stop, restart). Granted it's an improvement on the alternative (GETting a resource and PUTting it back with an updated state) as Tim Bray explains in RESTful Casuistry (to which Roy Fielding and Bill de hÓra also responded), but it still "feels funky". Sure it doesn't make any sense to try to "force" a monitored status to some other value (for example setting a "state" attribute to "running"), especially when we can't be sure that's the state we'll get to (maybe there will be an error or the transition will be dependent on some outcome over which we have no control). Similarly it doesn't make much sense to treat states as nouns, for example adding a "running" state to a collection of states (even if a resource can be "running" and "backing up" concurrently). But is using URLs as "buttons" representing verbs/transitions the best answer? 
What makes more sense [to me] is to request a transition and check back for updates (e.g. by polling or HTTP server push). If it's RESTful to POST comments to an article (which in addition to its own contents acts as a collection of zero or more comments) then POSTing a request to change state to a [sub]resource also makes sense. As a bonus these can be parametrised (for example a "resize" request can be accompanied with a "size" parameter and a "stop" request sent with clarification as to whether an "ACPI Off" or "Pull Cord" is required). Transitions that take a while, like "format" on a storage resource, can simply return HTTP 202 Accepted so we've got support for asynchronous actions as well - indeed some requests (e.g. "backup") may not even be started immediately. We may also want to consider using something like Post Once Exactly (POE) to ensure that requests like "restart" aren't executed repeatedly and that we can cancel requests that the system hasn't had a chance to deal with yet. Exactly how this should look in terms of URL layout I'm not sure (perhaps http://example.com/<resource>/requests) but being able to enumerate the possible actions as well as acceptable parameters (e.g. an enum for variations on "stop" or a range for "resize") would be particularly useful for clients. |