Add pgrst.accept setting to RPC #1582
Conversation
* move json ops tests to JsonOperatorSpec
* move phfts operator tests to QuerySpec and RpcSpec
* add custom spec for pre-request header guc tests
* Add tests for browsers img and navigation requests
Love it! Couldn't find a test-case for it - maybe it's worth adding: what happens when you do …? Another thing is wildcard matching: I agree with you that wildcards in the request can be a later addition. However, I think we really need to allow wildcards in …
```haskell
  ++ rawContentTypes
  ++ [CTOpenAPI | tpIsRootSpec target]
ActionInvoke _ -> case procAccept of
  Just acc -> [CTOther $ toS acc]
```
This looks like it could easily be extended to allow multiple media types, e.g. like `pgrst.accept='image/png,image/jpeg'` - would just need to convert this to a list of multiple CTOther?
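For illustration, a rough sketch of how that multi-type setting could look from the SQL side, assuming the setting is attached to the function as a configuration parameter (the function name is borrowed from a later comment in this thread; the exact syntax is not settled here):

```sql
-- Hypothetical: advertise two acceptable media types for one RPC.
-- PostgREST would split the value on ',' into a list of CTOther entries.
ALTER FUNCTION ret_image(text) SET pgrst.accept = 'image/png,image/jpeg';
```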
@wolfgangwalther Thanks! I've added some test cases for default and unknown types.
Oh, that's a new one. On the single mimetype case I've added now, we're in fact accepting … (I got confused here with the …). So with …
How would the user decide which mimetype to set in the body? For example, if Chrome sends … One thing that is great about being static and specific on the mimetype (single or list) is that it would allow us to document them on our OpenAPI output (check openapi mime types). Wildcards and setting the content type on the function body wouldn't allow us to do that.
But that's the other way around from what I suggested, correct? So client requests
I think we should keep it consistent: the single mime type case is just the simplest case of content negotiation (see below) - it's just a yes/no decision. But since it's still content negotiation, it should be …
Yes. But a simple multiple mimetype case to start with, because for content negotiation you don't even need to make any decision.
This is the "Content Negotiation" part - this should be done in postgrest, not in the RPC. If we could extend that in the future, so that postgrest somehow tells the RPC (probably via GUC) on which of the possible mimetypes it has acted, that would be great. Example:
The multiple mimetypes I referred to are entirely based on content: I am keeping files in the database. That could be images, pdf, ... whatever the user uploads. So once a specific file is requested I know exactly which mimetype to return (the type of the file). But I need to act on any accept header. I guess right now I could rely on the browser always sending
I agree that would be great to put in the OpenAPI output. I don't think we should disallow overriding the content-type on the function body, however. Now if my function body overrides the This seems to be allowed in OpenAPI 3.x:
|
I agree, it would be a matter of being clear about the single mime case on the docs.
Great idea! In fact, the parseHttpAccept already orders (q params included) the mimes in a client Accept header. So, right now we have these cases:
How about if we handle the 3 pending cases in this way:
I've been checking how to do this in the code and it looks feasible. Also, it would have the nice side effect of rejecting an invalid Accept without starting an empty commit on the db (currently happening). @wolfgangwalther What do you think?
Is the "wildcard" you mention here just about a full wildcard or does it extend to "half" wildcards like Negotiation
So, just for clarification, the negotiation algorithm you're suggesting would look like this?
Alternative approach

If …
This would be a bit like the algorithm that Apache uses. Since this algorithm would be nicely encapsulated in a function anyway, I think it would not make the overall design more complex (just the negotiation algorithm of course :D).

Return value

I think in any case we can do better than returning either always …
Content-Type
+1. That makes a lot of sense - RPC functions should not be concerned about the returned content-type being
What about:
I think the first one would be a sane default, that would simplify a lot of RPCs, especially those that just accept a single specific mimetype. |
I just found this: https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions It reads:
That sounds like a show-stopper for the whole `pgrst.accept` approach. What we need is another way of providing custom PostgREST options for individual SQL objects. Finding a more general solution here would also allow us to extend the concept from RPCs to tables. Maybe something like a config option … The table schema could be as simple as:

```sql
CREATE TABLE pgrst.config (
  oid    OID,
  config JSON
);
```

in which one could insert:

```sql
INSERT INTO pgrst.config VALUES ('ret_image(text)'::regprocedure, '{"accept":"image/png"}');
```

Not sure if the oid approach is the best. It has the advantage that renaming database objects is possible. Not sure what happens on … Another approach is to have …
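As a usage sketch, the lookup PostgREST's schema cache could run against such a table might be as simple as this (a hypothetical query, only meant to show how the oid ties back to a function):

```sql
-- Resolve each configured oid back to a readable function signature and read its accept option.
SELECT c.oid::regprocedure  AS function,
       c.config ->> 'accept' AS accept
FROM pgrst.config c;
```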
Hmm.. I guess we'd have to measure that to see if the perf we lose is considerable. But really the main use case for … I think the … So since flexibility was the original motivation and not max performance, maybe the inlining issue is not an actual show-stopper. Still, I'd be interested to clarify the real loss. I see things like "The rules set out here are believed to be correct for pg versions between 8.4 and 9.5" on the wiki, and it looks like the info there is not conclusive. (The wiki page is a bit complicated :O.. will need to revisit that later.)
Forgot to mention that I checked the source. Unchanged in terms of the
Without inlining all the filters and limits etc. will be applied on the materialized result of the function call. This means, especially with big amounts of data like files, that the filters have to be re-implemented in the function itself. In most cases, this will be a simple PK lookup from a function argument, in that case it will not affect performance, that's true. But using any of the query syntax for filtering or using limit and offset will quickly not be possible anymore. Re-implementing any of the filter behaviour will be much more complex than the config table :/ I am not sure how this affects performance for resource embedding. We should be able to run a few tests with the current state of this PR, right? Eh... we would need a branch for comparison that allows inlining properly. Currently it's blocked in general, because of some other conditions not being met. So maybe we should first make sure that we have queries that allow inlining for regular RPCs - could give a performance improvement as well. And once we have that, we can compare this in the specific use-case of |
See #1652 for a specific case where the consequences for performance of broken inlining are shown. |
TLDR: I'm proposing a solution that will:
This solution is a bit of work to implement - but once we get this done, this should be a major improvement and really useful. Before going into detail about what we should do, I will outline the inlining problem again, with a couple of examples that show why we need to support it for performance.

Inlining

What is inlining? Why do we need to make sure that our RPC function calls can be inlined? When a function call is inlined, the function body is put directly into the main query, and the whole thing is planned as one big query. Here's an example:

```sql
CREATE FUNCTION search_client (query TEXT) RETURNS SETOF clients
STABLE LANGUAGE SQL AS $$
  SELECT * FROM clients WHERE name LIKE '%' || query || '%'
$$;

SELECT * FROM search_client('john doe') LIMIT 10;
```

Let's ignore for a second that, to my knowledge, it's not possible to add an index that would cover this kind of LIKE filter. If the function call is inlined, the query will be treated as:

```sql
SELECT * FROM (
  SELECT * FROM clients WHERE name LIKE '%' || 'john doe' || '%'
) AS search_client LIMIT 10;
```

This can be optimized as one query - so the LIMIT can be applied while scanning. Now assume that the query can not be inlined. One way to do this would be to remove the STABLE marker. See #1652 for a case where exactly this is happening. See https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions for all the conditions that must be satisfied for inlining.

Why do we need inlining for RPCs returning a custom mimetype? Of course there are use-cases where inlining is not important. This is the case when the resultset inside the function call is not further reduced by other parameters. That basically means: without inlining, all the query parameters for filtering and limiting will come with a performance hit for RPCs. However, if we pass in a PK column via function argument and query for exactly 1 row, inlining does not matter. Here are some examples that could use custom mimetypes / accept header, but need query parameters for filtering / limiting:
For performance, we absolutely need to support inlining most of the time.
|
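To make the show-stopper from the earlier comments concrete: one of the documented inlining conditions is that the function has no configuration parameters (`SET` clauses) attached to it. So, assuming `pgrst.accept` is stored as a function `SET` option, a sketch of the failure mode would be:

```sql
-- Same body as search_client above, but with a per-function setting attached.
-- The SET clause ends up in pg_proc.proconfig, which disqualifies the function from
-- inlining: the planner materializes every matching row before applying the outer LIMIT.
CREATE FUNCTION search_client_accept (query TEXT) RETURNS SETOF clients
STABLE LANGUAGE SQL
SET pgrst.accept = 'text/csv'
AS $$
  SELECT * FROM clients WHERE name LIKE '%' || query || '%'
$$;

SELECT * FROM search_client_accept('john doe') LIMIT 10;  -- LIMIT applied only after the full scan
```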
A couple more notes:
It makes sense to implement Regarding inlining of RPCs in general: In the majority of cases, this is not possible right now, with the way the queries are constructed. So allowing inlining can improve performance in the future, but this PR will not lead to a degradation in performance given the way we call RPCs right now. Therefore, I conclude that the approach in this PR can be continued as-is, without lowering our chances of implementing faster options on top of that. Once we have all the pieces in place, it will be mostly a documentation issue, to tell people how to write RPCs with |
@wolfgangwalther The computed columns approach is genius! 💯 🥇 💥 🤯 So simple and yet so powerful 💪. Couldn't think of a way to also make it work with an RPC returning a scalar. But as you mention, we can be clear about when not to do this in the docs. So I'll revisit and finish this one later. I'll let you implement your great idea with computed columns on another PR. Once we finish these we can call #1548 solved.
Arghhhh :D |
😆 I thought you might like to get full credit for that one. But I'll help in reviewing :D |
Just another idea here, so we don't forget.
We can have a true generic solution by implementing the same for custom aggregate functions. Once we find an aggregate that has a finalfunc with |
The more I think about it, the more I think we should keep this simple and limit it to supporting aggregate functions only. No … All other approaches have serious drawbacks:
This could be solved by using aggregate functions:
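For instance, a minimal sketch of such an aggregate could look like this (table and column names are illustrative; how PostgREST would discover the aggregate and tie it to a media type is exactly what is being discussed here):

```sql
-- Hypothetical: fold rows of `clients` into one CSV document.
CREATE FUNCTION csv_transfn(state text, item clients) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
  SELECT coalesce(state || E'\n', '') || format('%s,%s', item.id, item.name)
$$;

CREATE AGGREGATE csv_agg(clients) (
  SFUNC = csv_transfn,
  STYPE = text
);

-- PostgREST could then wrap its usual query with the aggregate:
SELECT csv_agg(c) FROM clients c WHERE name LIKE '%john%';
```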
This approach does support everything we need:
Another benefit is that we don't need the whole "select one output column only" magic. Even from a table you will be able to just set your |
One advantage of not implementing
Those functions would then have to take |
If you really want to implement one specific type only (not sure how much that would simplify the implementation, while still making progress towards the ultimate goal), then I would suggest you don't add html. I think Next step would be doing the same for xml. This way we'd have a basic feature, with which we can test the general concepts without introducing breaking changes that don't have an alternative anymore. Removing csv and xml support, but only providing html instead... would break usability for existing use-cases and could be a deal-breaker. |
I do agree with the idea of reducing boilerplate and in fact I got closer to your point of view as I wrote the rest of that response. I like the conceptual simplicity of an
You are correct! Just tested this on Google SQL. Could have sworn I had tested this, my bad.
Yes, we're thinking about the same thing here. Another way to describe it is that the dev provides a set in their schema,
Right, these proposals are quite close in fact. I think these are the differences:
Apologies for how long this got, but here's a deeper analysis: another way to describe this whole thing is that content-negotiation is just a search through a 2D space. The points in the space are "content types I can serve", and we place them such that the X and Y axis are client and developer preference respectively, 0 being the most preferred. At first look it seems like we can't give a definitive answer for the correct solution by just saying "pick the point the shortest distance from the origin", because there may be multiple solutions at the same coordinate, and you may have equidistant solutions forming an arc. In this case we're choosing to minimise X because that's what the … So that leaves the problem that there could be two Content Types on the same coordinates. Both your solution and mine resolve this by making … The main difference, I think, is that data reps rely on this "chain of transformations" thing for simplicity and developer expressiveness. It's hard to make a simple example, but I'll give it a go. Suppose we have the following scenario:
My solution (data reps):
In this example, both solutions aim to find the best match between the client's requested MIME types and the available content types provided by the developer, but the outcome is different. Let me know if I misunderstood anything about your algorithm. If I did not, I propose that the data rep based solution finds a better solution with less ambiguity at each step. |
@aljungberg Thanks for explaining your approach in depth again - I have a much better understanding of it now. Just some random thoughts on your post at first.
Aggregates are not only required for performance, but also to allow certain formats in the first place. We can't take
I think I begin to see that your approach will end up being simpler to use.
Thanks a lot for that, it helped my understanding a lot!
After the first two sentences I was like "yeah and then we have multiple solutions at the same coordinate".. and then you just said that in your next sentence. This line of thought resonates very well with me.
It worked for me. I now understand what you meant by chain of transformations. In fact, one of the built-in handlers relies on this concept.
I see how this is much better for performance compared to my solution, because resolving the priorities happens during schema cache reload, not when serving the request.
This is the key thing that I am not sure about. Is shorter always better? Am I able to express my intent as a developer this way? Following up this post I will try to put some actual real world cases that I have right now into code and try to test the concept.
The xmlize part is really nice - not only allowing transformations between source data types and mimetypes.. but also "lateral" transformations between different mimetypes. That's the core of what you call "data representations". Elegant.
AFAIK the order of mimetypes in the
I didn't go through my own algorithm here in detail, but I'd assume that for both algorithms the same intent would just have to be expressed slightly differently. I assume it would be possible to achieve the same result, although maybe not as elegant as in your case with
I'm not sure whether there is a misunderstanding for one of us. I think we both haven't been 100% explicit about this, but after reading your post, I am left with the impression that there is one key difference in how both of us would use …
Let's try. I want an API to return only CSV:

```sql
-- should this be required/forced by PostgREST to use BYTEA? Does it ever make sense to really
-- use TEXT for `*/*`? BYTEA seems to be the only built-in type that can actually represent all
-- mimetypes...
CREATE DOMAIN "*/*" AS TEXT;

CREATE DOMAIN "text/csv" AS TEXT;
-- does this mean text/csv is now my default return type?
CREATE CAST ("text/csv" AS "*/*") WITH INOUT;

CREATE AGGREGATE to_csv(any) RETURNS "text/csv";
```

With this, I would have the following paths available for a … request:
In this case, the second path would be chosen, because it's shorter. But that seems more like a coincidence, because the default path just happens to have the Assuming we have a two-step process for CSV, too:
How would I now tell PostgREST that |
Trying @aljungberg's transformation chains suggestion.

Example 1a: Generic files table with download function.

```sql
CREATE TABLE files (
  PRIMARY KEY (file),
  file UUID GENERATED ALWAYS AS (md5(data)::UUID) STORED,
  type TEXT GENERATED ALWAYS AS (byteamagic_mime(substr(data, 0, 4100))) STORED,
  data BYTEA NOT NULL,
  filename TEXT NOT NULL
);

CREATE DOMAIN "*/*" AS BYTEA;

CREATE FUNCTION download(file UUID) RETURNS "*/*"
STABLE
LANGUAGE plpgsql AS $$
BEGIN
  PERFORM
    set_config('response.headers', Jsonb_build_object('Content-Type', type)::TEXT, TRUE)
  FROM files
  WHERE files.file = download.file;

  RETURN data FROM files WHERE files.file = download.file;
END$$;
```

For a request like the following:

```
GET /download?file=<md5> HTTP/1.1
Accept: */*
```

There is only one path available, I think: …

Example 1b: Extending example 1 to return multiple files in a zip archive

(Haven't implemented it like this, I'm using ….)

```sql
CREATE DOMAIN "application/zip" AS BYTEA;

CREATE AGGREGATE to_zip(files) RETURNS "application/zip";
-- pseudo: state transition function creates a zip file from each `$1.data` using `$1.filename`.
```

This should allow me to do both of these:

```
GET /files?select=file,type,filename HTTP/1.1
Accept: application/json, */*
```

```
GET /files?filename=like.*important* HTTP/1.1
Accept: application/zip
```

But this is rather inconvenient in most cases, because I'd request the file list via javascript, where I can add the accept header, and download the file directly via URL, where I can't make the browser change the header. So it would be better if, for the files endpoint only, the default would be to return the zip archive when requesting `*/*`. Let's try:

```sql
CREATE CAST ("application/zip" AS "*/*") WITH INOUT;
```

What would this do? In the current example this could work, because … Instead, we could try to change the AGGREGATE above to:

```sql
CREATE AGGREGATE to_zip(files) RETURNS "*/*";
```

We don't need to create the … Now, the …
The first one is shorter and will be taken. Assuming we really had the … For the … I like this, so far.

Example 2: Provide a report in json, csv and excel

```sql
CREATE FUNCTION report(<filter arguments>) RETURNS TABLE (
  who    NAME,
  what   TEXT,
  "when" TIMESTAMPTZ
) ...;

CREATE DOMAIN "text/csv" AS TEXT;
CREATE AGGREGATE to_csv(anycompatible) RETURNS "text/csv" ...;

CREATE DOMAIN "application/vnd.ms-excel" AS BYTEA;
CREATE AGGREGATE to_xlsx(anycompatible) RETURNS "application/vnd.ms-excel" ...;
```

Short of implementing those aggregates, this should be simple and produce the desired result. This should even work when using … Hm, my examples so far seem to be too simple. I think they would work equally well with both approaches?
Edit: sorry, didn't see your second comment until I already posted this. Hopefully the below is still on point. Thanks for taking the time to try to find counterexamples! That's key to confirming this can work.
Fair enough, I didn't know functions can't take record sets as input. Is that even if you explicitly define the …?

Throwing aggregation functions into the mix adds a bit of extra work in the discovery of the mappings and how to output the result, but luckily the broad idea remains the same. All we need to know is our …
Okay, I think my longer explanation will be more helpful but yes, something like that, although I don't think you should do this for reasons I'll state below. But answering your question as stated: with data reps this cast would add a mapping of
Right so there are two mechanisms a developer can use to promote an alternative resolution. The first one you already mentioned and I believe is the most intuitive: just create casts making a shorter path. Turn Let's say that's not possible, performant or convenient. Then that's where the order of mappings by priority that I mentioned plays in. I realise I didn't fully explain it. Basically, the resolver stops when it finds a solution, any solution, and it visits potential solutions in a certain inherent order: by path length, then by mapping order. So in our example at hand, and I'm going to simplify the path finding algorithm a little here (in reality we'd explore solutions in parallel for performance reasons):
So you trumped the default solution when you did …

Sorry, getting a little late here, so I hope I didn't overlook any of your questions. You are right, there may be edge cases I didn't consider with my … (Disappointed if it's true that the …)

[1]: We need to make sure not to make any of our built-ins be accidental "shortcuts" that skip over types. One step at a time is best, like in …
Yes. One thing to consider while search all possible paths: We need a maximum of one aggregate function in each path. Sometimes, having zero aggregates could work, too - but most of the time we need exactly one aggregate.
I couldn't find those reasons. What exactly should I not do and why?
That's certainly wanted. In my example 1a in my other comment above, I would still want the
This makes it impossible to "bump" longer paths before shorter paths by manipulating anything remotely similar to "priorities" or "weights", right? My gut feeling is that it's important to consider path length, because otherwise the number of possible solutions would just grow and it could lead to some strange solutions that the developer didn't really intend. However, at the same time, I feel like for most regular scenarios you'd be working in a range of maybe 1 to 4 nodes in a path - and it's not immediately clear to me that in this space, shorter paths are always better than longer paths. But I haven't found a counterexample yet, so maybe it just really works.
Thanks, that gave me a much better idea where you'd consider the order of the mappings. I will try and see to come up with examples where this is useful and find cases where we'd need to manipulate that order.
Correct, I think in practice that's what others do, anyway.
So, you're saying we actually have … (I just learned that …) Speaking of that, to maximize composability, we should not have this full path as a built-in - but rather each edge separately:
This would allow a developer to replace just a single edge of that chain. Examples:
I can't find a reason why we'd ever need |
Thank you for the thoughtful examples. It's encouraging to see that the model seems to work in the cases given so far.
Yes, I see that now, there are just row types. Regular functions accept only rows or scalars as parameters. Can the …

If you bear with me, let me play with this idea for a moment using a concrete example. Let's say we want to implement a custom table->json representation. Maybe we have a settings table with unique names and, instead of outputting …:

```sql
CREATE TABLE config (
  key VARCHAR PRIMARY KEY,
  value VARCHAR
);

INSERT INTO config (key, value) VALUES
  ('key1', 'value1'),
  ('key2', 'value2'),
  ('key3', 'value3');

CREATE DOMAIN "application/json-dict" AS text;

CREATE OR REPLACE FUNCTION format_config(rows config[])
RETURNS "application/json-dict" AS $$
  WITH config_rows AS (SELECT * FROM unnest(rows))
  SELECT coalesce(json_object_agg(src.key, src.value), '{}')::varchar FROM config_rows AS src
$$ LANGUAGE SQL;

CREATE CAST (config[] AS "application/json-dict") WITH FUNCTION format_config AS IMPLICIT;

-- Request comes in with `Accept: application/json-dict`
WITH pgrst_source AS (SELECT ARRAY(SELECT (key, value)::config FROM config) AS "inner")
SELECT format_config(_postgrest_t."inner")
FROM (SELECT "inner" FROM "pgrst_source") _postgrest_t;
-- { "key1" : "value1", "key2" : "value2", "key3" : "value3" }
```

So that works for my example, and now we have a nearly fully generalised top level query, which is kind of cool! Sets and scalars are the same at the highest level. And now we have a clean, obvious way for the developer to say "When asked to make REP out of THING[], call function x". And it's the same way as for … Although even while writing this I got doubts.
In the pursuit of simplicity in one aspect, maybe I'm actually making other things more complex and potentially degrading performance at the same time. Hmm, yes, maybe we just press forward with aggregates despite all the time I took to write this up.
Yes that would be the case. I guess in theory you could have a mapping that turned an intermediate result back into a
Yes, sorry, it ran a little late when I was writing that and I never really finished the thought. I wanted to say that I think you generally want the types you return to be as specific as possible for the matching algorithm to find the best fit. Like you said, in our We can use our generalisation rules to select a more specific type like The solution to this problem may be to have the dev make So this kind of combines static and dynamic routing, but only when explicitly described as possible by the developer, so we don't burn CPU speculatively invoking methods that can never make acceptable content.
Yep.
Right, shorter paths always win. But you can shorten your own paths if need be. Or just change your original return type and cast specifically on that. So I think in practice this apparent inflexibility in the scheme won't matter.
Yes, the intuition for why shorter paths are "right" is because you get closer to the type actually requested. Fewer transformations is better. We'll definitely want to set a maximum path length to prevent pathological behaviour, and perhaps that number is 4. As a dev using PostgREST, maybe a longer path length would allow you to implement a cleaner, more decomposed solution, but there's diminishing returns. And this limit only restricts your implementation, not outcome. You can still create an infinite number of domain types to effect any necessary transformation chain.
Yes exactly, I think this makes it both easy to understand and easier to tinker with for the dev. Just as you go on to say below.
Yes so I guess another way to conceptualise this is that since our generalisation rules go step by step, as a developer you select how "wide" your customisation is by where you insert your casts. Casting from
Hmm yes, you might be right. When doing data reps we chose (it was with you I discussed it, I think) to always require …

What does PostgREST actually produce?

I know this discussion has become very wide and deep already, but just one more thought that occurred to me: do we really output JSON by default or do we output text? Our top level expression for the output is … Like we touched upon binary JSON like … So how do we choose our top level cast once we're done with our transformation chain? Is there an implicit transformation somewhere in …?
I think that's safe to remove now and can be ignored for this feature. It was probably added to support an older libpq or postgres version (likely related to failed Hasql decoding). Just to elaborate:

postgrest/src/PostgREST/Query/SqlFragment.hs Lines 186 to 189 in 7629eff

On #2701, I experimented with adding a binary aggregation (ref) disregarding the … To avoid confusion, I'll try to remove the … Edit: Done on #2726. Our Hasql decoding is … So I don't think we should change any assumption made for the feature discussed here so far.
This is so awesome, I think I'm starting to grasp the idea.
Right now we don't actually support an So besides the builtin application/vnd.pgrst.object+json
builtin application/geo+json
The builtin text/csv
custom text/html

So, for my use case of a custom text/html:

1 - Without overriding the default
It seems it could also be done by just one node?
2 - Overriding the default
Or just two nodes?
In both of these cases we wouldn't be touching any other resource because the aggregate only takes the
I really like that, not using the IMPLICIT clause makes this so much cleaner. So far it seems this ticks all the boxes. I think we can start a draft implementation. |
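For reference, a rough sketch of what such a custom `text/html` handler could look like under this proposal; the `projects` table and all function names are made up for illustration:

```sql
CREATE DOMAIN "text/html" AS text;

-- State transition function: append one <li> per row.
CREATE FUNCTION html_transfn(state text, item projects) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
  SELECT coalesce(state, '') || '<li>' || item.name || '</li>'
$$;

-- Final function: wrap the accumulated items and brand the result with the "text/html" domain,
-- which is the node the negotiation path would end on.
CREATE FUNCTION html_finalfn(state text) RETURNS "text/html"
LANGUAGE sql IMMUTABLE AS $$
  SELECT ('<ul>' || coalesce(state, '') || '</ul>')::"text/html"
$$;

CREATE AGGREGATE html_agg(projects) (
  SFUNC     = html_transfn,
  STYPE     = text,
  FINALFUNC = html_finalfn
);
```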
Yes, for every specific content type we can produce using built-in rules, we would add built-in generalisation rules. So in this case, we'd go all the way from As a practical matter, we can choose whether to make these generalisation transforms explicit or implicit. For the latter, we can just assume them in the content negotiator/data rep resolver. But I think the cleanest way is explicit: to literally add these "do nothing" mappings to the data rep type conversion maps. Then we have fewer special cases in the code and debuggability seems better because every step in the path will always be chosen from a list you can print and look at. If using explicit rules we have to take some care so we also generate the same kind of explicit type relaxation rules for user-defined types. So if the user defines (In my view, figuring out what you want versus what you have is 80% of programming, so making things concrete and inspectable is important.)
Oh yeah, perfect. So again I see this change as simplifying by formalising our data transformations. Having a built-in rule that says, "use With this change, I guess in practice we also want to add On that note:
Right so this is a great example of this feature empowering PostgREST users. They want something that works differently than our bundled
Yep. As a developer you can just go straight to it,
So in your example with [1]: Edges in the path algorithm have priorities. User-defined trump built-in. When there are multiple user-defined casts, we should always choose the last one because anything else would be unintuitive. In natural language if I say, "bananas should be represented in the style of Van Gogh" and then later say, "bananas should be represented by jazz improv", you'd assume I changed my mind. Also it fits conceptually. Built-in rules are defined "first", being present at "compile time", and replaced by user-defined rules at "run time". And just in the same way, additional user-defined rules trump what went before.[2] As a practical matter, the way we implement this is that the values in the map from type X is an insertion ordered set which we seed with our built-in rules, then process in reverse order when pathing. E.g. |
Sorry for letting this languish. I had to prioritize other issues due to work. I'm starting implementation on #2825. For now the plan is:
So, all things considered, it seems implementation can be started on the first stage without causing breaking changes on later stages. This way we avoid analysis paralysis while also closing some issues. Seems we can use domain-based media types for …
Closing as this got implemented on #3076 |
Allows setting a custom media type for the response format (`Accept` and `Content-Type` headers) by using the `pgrst.accept` setting on RPC.

Basically, it serves as an escape hatch for when the user wants an `application/msgpack` or an `image/png` response. Example: see the sketch at the end of this description.

Currently it doesn't handle a wildcard media subtype (like `image/*`) but handles the all-media-types wildcard (`*/*`), which is good enough for cases like browser `<img>` requests or right click -> view image requests (check these tests and #1462). `image/*` can be handled in a later enhancement.

Related issues:
* `charset=utf-8` on binary output. Reported on gitter chat.
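For illustration, a hypothetical sketch of the setting described above, assuming it is attached to the function as a configuration parameter (table and column names are made up):

```sql
-- Escape hatch: a single fixed media type for one RPC.
CREATE FUNCTION ret_image(name text) RETURNS bytea
STABLE LANGUAGE sql
SET pgrst.accept = 'image/png'
AS $$
  SELECT img FROM images WHERE images.name = ret_image.name
$$;

-- GET /rpc/ret_image?name=A.png
-- Accept: image/png   (or */*)
-- => 200 OK with Content-Type: image/png and the raw bytea body
```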