(Matt
Fowles is a Senior Software Engineer for StreamBase, whose interests
include programming language, compiler, and virtual machine design.)
As the ponytail (and the tiny bio above) imply, my
posts will focus on technical details such as language semantics,
general programming issues, compiler design, or anything else I feel is
nifty.
In language design, the most powerful and elegant features come from
having simple, consistent semantics. Over several posts, I plan to
explore and expand on this idea through concrete examples encountered
while developing both StreamSQL Text and StreamSQL EventFlow. In this
post, I will explain how adding hierarchical data to our language
forced us to simplify and greatly improve our wildcard rules.
In StreamBase 6.0, we added support for hierarchical data in the
form of a tuple data type. This, among other things, allows us to
reuse common structures in multiple different schemas:
CREATE SCHEMA offer_details (
symbol string,
num_shares double,
share_price double,
user_id string
);
CREATE SCHEMA top_of_book (
best_bid offer_details,
best_ask offer_details
);
CREATE SCHEMA purchase (
seller_id string,
buyer_id string,
details offer_details
);
This, of course, helps clean up the code and makes later
refactoring easier, but data is only useful if it can be easily
transformed and manipulated. Clearly, we are going to want to be able to:
- create nested tuples
- extract data from nested tuples
all with a syntax that people will find intuitive. Fortunately, the
debate over syntax for accessing hierarchical data has long been
settled in the programming language world: simply use . like most other languages (we also already use . to specify the source of a field when it is ambiguous from context). We also provide a tuple function to create nested tuples:
CREATE INPUT STREAM input_purchases purchase;
CREATE STREAM account_transfers;
SELECT
details.num_shares * details.share_price AS amount,
tuple(
seller_id AS destination,
buyer_id AS source
) AS accounts
FROM input_purchases INTO account_transfers;
and that satisfies our original requirements. Of course, it doesn't
provide us with a way to quickly manipulate the entirety of a nested
tuple, nor does it allow us to easily combine two sub-tuples into a
larger one. I find it is usually better from a design standpoint if I
can extend and strengthen an existing concept instead of trying to
create a new one. Luckily, * comes to our rescue. Since
StreamSQL Text already has a simple syntax for accessing multiple
fields at once, we will simply strengthen the wildcard expressions that
we had before.
Let's start with some simple definitions. Consider the following statement:
SELECT f(*) AS prefix_*_postfix FROM in INTO out;
A Wildcard Literal is just the asterisk token, *.
A Wildcard Rule is the entire expression, including both expression and target (in this case f(*) AS prefix_*_postfix).
A Wildcard Expression is the portion of the expression before the AS keyword (in this case f(*) ).
A Wildcard Target is the portion of the expression after the AS keyword (in this case prefix_*_postfix). A wildcard target is a * token with an optional string prefix and postfix. For example, foo*, *bar, and foo*bar are all valid targets.
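To make these definitions concrete, here is a minimal Python sketch (not StreamBase code) that splits a rule at the AS keyword and fills a target in with a field name. The helper names split_rule and apply_target are made up for illustration, and the split is deliberately naive: it assumes the rule contains a single top-level AS.

```python
def split_rule(rule):
    """Split a wildcard rule into (expression, target) at the last ' AS '."""
    expression, separator, target = rule.rpartition(" AS ")
    if not separator:
        raise ValueError(f"not a wildcard rule: {rule!r}")
    return expression.strip(), target.strip()


def apply_target(target, field):
    """Replace the single * in a wildcard target with a field name."""
    if target.count("*") != 1:
        raise ValueError(f"a wildcard target needs exactly one *: {target!r}")
    prefix, postfix = target.split("*")
    return f"{prefix}{field}{postfix}"


expression, target = split_rule("f(*) AS prefix_*_postfix")
print(expression)                      # f(*)
print(target)                          # prefix_*_postfix
print(apply_target(target, "symbol"))  # prefix_symbol_postfix
```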
In cases where the context is clear, I may drop wildcard from the above names and simply refer to them as rule, expression, and target (I will try to take care as expression is a heavily used term). With these definitions out of the way, let's return to the discussion of improving wildcard rules.
Prior to StreamBase 6.0, wildcard rules held a fairly restricted
place in the StreamSQL Text grammar. A wildcard expression could be
either a wildcard literal (i.e. *) or a function with only a wildcard literal argument (e.g. firstval(*)). The ability to call functions on wildcards was, in fact, a
bit of a hack included for passing all the fields in a tuple to
aggregate functions.
SELECT
firstval(*) AS first_*,
lastval(*) AS last_*
FROM in[SIZE 4 TUPLES] INTO out;
While the feature worked with functions other than firstval and lastval,
they were the real reason that it was there. As always, when extending
an existing feature, we should try to simplify and maintain
consistency. Fortunately, the original rules for wildcards (while
simple) were not particularly consistent in terms of fitting into the
StreamBase Expression Language. So the first step is to find a better
definition for Wildcard Expressions that doesn't include any special
cases like "only argument" or "wildcard literal".
A wildcard expression can be any expression in the expression language that uses * as a variable name (possibly in several locations). For example, in.details.*, f(g(*)), and f(*) + g(*) + 2 are all valid wildcard expressions.
Part of the appeal of this definition is that it reduces wildcards
to a case that we already know how to handle, namely variable names.
So there should be relatively little surprise about where it is legal
to put a *.
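As a rough illustration of "* as a variable name", the Python sketch below expands a wildcard expression once per field by substituting the field name for each *. It is only a model of the idea, not how the StreamBase compiler works: the function name expand_expression is hypothetical, and the regular expression is a heuristic I am assuming in order to tell a wildcard apart from a multiplication sign.

```python
import re

# Assumption for this sketch: a * is a wildcard when it appears where a
# variable name could start, i.e. at the very beginning of the expression
# or right after "(", "," or "." (optionally followed by spaces).
# Anything else, such as the * in "a * b", is left alone as multiplication.
WILDCARD = re.compile(r"(^|[(,.])(\s*)\*")


def expand_expression(expression, fields):
    """Return one concrete expression per field name."""
    expanded = []
    for field in fields:
        def substitute(match, name=field):
            # Keep the leading delimiter and spacing, swap * for the name.
            return match.group(1) + match.group(2) + name
        expanded.append(WILDCARD.sub(substitute, expression))
    return expanded


print(expand_expression("f(*) + g(*) + 2", ["x", "y"]))
# ['f(x) + g(x) + 2', 'f(y) + g(y) + 2']
print(expand_expression("in.details.*", ["symbol", "user_id"]))
# ['in.details.symbol', 'in.details.user_id']
```

A real implementation would work on the parsed expression tree rather than on text, which is exactly why treating * as an ordinary variable name is convenient.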
This alone provides us with enough power to flatten nested tuples easily:
CREATE INPUT STREAM input_book top_of_book;
CREATE STREAM flattened_book;
SELECT
best_bid.* AS bid_*,
best_ask.* AS ask_*
FROM input_book INTO flattened_book;
I confess that this is a contrived example, but it is really just for
illustrative purposes. It still doesn't quite allow us to combine two
tuples into a bigger one. Before StreamBase 6.0, wildcard rules were
legal only in the target lists for select statements. Just like
earlier, a simpler rule without any special cases solves our problem:
Wildcard rules are legal anywhere a list of expressions is legal.
Simple, easy to explain; sounds good so far. What does this buy us
that we couldn't do before? Now, wildcard rules can be used to pass
arguments to a function! For example:
CREATE INPUT STREAM foo (x int, y int, z int);
SELECT min(* AS *) AS min FROM foo INTO bar;
expands to
CREATE INPUT STREAM foo (x int, y int, z int);
SELECT min(x AS x, y AS y, z AS z) AS min FROM foo INTO bar;
Alright, that is a fairly contrived example too, but the real reason
we wanted it was to combine two separate sub-tuples into one big one.
SELECT
tuple(
sub1.* AS *,
sub2.* AS *
) AS combined
FROM in INTO out;
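In data terms, that statement takes every field of sub1 and every field of sub2 and places them side by side in one new tuple. A small Python sketch of that effect, using dictionaries as stand-ins for tuples, follows. The field names a, b, and c are invented for the example, and rejecting duplicate field names is my assumption; this post does not say how StreamBase treats a name clash between the two sub-tuples.

```python
def combine(*sub_tuples):
    """Merge the fields of several sub-tuples into one flat tuple."""
    combined = {}
    for sub in sub_tuples:
        for field, value in sub.items():
            if field in combined:
                # Assumed behaviour for this sketch only.
                raise ValueError(f"duplicate field name: {field}")
            combined[field] = value
    return combined


# Hypothetical input row with two nested tuples.
row = {"sub1": {"a": 1, "b": 2}, "sub2": {"c": 3}}

# Mirrors: tuple(sub1.* AS *, sub2.* AS *) AS combined
out = {"combined": combine(row["sub1"], row["sub2"])}
print(out)  # {'combined': {'a': 1, 'b': 2, 'c': 3}}
```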
As a result of adding hierarchical data to our language, we ended up
with the ability to use wildcards in compound expressions and as
arguments to functions, all while simplifying the semantics for
wildcard rules from something with several special cases and provisos
into something already well understood. Whenever I can accomplish more
with less, I know that I am on the right track.