Scaling factor for thin plate smoothing spline?

I have a problem which may benefit from the application of a thin plate smoothing spline.    I need to interpolate values on a 2-d grid where I have sparse data.  The thin plate kernel is the basis for the spline solution.  In 2-d the solution corresponds to

(1)   \begin{eqnarray*} \lefteqn{ \min_{\beta, \theta} \Big( \sum_{i=1}^n{ \left[ y_i - f_i( \beta, \theta) \right]^2} + } \nonumber\\ & \lambda \int_{u,v} \left( \frac{\partial^2f}{\partial u^2} \right)^2 + 2 \left( \frac{\partial^2f}{\partial u \partial v} \right)^2 + \left( \frac{\partial^2f}{\partial v^2}  \right)^2 \mathrm{d}u \mathrm{d}v \Big) \end{eqnarray*}

where y_i = y(u_i,v_i)  are data values and f_i( \beta, \theta) = f(u_i,v_i;\beta, \theta) is the model function.  Once the maximum likelihood parameter values have been determined f(\beta, \theta)  is used to generates y values at all spatial positions, (u,v).

Long story short, in the thin plate model the integral in Eq.(1) is reduced to sum over reproducing kernels

(2)   \begin{equation*} k\left( \mathbf{x}; \mathbf{x}_i \right) = \| \mathbf{x}- \mathbf{x}_i \|^2 \log \left(  \| \mathbf{x} - \mathbf{x}_i \| \right) \end{equation**}

where \mathbf{x} = \left[ u,v \right] are spatial coordinates, \mathbf{x}_i are control points and \| \mathbf{x} - \mathbf{x}_i \| is the Euclidean norm of \mathbf{x} - \mathbf{x}_i.

Referring back to Eq.(1), the minimization process determines the weight coefficient for each of the kernel functions as well as a DC offset and the coefficients for two linear terms in f( x,y).  For a given \lambda value, the number of parameter values to be estimated is equal to the number of control points plus three.  Use Generalized Cross-Validation (GCV) to determine the optimum value of \lambda.

The thin plate formulation, i.e., the reproducing kernel element of it, is nice because it allows you to determine parameter values using matrix math that you do for maximum likelihood estimation.  I get the general form of the solution.  (Follow the link above if you want the details.)  What I’m stuck is that there’s no scaling parameter in Eq.(3).  More specifically, why isn’t the kernel function

(3)   \begin{equation*} k\left( \mathbf{x}; \mathbf{x}_i, s \right) = \frac{ \| \mathbf{x}- \mathbf{x}_i \|^2 }{s^2} \log \left(  \frac{ \| \mathbf{x} - \mathbf{x}_i \| }{s} \right) \end{equation**}

Units matter.   If I’m applying thin plate splines to an image registration problem then I buy that s=1 is a natural scale and the scale term disappears.  But what if I’m working a more general problem where \mathbf{x} has units?  Eq.(3) suggests that I’ll get a different answer if I did the calculation using lat and lon for coordinates as opposed to northing and easting.  That can’t be right.  More specifically, if I go to charts #28 and #32 of Nychka’s presentation (also linked to above) I know that the elements of the W matrix have a strongly nonlinear dependence on s.   If s is treated as a parameter to be estimated then I’m not longer going to have a nice regularized regression problem to solve. The basis functions which define h(x,y) in Eq.(2) are no longer constant.  You’re varying them at the same time you’re varying the weight coefficients in order to minimize the cost function in Eq.(1).  At least that how it appears to me.  Am I missing something?  How do you get away from having to estimate that scale parameter?  Do I just try a bunch of different s values as I do \lambda values when I do GCV? Do that and then search the 2-d surface for the minimum value of the cost function?

PS  I know that \left( y_i - f_i \right) in Eq.(1) has units too and that I need to divide by a standard error to make it unitless but I can at least estimate that scaling to reasonable accuracy from experimental data.  I don’t see how I get s other than explicitly incorporate it, take the derivative of the cost function with respect to s, and include the resulting equation in the system of equations I’m solving to obtain model parameter values.  That noted, making s a parameter to be estimated will make the system of equations I need to solve unpleasantly nonlinear.

UPDATE 6/25/2014:  I think it get it now. The answer is right there in Eq.(4).  I believe you can expand Eq.(4) such that you get the expression in Eq.(2) times a constant and plus an offset.  Dropping the scale factor just changes the values of fit coefficients you determine – and makes the math a lot easier.  It shouldn’t change the best fit model function.