Three stages of marginal improvements are possible with tricks from np.einsum and the numexpr module.
Stage #1 : Compute normalized outputs using sum-reduction with einsum, then leverage numexpr to perform the square roots.
import numpy as np
import numexpr as ne

def normalize_einsum_numexpr(X):
    sq_sums = np.einsum('ij,ij->i',X,X)[:,None]
    return ne.evaluate('X/sqrt(sq_sums)')
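The subscript string 'ij,ij->i' multiplies X with itself elementwise and sums over the column axis j, i.e. it computes the squared L2 norm of each row in one pass. A tiny check on known values:

```python
import numpy as np

X = np.array([[3.0, 4.0, 0.0],
              [1.0, 2.0, 2.0]])
# Elementwise product summed over j: the squared row norms, 25 and 9
sq = np.einsum('ij,ij->i', X, X)
```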
This would be the equivalent of normalize(X).
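The normalize being matched isn't listed in this section; a minimal sketch of what it's assumed to do (row-wise L2 normalization) would be:

```python
import numpy as np

def normalize(V):
    # Assumed baseline: divide each row (3-vector) by its Euclidean norm
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```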
Stage #2 : Get a numpy.cross equivalent with slicing, using the definition of the cross product

def numpy_cross_slicing(X,Y):
    c0 = X[:,1]*Y[:,2] - X[:,2]*Y[:,1]
    c1 = X[:,2]*Y[:,0] - X[:,0]*Y[:,2]
    c2 = X[:,0]*Y[:,1] - X[:,1]*Y[:,0]
    return np.column_stack((c0,c1,c2))
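As a quick standalone sanity check of the slicing formula on a known pair (e1 x e2 = e3; the function is repeated so the snippet runs on its own):

```python
import numpy as np

def numpy_cross_slicing(X, Y):
    c0 = X[:,1]*Y[:,2] - X[:,2]*Y[:,1]
    c1 = X[:,2]*Y[:,0] - X[:,0]*Y[:,2]
    c2 = X[:,0]*Y[:,1] - X[:,1]*Y[:,0]
    return np.column_stack((c0, c1, c2))

X = np.array([[1.0, 0.0, 0.0]])
Y = np.array([[0.0, 1.0, 0.0]])
out = numpy_cross_slicing(X, Y)  # [[0. 0. 1.]]
```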
Stage #3 : Get the normalized cross product with slicing, again leveraging numexpr

def numpy_cross_norm_slicing(X,Y):
    c0 = X[:,1]*Y[:,2] - X[:,2]*Y[:,1]
    c1 = X[:,2]*Y[:,0] - X[:,0]*Y[:,2]
    c2 = X[:,0]*Y[:,1] - X[:,1]*Y[:,0]
    s = ne.evaluate('sqrt(c0**2 + c1**2 + c2**2)')
    c0 /= s
    c1 /= s
    c2 /= s
    return np.column_stack((c0,c1,c2))
This would replace normalize(np.cross(X,Y)).
Putting it all together, we would have the replacement for vectorToMatrix, like so :
def vectorToMatrix1(X,Y):
    X = normalize_einsum_numexpr(X) # make sure X is normalized
    Y = normalize_einsum_numexpr(Y) # make sure Y is normalized
    Z = numpy_cross_norm_slicing(X,Y) # Z is the normal ...
    Y = numpy_cross_slicing(Z,X)
    return np.dstack((X,Y,Z)).swapaxes(2,1)
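The original vectorToMatrix isn't shown in this section; judging from the replacement above, it presumably looks something like the following (an assumption, listed only for context; normalize is the assumed row-wise L2 helper):

```python
import numpy as np

def normalize(V):
    # Assumed helper: row-wise L2 normalization
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def vectorToMatrix(X, Y):
    # Presumed baseline: build one orthonormal 3x3 frame per point
    X = normalize(X)
    Y = normalize(Y)
    Z = normalize(np.cross(X, Y))  # normal to X and Y
    Y = np.cross(Z, X)             # re-orthogonalize Y
    return np.dstack((X, Y, Z)).swapaxes(2, 1)
```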
Runtime test
Input setup :
In [271]: X = np.random.random((10**6,3))
...: Y = np.random.random((10**6,3))
...:
Stage #1 :
In [272]: np.allclose(normalize(X), normalize_einsum_numexpr(X))
Out[272]: True
In [273]: %timeit normalize(X)
...: %timeit normalize_einsum_numexpr(X)
...:
100 loops, best of 3: 11.4 ms per loop
100 loops, best of 3: 10.6 ms per loop
Stage #2 :
In [274]: np.allclose(np.cross(X,Y), numpy_cross_slicing(X,Y) )
Out[274]: True
In [275]: %timeit np.cross(X,Y)
...: %timeit numpy_cross_slicing(X,Y)
...:
10 loops, best of 3: 29.8 ms per loop
10 loops, best of 3: 27.9 ms per loop
Stage #3 :
In [276]: np.allclose(normalize(np.cross(X,Y)), numpy_cross_norm_slicing(X,Y))
Out[276]: True
In [277]: %timeit normalize(np.cross(X,Y))
...: %timeit numpy_cross_norm_slicing(X,Y)
...:
10 loops, best of 3: 44.5 ms per loop
10 loops, best of 3: 34.9 ms per loop
Entire code :
In [395]: np.allclose(vectorToMatrix(X,Y), vectorToMatrix1(X,Y))
Out[395]: True
In [396]: %timeit vectorToMatrix(X,Y)
10 loops, best of 3: 130 ms per loop
In [397]: %timeit vectorToMatrix1(X,Y)
10 loops, best of 3: 122 ms per loop
Hence, some marginal improvement only.
Not giving up!
Looking into the bottlenecks, the many stacking steps weren't helping. So, improving on those, a modified version ended up like this :
def vectorToMatrix2(X,Y):
    X = normalize_einsum_numexpr(X) # make sure X is normalized
    Y = normalize_einsum_numexpr(Y) # make sure Y is normalized
    c0 = X[:,1]*Y[:,2] - X[:,2]*Y[:,1]
    c1 = X[:,2]*Y[:,0] - X[:,0]*Y[:,2]
    c2 = X[:,0]*Y[:,1] - X[:,1]*Y[:,0]
    s = ne.evaluate('sqrt(c0**2 + c1**2 + c2**2)')
    c0 /= s
    c1 /= s
    c2 /= s
    d0 = c1*X[:,2] - c2*X[:,1]
    d1 = c2*X[:,0] - c0*X[:,2]
    d2 = c0*X[:,1] - c1*X[:,0]
    c = [c0,c1,c2]
    d = [d0,d1,d2]
    return np.concatenate((X.T, d, c)).reshape(3,3,-1).transpose(2,0,1)
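The gain comes from replacing the per-stage column_stack/dstack calls with a single concatenate followed by a reshape and transpose. A small sketch (with arbitrary stand-in arrays) of why that layout trick reproduces the dstack-based construction:

```python
import numpy as np

N = 4
X = np.random.rand(N, 3)
Y = np.random.rand(N, 3)
Z = np.random.rand(N, 3)

# Stack the three (3, N) component blocks into (9, N), view as (3, 3, N),
# then move the points axis to the front -> (N, 3, 3)
via_concat = np.concatenate((X.T, Y.T, Z.T)).reshape(3, 3, -1).transpose(2, 0, 1)

# Equivalent construction used by vectorToMatrix1
via_dstack = np.dstack((X, Y, Z)).swapaxes(2, 1)
```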
New timings with the same million-point setup :
In [505]: %timeit vectorToMatrix(X,Y) # original code
...: %timeit vectorToMatrix1(X,Y)
...: %timeit vectorToMatrix2(X,Y)
...:
10 loops, best of 3: 130 ms per loop
10 loops, best of 3: 117 ms per loop
10 loops, best of 3: 101 ms per loop
20%+ speedup, not too bad!