We had this problem as well. We half-solved it by putting the CA workload on its own server, but I ended up writing a C++ routine that does pretty much the same thing. It ran the analysis as fast as the database could be read, with the output piped through a QualityStage job that did the statistics on the patterns it found.
Granted, this code isn't useful for Unicode data. It would take a bit of extra work to support that, but our data was not Unicode.
It looks something like this (you may need to customize the table for your data; e.g. I have 10 and 13 as unprintable, but you may want them as whitespace):
Code:
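/*
Converts strings to a pattern similar to QualityStage/IA.
B whitespace (space only; 10 and 13 are 'U' in this table)
A uppercase letter
a lowercase letter
9 digit
P punctuation
U unprintable (control characters and every byte >= 127)
N null (byte 0; also returned for an empty string)
*/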
static const unsigned char pt[256] = {
'N','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','B','P','P','P','P','P','P','P','P',
'P','P','P','P','P','P','P','9','9','9','9','9','9','9','9','9','9','P','P','P',
'P','P','P','P','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','A','A','A','A','A','A','A','A','P','P','P','P','P','P','a','a','a','a',
'a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'a','a','P','P','P','P','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U'
};
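#include <cstring>  // strlen

// This simple routine replaces every character in a string with the character
// from the table above via a lookup, so 'X' becomes 'A', '7' becomes '9', etc.
// An empty string yields "N". The result is allocated with new[], so the
// caller must release it with delete[].
char* pattern(const char* OrigInputString)
{
    const size_t leng = strlen(OrigInputString);
    // Allocate at least 2 bytes: the empty case writes 'N' plus the terminator.
    char* InputString = new char[leng == 0 ? 2 : leng + 1];
    if (leng == 0)
    {
        InputString[0] = 'N';
        InputString[1] = 0;
        return InputString;
    }
    for (size_t i = 0; i < leng; i++)
    {
        // Plain char may be signed; cast so bytes >= 128 do not index before the table.
        InputString[i] = (char)pt[(unsigned char)OrigInputString[i]];
    }
    InputString[leng] = 0;
    return InputString;
}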
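If hand-editing the 256-entry literal is a nuisance, one way to do the customization mentioned above is to build the table from ASCII ranges at startup. This is only a sketch, assuming single-byte ASCII data. `make_table`, `pattern_str` and the `cr_lf_as_white` switch are made-up names for illustration, and it returns a `std::string` so there is nothing for the caller to free.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Build the 256-entry class table from ASCII ranges instead of a literal.
// cr_lf_as_white is a hypothetical switch: when true, 10 (LF) and 13 (CR)
// are classed as whitespace 'B' instead of unprintable 'U'.
std::array<unsigned char, 256> make_table(bool cr_lf_as_white)
{
    std::array<unsigned char, 256> t;
    t.fill('U');                                   // default: unprintable
    for (int c = 33; c <= 126; ++c) t[c] = 'P';    // printable ASCII, refined below
    for (int c = '0'; c <= '9'; ++c) t[c] = '9';   // digits
    for (int c = 'A'; c <= 'Z'; ++c) t[c] = 'A';   // uppercase letters
    for (int c = 'a'; c <= 'z'; ++c) t[c] = 'a';   // lowercase letters
    t[32] = 'B';                                   // space
    t[0] = 'N';                                    // null byte
    if (cr_lf_as_white)
    {
        t[10] = 'B';
        t[13] = 'B';
    }
    return t;
}

// Same lookup as pattern(), but on std::string and with the table passed in.
std::string pattern_str(const std::string& in,
                        const std::array<unsigned char, 256>& table)
{
    if (in.empty()) return "N";                    // same convention as pattern()
    std::string out(in.size(), '\0');
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        // Cast through unsigned char so bytes >= 128 index the table correctly.
        out[i] = static_cast<char>(table[static_cast<unsigned char>(in[i])]);
    }
    return out;
}
```

With the default table, `pattern_str("AB-12 cd", make_table(false))` gives `AAP99Baa`. In a real run you would build the table once and reuse it rather than rebuilding it per call.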
It's not much of an answer, but it's 'an' answer. With performance that high, we ran the whole table for most of our data; there was no reason to sample or limit.
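The statistics side was a QualityStage job in our case, but for anyone without one, a rough stand-in is a plain frequency count over the pattern strings. This is a minimal sketch and not what the job did internally. `count_patterns` is a hypothetical helper, and it assumes the patterns have already been produced by a routine like `pattern()`.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, long> PatternCount;

// Count how often each pattern occurs and return them most frequent first.
// Ties keep the map's alphabetical order because the sort is stable.
std::vector<PatternCount> count_patterns(const std::vector<std::string>& patterns)
{
    std::map<std::string, long> freq;
    for (const std::string& p : patterns)
    {
        ++freq[p];
    }

    std::vector<PatternCount> out(freq.begin(), freq.end());
    std::stable_sort(out.begin(), out.end(),
                     [](const PatternCount& a, const PatternCount& b)
                     { return a.second > b.second; });
    return out;
}
```

The rare patterns at the bottom of the list are usually the interesting ones, since they tend to be the malformed values.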