I had one more task for GIZA++: tokenizing the parallel corpus. The simplest approach was to use spaces to separate punctuation marks from the words around them.
With C# regular expressions, this was simple. For example,
str_temp = Regex.Replace(str_temp, @"\.", " . ");
could be used to separate periods.
But there were many symbols that should be separated: ,./;'[]\<>?:"{}|`~!@#$%^&*()-_=+
Handling them one at a time with this approach would be tedious.
However, regex offers another way: a character class with a range, such as "[\x09-\x15]", matches a whole run of characters at once. To use it, I had to check the ASCII table.
I found a nice ASCII table to work from.
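In the table, the punctuation marks fall into four areas: 21-2F, 3A-40, 5B-60, and 7B-7F, in hex. (Strictly speaking, 7F is DEL, a control character rather than a punctuation mark, so the last area really ends at 7E.)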
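As a quick sanity check on those four areas, here is a minimal sketch that tests whether every symbol from the list above falls inside one of them. The class and method names (AsciiRanges, InRanges, Missing, RangeCheckDemo) are made up for illustration and are not part of my original code.

```csharp
using System;
using System.Linq;

public static class AsciiRanges
{
    // Inclusive hex ranges quoted in the post (hypothetical helper, for illustration only).
    private static readonly (int Start, int End)[] Ranges =
    {
        (0x21, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7F)
    };

    // True if the character's code falls inside any of the ranges.
    public static bool InRanges(char c) =>
        Ranges.Any(r => c >= r.Start && c <= r.End);

    // Returns the characters of `symbols` that no range covers.
    public static string Missing(string symbols) =>
        new string(symbols.Where(c => !InRanges(c)).ToArray());
}

public static class RangeCheckDemo
{
    public static void Main()
    {
        // The symbol list from the post; "" is an escaped quote in a verbatim string.
        string symbols = @",./;'[]\<>?:""{}|`~!@#$%^&*()-_=+";
        Console.WriteLine(AsciiRanges.Missing(symbols).Length); // prints 0
    }
}
```

A result of 0 means none of the listed symbols is left out, while letters and digits stay outside the ranges.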
So my solution for this task was:
str_temp = Regex.Replace(str_temp, @"([\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F])", " $1 ");
str_temp = Regex.Replace(str_temp, @" +", " ");
The second step collapses the runs of spaces left behind by the first step into a single space.
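Put together, the two steps might look like the sketch below. The wrapper names (PunctuationTokenizer, Tokenize, TokenizerDemo) are hypothetical, and the final Trim() is an addition of this sketch to drop the spaces left at the ends of a line; the second pattern is written as " +" so that any run of spaces becomes one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationTokenizer
{
    // ASCII punctuation areas from the post: 21-2F, 3A-40, 5B-60, 7B-7F (hex).
    private static readonly Regex Punctuation =
        new Regex(@"([\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F])");

    // One or more consecutive spaces.
    private static readonly Regex Spaces = new Regex(@" +");

    public static string Tokenize(string line)
    {
        // Step 1: put a space on both sides of every punctuation character.
        string spaced = Punctuation.Replace(line, " $1 ");
        // Step 2: collapse runs of spaces; Trim() is an extra not in the original steps.
        return Spaces.Replace(spaced, " ").Trim();
    }
}

public static class TokenizerDemo
{
    public static void Main()
    {
        Console.WriteLine(PunctuationTokenizer.Tokenize("Hello, world! (It's 3.14.)"));
        // prints: Hello , world ! ( It ' s 3 . 14 . )
    }
}
```

Note that this simple rule also splits decimals and contractions ("3.14" becomes "3 . 14", "It's" becomes "It ' s"), which may or may not be what you want for alignment.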
After all this, I think ASCII is an art. What do you think?