by Keyvan Nayyeri via Keyvan Nayyeri on 9/5/2008 2:15:10 PM
This is the first post in a short series about common string manipulation problems in .NET (especially C#). Each post offers code snippets as solutions, with an emphasis on real-world scenarios.
The first post is about splitting a piece of text (a paragraph, a sentence, or anything else) into the words that make it up. This is obviously a common problem and, of course, one that seems to be solved already by the built-in string manipulation methods in .NET.
However, things are not always as easy as they seem! So far I’ve seen several code snippets for this task that rely on the String.Split method and pass it a set of common separator characters.
But that isn’t the whole story! I can outline a few concerns about this implementation:
So you see that splitting on a fixed set of separator characters is not a thorough solution.
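As one illustration of the kind of problem a fixed separator set can run into, here is a minimal sketch. The separator list and the sample text are my own assumptions, chosen for the example, and are not taken from any particular snippet. It shows String.Split missing two Unicode space characters that are not in the list.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical fixed separator set, similar in spirit to the snippets described above.
    private static readonly char[] FixedSeparators = { ' ', ',', '.', ';', ':', '!', '?' };

    public static string[] NaiveSplit(string text)
    {
        return text.Split(FixedSeparators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // U+00A0 is a no-break space and U+2003 is an em space; neither is in the fixed set.
        string text = "one two\u00A0three\u2003four";
        string[] words = NaiveSplit(text);

        // Only the ordinary space is recognized, so two pieces come back instead of four words.
        Console.WriteLine(words.Length); // prints 2
    }
}
```

Adding more characters to the list helps only until the next unexpected separator shows up in real input.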
So what’s my solution? It is as simple as the following function, which takes a text as a string and returns its words as an array of strings.
// Requires: using System; using System.Collections.Generic;
public static string[] SplitIntoWords(string text)
{
    // Nothing to split: return null for null or empty input.
    if (string.IsNullOrEmpty(text))
        return null;

    // Step 1: a fixed set of common separator characters.
    var delimiterString = @" ,.:;~!@#$%^&*(){}\/[]<>|'??-_+?""=";

    // Step 2: collect Unicode separators and characters above the 2500 threshold.
    var separators = new List<char>();
    foreach (char ch in text)
    {
        if (char.IsSeparator(ch) || ch > 2500)
            separators.Add(ch);
    }

    // Step 3: merge both sets and split.
    delimiterString += new string(separators.ToArray());
    var delimiter = delimiterString.ToCharArray();
    return text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries);
}
Let me describe this function briefly. The first step is to define a fixed set of common separators, which should already be familiar to you.
The second step iterates through all the characters in the text and picks out any separator character using the Char.IsSeparator method. Internally, .NET assigns every character to one of the categories in the UnicodeCategory enumeration. Char.IsSeparator returns true for any character classified as SpaceSeparator, LineSeparator or ParagraphSeparator.
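To make the classification concrete, here is a small sketch that prints the Unicode category and the Char.IsSeparator result for a few sample characters. The helper name Describe and the choice of samples are mine, for illustration only.

```csharp
using System;
using System.Globalization;

public static class SeparatorCategoryDemo
{
    // Describes how .NET classifies a single character.
    public static string Describe(char ch)
    {
        UnicodeCategory category = char.GetUnicodeCategory(ch);
        return string.Format("U+{0:X4} {1} IsSeparator={2}",
            (int)ch, category, char.IsSeparator(ch));
    }

    public static void Main()
    {
        // Space, no-break space, line separator, paragraph separator, tab, line feed, comma.
        char[] samples = { ' ', '\u00A0', '\u2028', '\u2029', '\t', '\n', ',' };
        foreach (char ch in samples)
            Console.WriteLine(Describe(ch));
    }
}
```

One thing worth noticing in the output: tab and line feed are classified as Control, so Char.IsSeparator returns false for them. If your input contains tabs or line breaks, you may want to test Char.IsWhiteSpace as well, or add those characters to the fixed separator set.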
Besides that, I check each character’s numeric code and treat the character as a separator if the code is larger than 2500 (decimal, that is, U+09C4). This number is a rough threshold that sits below the huge set of East Asian characters.
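The sketch below applies the same numeric test to a handful of characters so you can see what ends up on each side of the threshold. The helper name IsAboveThreshold and the sample characters are assumptions of mine. Because 2500 lies well below the CJK ranges, characters from other scripts above that point (Thai, for example) are treated as separators too.

```csharp
using System;

public static class ThresholdDemo
{
    // Same numeric test as in SplitIntoWords: codes above 2500 (decimal) count as separators.
    public static bool IsAboveThreshold(char ch)
    {
        return ch > 2500;
    }

    public static void Main()
    {
        // Latin, Cyrillic, Arabic, Thai, Hiragana and a CJK ideograph, in that order.
        char[] samples = { 'a', '\u0416', '\u0633', '\u0E01', '\u3042', '\u4E2D' };
        foreach (char ch in samples)
        {
            Console.WriteLine("U+{0:X4} ({1}) above threshold: {2}",
                (int)ch, (int)ch, IsAboveThreshold(ch));
        }
    }
}
```

The first three samples stay below the threshold and survive as parts of words; the last three are above it and are dropped from the result.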
The third step merges the two lists of separators and uses the result to split the text into its words.
As written, this method works for languages whose characters fall below that threshold, but it doesn’t split East Asian text into words. Instead, it excludes those characters from its result altogether. Many existing implementations return words and sentences from these languages as long string values.
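To see that behavior without a form, here is a self-contained console sketch. It uses a reduced, illustrative separator set instead of the full list from the function above, and it returns an empty array for empty input; both are simplifications I chose for the example.

```csharp
using System;
using System.Collections.Generic;

public static class SplitDemo
{
    public static string[] SplitIntoWords(string text)
    {
        // Sketch-only choice: an empty array instead of null for null or empty input.
        if (string.IsNullOrEmpty(text))
            return new string[0];

        // Reduced, illustrative set of fixed separators (not the full list from the post).
        var separators = new List<char>(" ,.:;!?-");
        foreach (char ch in text)
        {
            // Unicode separators, plus anything above the 2500 threshold.
            if (char.IsSeparator(ch) || ch > 2500)
                separators.Add(ch);
        }
        return text.Split(separators.ToArray(), StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // English words, a no-break space (U+00A0) and two CJK ideographs (U+4E2D, U+6587).
        string text = "Hello, world\u00A0again \u4E2D\u6587 done.";
        Console.WriteLine(string.Join("|", SplitIntoWords(text)));
        // prints "Hello|world|again|done"
    }
}
```

The no-break space is picked up by Char.IsSeparator, and the two ideographs are removed by the threshold check rather than returned as one long string.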
Now I can use the code below to test my snippet:
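private void btnSplit_Click(object sender, EventArgs e)
{
    var words = SplitIntoWords(this.txtText.Text);
    // SplitIntoWords returns null for empty input, and string.Join rejects a null array.
    this.txtWords.Text = string.Join("\n", words ?? new string[0]);
}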
And it gives me my desired output:
I may write a follow-up about East Asian languages with some details that may be nice to know.
Original Post: How to Split a Text into Words