Adapting non-ASCII content

Asked by Martin

Hello ecap dev,

The modifying adapter sample, which replaces "the" with "a", works well with the newest 3.1 and the patched adapter code. Great!

But the method:

void Adapter::Xaction::adaptContent(std::string &chunk) const {
    // this is oversimplified; production code should worry about content
    // split by arbitrary chunk boundaries, efficiency, and other things

    // another simplification: victim does not belong to replacement
    static const std::string victim = "the";
    static const std::string replacement = "a";

    std::string::size_type pos = 0;
    while ((pos = chunk.find(victim, pos)) != std::string::npos)
        chunk.replace(pos, victim.length(), replacement);
}

only works for sites encoded in ASCII (which is very few these days). When you try to adapt a site with any other encoding, the replacer will not find any matches.

It is my understanding that std::string is meant for single-byte character strings, whereas std::wstring should be used for multi-byte character strings.

How would you go about handling this issue, so that this simple adaptation would work with all (or at least more) encodings?

thx /Martin

Question information

Language: English
Status: Answered
For: eCAP
Assignee: No assignee
Alex Rousskov (rousskov) said:
#1

My understanding is that you can (and should) treat std::string chunk as an opaque buffer and interpret its contents according to the right encoding.

eCAP uses std::string for buffering content and does not assign any special meaning to the buffered body bytes. In the future, we may even replace std::string with a custom and customizable buffer class to optimize body handling.
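To illustrate the "opaque buffer" idea, here is a minimal sketch (the helpers asciiToUtf16le and replaceBytes are hypothetical, not part of the eCAP API): if the body happens to be UTF-16LE, the same byte-level find/replace loop still works once victim and replacement are pre-encoded the same way as the buffer.

```cpp
#include <string>

// Hypothetical helper: widen an ASCII string into UTF-16LE bytes so it
// can be matched inside an opaque byte buffer. Assumes ASCII input only.
static std::string asciiToUtf16le(const std::string &ascii) {
    std::string bytes;
    for (const char c : ascii) {
        bytes += c;    // low byte: the ASCII code point
        bytes += '\0'; // high byte: zero for the ASCII range
    }
    return bytes;
}

// The same byte-level replace loop as the sample, but encoding-agnostic:
// victim and replacement are already encoded to match the body bytes.
static void replaceBytes(std::string &chunk,
                         const std::string &victim,
                         const std::string &replacement) {
    std::string::size_type pos = 0;
    while ((pos = chunk.find(victim, pos)) != std::string::npos) {
        chunk.replace(pos, victim.length(), replacement);
        pos += replacement.length(); // skip past the inserted text
    }
}
```

Note that std::string handles the embedded zero bytes fine because it is length-aware rather than NUL-terminated. Also, for UTF-8 bodies the original ASCII search already matches correctly, since ASCII bytes never occur inside a multi-byte UTF-8 sequence; it is encodings like UTF-16 that defeat a naive byte-for-byte search.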
