A CMS Plugin Wanted

We're developing a mini CMS for a client, and as long as the code is in our hands, it validates as XHTML 1.0 Strict. I know for sure that once we hand the project over, people will be pasting chunks of text from MS Word with its horrendous HTML. We can't really tell them, "Learn proper HTML and don't use Word." Won't happen. It's like telling me to fix my car on my own. So here's what I'm looking for...

On other people's blogs I've seen MovableType plug-ins/add-ons that process comments: add <br /> instead of new lines, strip unwanted tags, convert quotes, dashes, etc. I can't find anything of this kind for .NET.

I'm looking for a PHP/Perl/whatever script that has the above functionality. I'd like to "port" it to .NET, distribute it for free and use it for my own blog. If somebody has already done this in C# or VB.NET, please let me know. Otherwise I'll appreciate pointers from my PHP/Perl readers.
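To make the request concrete, here is a rough sketch of the processing steps listed above, written in Python rather than .NET. It is not any existing plug-in: the names `ALLOWED_TAGS`, `strip_unwanted_tags`, `smarten` and `process_comment` are made up for illustration, the tag whitelist is an assumption, and the regex-based tag handling is deliberately naive (it keeps the text inside a dropped tag and is not a security filter).

```python
import re

# Hypothetical whitelist; adjust to whatever the blog should accept.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "code"}

TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
SPLIT_TAGS_RE = re.compile(r"(<[^>]+>)")


def strip_unwanted_tags(text):
    """Remove tags not in ALLOWED_TAGS; the text between them is kept."""
    def keep_or_drop(match):
        return match.group(0) if match.group(1).lower() in ALLOWED_TAGS else ""
    return TAG_RE.sub(keep_or_drop, text)


def smarten(text):
    """Convert dashes and straight double quotes to numeric entities."""
    text = text.replace("---", "&#8212;").replace("--", "&#8211;")
    text = re.sub(r'(^|[\s(\[])"', r"\1&#8220;", text)  # opening quotes
    return text.replace('"', "&#8221;")                 # the rest are closing


def process_comment(raw):
    """Strip unwanted tags, fix typography, then turn newlines into <br />."""
    text = raw.replace("\r\n", "\n")
    text = strip_unwanted_tags(text)
    # Apply typography only outside tags so attribute quotes survive.
    parts = SPLIT_TAGS_RE.split(text)
    parts = [p if i % 2 else smarten(p) for i, p in enumerate(parts)]
    return "".join(parts).replace("\n", "<br />\n")
```

Calling `process_comment` on a pasted comment returns markup in which only the whitelisted tags remain; a quote that directly follows a closing tag is one case this simple opening/closing rule gets wrong.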

Thanks!

Comments

Comment permalink 1 Humberto Oliveira |
Have you seen the Textile script implemented by TextPattern? It is written in PHP and it is very rich. I've been using a tiny version written in JavaScript on my website, but it lacks some functionality I would like it to have.

Among the other scripts I've seen so far, it's the easiest for ordinary people to use because it relies on the power of regular expressions to turn a set of easy-to-remember characters into HTML tags.

If you need any help, I would be very pleased to join you in this enterprise.
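As a small illustration of the regular-expression idea described in this comment, the sketch below handles three Textile-style markers: `*strong*`, `_emphasis_` and `"text":url`. It is not the TextPattern implementation; `RULES` and `textile_lite` are hypothetical names and the patterns cover only the simplest cases.

```python
import html
import re

# Each rule maps an easy-to-remember marker to an HTML tag.
RULES = [
    (re.compile(r'"([^"]+)":(https?://\S+)'), r'<a href="\2">\1</a>'),
    (re.compile(r"(?<!\w)\*(.+?)\*(?!\w)"), r"<strong>\1</strong>"),
    (re.compile(r"(?<!\w)_(.+?)_(?!\w)"), r"<em>\1</em>"),
]


def textile_lite(text):
    """Escape raw HTML first, then apply the marker rules in order."""
    out = html.escape(text, quote=False)
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out
```

Escaping before substitution means any tags the user typed or pasted show up as literal text, so the only markup in the output is what the rules generate.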
Comment permalink 2 Mark Wilson |
The organisation I work for is currently mid way through writing a new in-house CMS solution for our public web site and converting all our legacy content.

The CMS uses the built in IE HTML editor with the submitted output being filtered through HtmlTidy to fix any invalid XHTML. The .NET wrapped version of HtmlTidy can be found at http://users.rcn.com/creitzel/tidy.html#dotnet

After the content has been tidied we load it into an XML object to do post-filtering of "Office crap" from the HTML then store the result in our CMS database.
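The post-filtering step could look roughly like the sketch below, again in Python for illustration. It assumes the input is already well-formed (for example, after HtmlTidy) and that "Office crap" means elements in the `urn:schemas-microsoft-com:` namespaces, `class` values starting with `Mso`, and `style` values containing `mso-`. Those rules and the name `strip_office_markup` are assumptions, not the actual filter used in the CMS described above.

```python
import xml.etree.ElementTree as ET

OFFICE_NS_PREFIX = "{urn:schemas-microsoft-com:"


def _is_office(elem):
    """True for elements such as <o:p> that live in an Office namespace."""
    return isinstance(elem.tag, str) and elem.tag.startswith(OFFICE_NS_PREFIX)


def _remove_keep_tail(parent, index):
    """Delete parent[index] but keep the text that followed it."""
    tail = parent[index].tail or ""
    if index > 0:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    del parent[index]


def _clean(elem):
    # Walk children backwards so deletions do not shift unvisited indices.
    for i in range(len(elem) - 1, -1, -1):
        if _is_office(elem[i]):
            _remove_keep_tail(elem, i)
        else:
            _clean(elem[i])
    if elem.get("class", "").startswith("Mso"):
        del elem.attrib["class"]
    if "mso-" in elem.get("style", ""):
        del elem.attrib["style"]


def strip_office_markup(xhtml):
    """Parse well-formed markup, remove Office leftovers, serialize again."""
    root = ET.fromstring(xhtml)
    _clean(root)
    return ET.tostring(root, encoding="unicode")
```

Working on a parsed tree rather than on raw text keeps the result well-formed, which is why this step only makes sense after the markup has been tidied.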
Comment permalink 3 kuntek |
Check this out: FreeTextBox. Maybe it will help you.
Comment permalink 4 Milan Negovan |
Humberto: thank you for the tip. I love RegEx'es. Not that they are easy to write, but they sure are very powerful.

Mark: that's an interesting approach! Thank you for sharing it.
Comment permalink 5 Milan Negovan |
Kuntek: my fear with rich edit controls of this kind is that they all produce bad markup. Even the sample two lines they've got up there have font tags. I wish XStandard wasn't an ActiveX control.
Comment permalink 6 Frank Zehelein |
You can look at this site:
http://photomatt.net/scripts/

There are two PHP scripts there that you can look at.

I think it is rather tricky to throw out all of Word's syntax and special characters like the long dash, quotation marks and so on.

I am really interested in your solution!
Comment permalink 7 Vlad Alexander |
XStandard for Mozilla/Firefox will be out shortly. Sign-up if you want to be notified when it is released or you want to beta test it.
Comment permalink 8 Scott |
Have you seen this: http://www.interactivetools.com/products/htmlarea/

Looks like it produces pretty reasonable mark-up. Not sure if you can really control/stop people from pasting from Word :)

The other thing to look at is SGML reader, which will allow you to convert HTML to XML (not really XHTML). From there you can at least warn users they are about to post crap.

-Scott
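The "warn users" idea can be sketched without SgmlReader by simply trying to parse the submitted fragment as XML. The Python snippet below is only a well-formedness check, not XHTML validation, and `markup_warning` is a hypothetical helper name; HTML-only entities such as `&nbsp;` would also be reported as errors by this parser.

```python
import xml.etree.ElementTree as ET


def markup_warning(fragment):
    """Return None if the fragment is well-formed XML, else a warning string."""
    try:
        # Wrap in a single root so fragments with several top-level tags parse.
        ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError as err:
        line, column = err.position
        return f"Markup is not well-formed (line {line}, column {column}): {err}"
    return None
```

A form handler could show the returned string next to the comment box and refuse to save until it comes back as `None`.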
Comment permalink 9 Milan Negovan |
Vlad: signed up, thanks!

Scott: Yep, a very promising control. But we're back to the deprecated font saga. *sigh* The cleanest one so far has been XStandard. We'll see how it goes when their Mozilla counterpart is out.
Comment permalink 10 Grant |
+1 for Tidy if you're looking for robust validation. Users will find a way to get bad markup into one of the editors (and most of them are focused on the UI, getting xplat and competing for the 1.2% marketshare of which each of them dreams). Use one of the rich editors, let it do what it's going to do and then validate and post-clean it with tidy.

This wrapper below has worked a lot better for me in the past than the COM wrapper on Charles Reitzel's page:

http://sourceforge.net/projects/ntidy/
Comment permalink 11 Vlad Alexander |
Grant, help me out please. I would really like to understand the psychology of developers who use tools that generate bad markup and then use tools to clean that markup. Some developers don't see anything wrong with this and I'd like to understand why. Thanks.
Comment permalink 12 Jeff Perrin |
I use FreeTextBox, with SgmlReader to format the html into xhtml. Works great, I've been using the combination on my own site for several months now, without any problems.
Comment permalink 13 Roger Johansson |
XStandard is very good. I've used it in a couple of projects and it's very solid once you get it up and running. Excellent tech support as well. To my knowledge, our clients haven't been able to enter any invalid markup at all. The only drawback is that it doesn't support all XHTML elements (yet).

For a project based on a CMS where XStandard is not an option, I've started writing a bunch of regular expressions to clean up the mess. It won't be bulletproof, but at least I'll catch the most common problems.
Comment permalink 14 Mark Harrison |
telerik r.a.d.editor
superb control / superb support
Comment permalink 15 Grant |
Vlad -

I'd love to help you out, Vlad, although I can't help with what you were asking, since it wasn't what I recommended. (Your inane question is duly noted, as is your likely angle. But congratulations nonetheless: I build enterprise applications on behalf of a number of Fortune 500 companies, and am frequently in the market for solid third-party componentry--consider me alienated and much less likely to eagerly recommend or standardize on your products. Nice work!)

I can help you out with your level of comprehension. Take the remedial step of reading my original comment: I didn't recommend using a tool that generates bad markup. I indicated that most tools I've worked with do, in fact, generate undesirable markup, so if you really want to be standards compliant in the real world, you're probably going to have to bring in something stronger for cleanup and validation. Ergo, Tidy.

To save you the time of a reply extolling any virtues of your product, be advised I don't consider anything tied to IE or ActiveX to be a viable recommendation, particularly given the standards focus of this site (nor am I in the habit of prematurely recommending unreleased solutions).
Comment permalink 16 Vlad Alexander |
Dude, I have no idea where your hostility towards me is coming from. I asked a fair question related to your own statement "Use one of the rich editors, let it do what it's going to do and then validate and post-clean it with tidy." There were no digs intended - at you or other developers who make the same choice. There was no devious "angle" to my question. I am a software engineer and in those rare moments when I have a break from programming I like to interact with other developers to better understand their needs, so we can improve our software and make it more relevant. We've done this in the past and it's led to great improvements in our product.

Grant, if you're ever in Toronto, stop by. I'll buy you a beer. We'll talk about Tidy.
Comment permalink 17 Grant |
Your question implied I said something I did not, namely something that would be stupid imo--specifically choosing a tool that had particularly bad markup. My reply was towards the other part of the equation: what you are likely to need to do to work around the collective limitations of the current tools if you want to generate valid XHTML and also do some of the types of post-fixing Milan listed.

In any event, no one in their right mind would look to pick a tool that generated low-quality/low-compliance mark-up if there were equivalent options that yielded better quality markup. Sometimes there are requirements which override markup quality (Gecko-based browser support, non-IE, etc.). In those cases, unless you're going to throw validation out the window, you need to do something in addition to whatever the editor is or is not doing. It's pretty simple actually, which is possibly why your original question seemed disingenuous to me.

As far as Tidy goes, we've had good results with it, ymmv.
Comment permalink 18 Mounty |
If this issue is still relevant, check out r.a.d.editor from telerik. It has a "clean Word HTML" feature which works pretty well.
Comment permalink 19 Mike Gale |
I faced a similar issue. I tried a couple of third party HTML editors (and a long time ago I even coded my own, based on MS code. Remember Triedit?). All the third party tools I tried use MSHTML and all produce markup that I find unacceptable.

I reluctantly bolted on some post processing to make it work. (I was not prepared to take longer checking out other tools, and I disagreed with the licensing pattern of one product.) One of my criteria is that it must all be .NET assemblies.

I ended up using SGMLReader, a "Word Markup Fixer" in the third party tool, and a series of filters that fix the XHTML. (I have prototype code that makes the markup follow a schema but haven't found the need to use that live yet.) The details are a very long story, which I'll not cover here. The bottom line is that this system can now round trip markup and seems robust.

So it can be done.
Comment permalink 20 Viet |
Have you tried a component called Active up Html Textbox? It has a built-in code cleaner that runs on paste, and you can get an XHTML version of what has been entered in the HTML textbox.


ActiveUp
Comment permalink 21 Milan Negovan |
I've looked up the ActiveUp control, and it looked promising. At first. The control falls flat on its face outside of IE. *sigh* Besides, those guys really need to fix their site.

The quest is still on.
Comment permalink 22 Ezequiel Espíndola |
I think the question might be stale by now, but since you posted your last comment on the day of my last birthday =) and I got here searching for an answer to whether I should build my own CMS or not, I think I might add my 1 cent for anyone interested in WYSIWYG editors for a web application.

Some people on the Community Server forums have been talking about a replacement for FreeTextBox and recommended http://www.fckeditor.net. I think it might be the best one out there, but I haven't tried it myself yet. It has very good comments. I wonder why it doesn't seem to be widely known.

I'm planning on replacing FTB, as the control just can't work with a simple tag like without destroying all your writing, at least not in the current CS implementation.

Milan, if you take a look at it please share your results. Thank you.
Comment permalink 23 Milan Negovan |
Ezequiel, I have seen this control mentioned several times already. I will give it a try.
Comment permalink 24 jaime |
www.fckeditor.com check this...
Comment permalink 25 Josh Stodola |
Well, I typed a comment this morning, but then your server all of a sudden said I was unauthorized to view the page. I found this, it is in PHP and I don't think it's bulletproof, but he is pretty reliable when it comes to security...

http://shiflett.org/blog/2007/mar/allowing-html-and-preventing-xss

Please let me know if you reach a solution; I have been looking for one in .NET for quite some time. I don't know if I have the time or the knowledge to write one myself, although I have been thinking about it lately.

Best regards...
