As some of my friends know, I decided to write my own blogging engine
way back when ... I figured it would be a fun way to immerse myself in
RSS. I don't regret that decision, but it turned out to be a lot
more work than I thought.
In this weekend's update, I decided my trackback functionality needed a
facelift. I never spent much time working on it ... and it's
quite a bit more complicated to implement than a comments engine, for
example. You can
read the spec here,
but in short, trackbacks are easiest to describe as "remote
comments." They let other blog owners (or, potentially,
news-related sites) know that you're commenting on a particular
article or post.
So the broad steps to implement a trackback engine are:
1. The Auto-Discovery phase: you want to be able to send trackbacks
automatically based on the links within your post. This one is
pretty straightforward: a regex to pull out all the links.
2. The Search phase: After the links are parsed, each link
needs to be fetched (HTTP GET) to see whether the trackback XML is embedded
in the response for that URL ... the standard format looks like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
<rdf:Description
rdf:about="http://www.foo.com/archive.html#foo"
dc:identifier="http://www.foo.com/archive.html#foo"
dc:title="Foo Bar"
trackback:ping="http://www.foo.com/tb.cgi/5" />
</rdf:RDF>
3. The Ping phase: Once the trackback URLs are known
(they're different from the actual link to the article), an HTTP POST is sent with the
requisite information (URL, excerpt, title, and blog name).
4. The Manual phase: Not all blog engines support this
specification (mine didn't until today), so sometimes a trackback URL
needs to be entered manually. This is only needed in cases where
the trackback URL can't be discovered programmatically.
5. The Receive phase: Receiving trackbacks has little in
common with sending them. Aside from receiving the URL, excerpt,
title, and blog name from the POST, there's a lot of anti-spam
functionality that can (and perhaps should) be built in. For
example, some engines fetch the incoming URL and verify that it
actually links to the post or article before accepting the trackback.
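The first three phases can be sketched roughly like this. This is a minimal illustration and not a complete engine: the class and method names are made up for the example, the regex only catches quoted absolute links, and the form field names (url, title, excerpt, blog_name) are the ones I understand the spec to use. Fetching the pages and sending the POST are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative, not from any library.
public static class TrackbackDiscovery
{
    // Namespace URIs copied from the RDF sample above.
    private static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    private static readonly XNamespace Tb = "http://madskills.com/public/xml/rss/module/trackback/";

    // Phase 1: pull quoted, absolute http(s) links out of the post's HTML.
    // Relative or unquoted hrefs are deliberately ignored in this sketch.
    public static List<string> ExtractLinks(string html)
    {
        MatchCollection matches = Regex.Matches(
            html,
            "href\\s*=\\s*[\"'](https?://[^\"']+)[\"']",
            RegexOptions.IgnoreCase);
        return matches.Cast<Match>()
                      .Select(m => m.Groups[1].Value)
                      .Distinct()
                      .ToList();
    }

    // Phase 2: locate the embedded <rdf:RDF> block in a fetched page and
    // read the trackback:ping attribute. Returns null if nothing usable is found.
    public static string FindPingUrl(string pageHtml)
    {
        Match block = Regex.Match(pageHtml, @"<rdf:RDF\b.*?</rdf:RDF>", RegexOptions.Singleline);
        if (!block.Success) return null;
        try
        {
            XElement description = XElement.Parse(block.Value).Element(Rdf + "Description");
            return description?.Attribute(Tb + "ping")?.Value;
        }
        catch (XmlException)
        {
            // Malformed RDF: treat it the same as "not discoverable".
            return null;
        }
    }

    // Phase 3: form-encode the four fields for the body of the HTTP POST.
    public static string BuildPingBody(string url, string title, string excerpt, string blogName)
    {
        return "url=" + Uri.EscapeDataString(url)
             + "&title=" + Uri.EscapeDataString(title)
             + "&excerpt=" + Uri.EscapeDataString(excerpt)
             + "&blog_name=" + Uri.EscapeDataString(blogName);
    }
}
```

When FindPingUrl returns null for a link, that's the case where the manual phase kicks in and the trackback URL has to be typed in by hand.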
Sounds simple, right? It's not too bad -- but already much more
complicated than getting a commenting engine going. The first
problem I see with virtually every blog engine out there:
why aren't duplicate trackbacks filtered?
Is it that difficult to detect duplicates when an engine receives a
ping? Assuming a trackback hit is stored in the database, a
composite key on the URL and article ID would take care of that in
a hurry -- "I'm sorry, I already have a trackback for that URL."
In fact, that's exactly what I've done on my side.
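To make the idea concrete, here's a small sketch of the duplicate check. It uses an in-memory set as a stand-in for a table keyed on (article ID, URL), so the class and method names are hypothetical and the real thing would be a database constraint. The response format (an error element of 0 or 1, plus a message on failure) is how I understand the spec's reply to look.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical receive-side store; stands in for a table whose
// composite primary key is (ArticleId, SourceUrl).
public sealed class TrackbackStore
{
    private readonly HashSet<Tuple<int, string>> _seen =
        new HashSet<Tuple<int, string>>();

    // Returns true if the trackback was stored, false if this
    // article already has a trackback from that URL.
    public bool TryAdd(int articleId, string sourceUrl)
    {
        // Only whitespace is trimmed here; any further URL
        // normalization is a policy decision left out of this sketch.
        return _seen.Add(Tuple.Create(articleId, sourceUrl.Trim()));
    }

    // Builds the XML reply: error 0 on success, error 1 plus a message otherwise.
    public static string BuildResponse(bool accepted, string message)
    {
        var response = new XElement("response",
            new XElement("error", accepted ? 0 : 1),
            accepted ? null : new XElement("message", message));
        return response.ToString(SaveOptions.DisableFormatting);
    }
}
```

A receiving handler would call TryAdd with the posted URL and, when it returns false, reply with BuildResponse(false, ...) instead of storing a second copy.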
Since auto-discovery will likely happen each time the article
is saved, any previously sent trackbacks would be duplicated and
show up in the recipient's blog. The initial trackbacks are typically
sent only when the entry is first made public, but occasionally
posts are edited. The amount of client-side logic needed to
keep track of who has been pinged, the response codes, etc., simply to
skirt a duplicate issue that should be handled on the recipient's
end is fairly intensive. (And saving network bandwidth isn't much of a
"why.")
The second issue I ran into was a new one, so I'm sharing it here
because I'm sure someone will be searching the internet with this
problem. As part of the HTTP 1.1 protocol, the WebClient and
WebRequest classes will send an
Expect: 100-continue header for HTTP
POSTs. The logic makes sense: don't send a ton of data only to find
out the server is rejecting the POST based on another header, or that the
server is issuing a 302, for example. Assuming the server is ready
for the data, the server responds, "Sure, continue..." and the data
gets sent.
If the web server being posted to does not have any concept of this
header (HTTP 1.0 for example -- but I'm certain I've seen this on some
web servers that are HTTP 1.1 capable), the result is what appears to be
a timeout. If you've experienced this problem, you may have
tried GETs just to confirm that they work, and then moved on to packet
sniffing and searching the internet. You'll be perplexed because it works on
many servers, and on some it simply appears to hang.
Of course, this could be any of a myriad of problems -- firewalls, routers,
proxies, malformed requests, to name a few. But because the
Expect header is sent only for POSTs, a quick check to see if a GET request
works is a strong indicator that this may be the issue.
There's no ability to write data to the request stream with
a GET, but it's an easy test.
So, unless you're posting large amounts of data, just turn off that
pesky Expect header by setting this value before creating a WebRequest
or WebClient object:
System.Net.ServicePointManager.Expect100Continue = false;
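If flipping the global switch feels too heavy-handed, the same setting also exists per endpoint on the request's ServicePoint. Below is a rough sketch of a trackback ping sent that way, assuming the classic HttpWebRequest API; the class name, method names, and the 10-second timeout are placeholders of mine, and the form body is assumed to be already URL-encoded.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper; names and timeout are illustrative only.
public static class TrackbackPinger
{
    // Builds the POST request without touching the network.
    public static HttpWebRequest CreatePingRequest(string pingUrl, int bodyLength)
    {
        var request = (HttpWebRequest)WebRequest.Create(pingUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded; charset=utf-8";
        request.ContentLength = bodyLength;
        request.Timeout = 10000; // milliseconds; fail fast rather than hang

        // Turn off the Expect: 100-continue handshake for this endpoint only.
        request.ServicePoint.Expect100Continue = false;
        return request;
    }

    // Sends the ping and returns the raw response body (the XML reply).
    public static string SendPing(string pingUrl, string formBody)
    {
        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        HttpWebRequest request = CreatePingRequest(pingUrl, payload.Length);

        using (Stream body = request.GetRequestStream())
        {
            body.Write(payload, 0, payload.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Trackback pings are tiny, so there's little to gain from the handshake here; for large uploads you may want to leave it on.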