A brief reintroduction to Yahoo! Pipes - Part 3 of 5

I started this series by providing a list of some different types of pipes. Then, I went over the basic structure of a pipe. Now, we’ll roll up our sleeves and dig in a little deeper to improve the results we get out of a pipe.

The quality of a pipe’s results is generally described in terms of its signal-to-noise ratio. Noisy pipes are full of useless results that eat up valuable attention. Improving the quality of a pipe’s results requires a deeper knowledge of the available modules, as well as familiarity with the specific data your pipe deals with.

Filtering negative matches

Negative matches are results that do not meet the intended output of a pipe. For example, a persistent search pipe for the acronym SOA would return results for Service Oriented Architecture, Society Of Actuaries, School of the Americas, and the Sarbanes-Oxley Act. A person looking for Service Oriented Architecture would be constantly sifting through distracting results if they didn’t filter out the negative matches.

In the case of a persistent search, some engines support advanced operators such as the “-” (minus) symbol in front of a term to omit results that contain that term. However, if your source feed is not a search, or if it comes from an engine that doesn’t support advanced operators, you can use the magical filter module.

Here are a few exploded views of the filter module (drop-down menus are opened to reveal their contents):
filter-module-block.jpg
filter-module-any.jpg
filter-module-rules.jpg
filter-module-condition.jpg

A single filter module can support multiple rules. The syntax for the rules is very similar to that used in MS Outlook (or Entourage for the Mac). In the case of our SOA example, here’s what a filter would look like for the person wanting information on Service Oriented Architecture:
filtering-soa.jpg

I used the “any” condition because a match on any one of the three rules is enough to warrant excluding an item from my pipe’s output.
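
If you’re curious what that blocking logic boils down to, here’s a minimal Python sketch of the same idea. This is not the Pipes module itself; the field names and phrases are just illustrative:

    # Rough equivalent of a filter module set to block items matching
    # "any" of the rules. Field names and phrases are illustrative.
    BLOCKED_PHRASES = [
        "society of actuaries",
        "school of the americas",
        "sarbanes-oxley",
    ]

    def keep_item(item):
        """Return False when an item matches any negative-match rule."""
        text = (item.get("title", "") + " " + item.get("description", "")).lower()
        return not any(phrase in text for phrase in BLOCKED_PHRASES)

    feed = [
        {"title": "Service Oriented Architecture governance tips"},
        {"title": "Society of Actuaries annual meeting"},
    ]
    print([item["title"] for item in feed if keep_item(item)])
    # -> ['Service Oriented Architecture governance tips']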

Duplicate filter

When combining data from multiple sources, it’s common to end up with duplicates in your feed. The unique module is a super easy way to eliminate duplicates and save time. I tend to use this module twice: once to filter items with the same link, and then a second time to filter items with the same title. I filter items with the same link because that means the items point to the same page, which I don’t need more than once. I filter items with the same title because it’s common for sites like Google News to pick up a story from another source and run the same headline under a different link. Filtering duplicates improves the signal-to-noise ratio of any multi-source feed.

Here’s what my dual unique module set-up looks like:
duplicate-filter.jpg
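
If you’d rather see that logic spelled out than inferred from the screenshot, here’s a small Python sketch of what the dual unique pass does (the item fields are illustrative):

    # A sketch of the dual unique pass: first drop items that share a
    # link, then drop items that share a title. Fields are illustrative.
    def unique_by(items, field):
        """Keep only the first item seen for each value of `field`."""
        seen = set()
        kept = []
        for item in items:
            value = item.get(field)
            if value not in seen:
                seen.add(value)
                kept.append(item)
        return kept

    feed = [
        {"title": "SOA in practice", "link": "http://example.com/a"},
        {"title": "SOA in practice", "link": "http://example.com/a"},     # same page twice
        {"title": "SOA in practice", "link": "http://example.org/copy"},  # same headline, new link
    ]
    feed = unique_by(feed, "link")   # removes the exact duplicate
    feed = unique_by(feed, "title")  # removes the reprinted headline
    print(len(feed))  # -> 1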

Normalizing dates to chronologically merge results from multiple sources

While it may seem strange to include this piece, the truth is that it is very important. All pipes have a single output, yet most have multiple inputs. Thus, merging results into a single feed is a very important step. Much of the time, people want the information from the various sources to be sorted in chronological order. The standards that define RSS were not put through a rigorous standards development process, so the format of the item dates sometimes differs from source to source, which means the sort module won’t work properly.

To normalize the dates, you’ll need to use the loop and string builder modules. Here is what the modules look like properly configured:
date-normalizer.jpg

Not all aggregated feeds will need this step in order for the sort module to do its magic. Only use this technique if your merged results are not collated. If you notice the results are stacking, take a look at the source for each feed and check for discrepancies in the item publication dates. You’ll need to copy the string parameters I have in the string builder module: %Y-%m-%dT%T (date parameters decoder).
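
To make that concrete, here’s a rough Python sketch of what the loop and string builder combination accomplishes: parse whatever date format each source uses, then rewrite every item’s date with the same parameters (%T is shorthand for %H:%M:%S). The input formats below are just examples:

    # Parse dates that arrive in differing source formats and rewrite
    # them all as %Y-%m-%dT%T so a chronological sort collates properly.
    from datetime import datetime
    from email.utils import parsedate_to_datetime

    def normalize_date(raw):
        try:
            dt = parsedate_to_datetime(raw)   # RFC 822 style, common in RSS
        except (TypeError, ValueError):
            dt = datetime.fromisoformat(raw)  # ISO 8601 style, common in Atom
        return dt.strftime("%Y-%m-%dT%H:%M:%S")

    print(normalize_date("Thu, 10 Jan 2008 04:49:00 +0000"))  # -> 2008-01-10T04:49:00
    print(normalize_date("2008-01-10 04:49:00"))              # -> 2008-01-10T04:49:00

Once every item carries the same date format, the sort module has a level playing field.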

Using Dapper to create feeds where none previously existed

Pipes is a great way to aggregate, manipulate, and mash up feeds. But what if the data you want isn’t available as a feed, such as the Technorati Top Searches? Dapper is a great service that lets you pull the data from any web page into a feed. That means you are not limited to creating pipes from feeds, because now any web page can be a feed source. In the next post in this series I’m going to share models for monetizing pipes, and using Dapper to create a feed of valuable information may be the competitive advantage you need to profit from a pipe’s output.

For more information on Dapper, check out Dapper: The Quest To Unlock Web Data (Legally).
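
If you’re curious what a service like this does under the hood, here’s a bare-bones Python sketch of the general idea: fetch a page and scrape feed-style items out of it. The URL and the pattern are purely illustrative assumptions, not a description of how Dapper actually works:

    # Turn a plain HTML page into feed-style items by pattern-matching
    # the markup. The URL and pattern (text inside <h2> tags) are
    # illustrative, not Dapper's internals.
    import re
    import urllib.request

    def page_to_items(url, pattern):
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        return [{"title": match.group(1).strip(), "link": url}
                for match in re.finditer(pattern, html)]

    items = page_to_items("http://example.com/", r"<h2[^>]*>([^<]+)</h2>")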

Ok, that’s 3 out of the 5 parts in this series. Don’t forget to check out the last two parts:

  • Pipes for profit
  • The future of pipes

I can’t wait to read the “Pipes for profit” part.

From Dan N. Moldovan on January 10th, 2008 at 4:49 am

Hi Justin - great post about Pipes. I mentioned how Pipes can be a productive way to create unique content in one of my posts at the 10e20 Blog. While I’m not a coder/programmer, I do believe that Pipes can be a productive way to produce outstanding content that meets end users’ needs. Keep us up to date!

From Jake on January 10th, 2008 at 9:41 am

Dan, that’s coming out tonight! ;)

Jake, I think one of the best uses for Pipes is intelligence gathering for content production (and syndication).

From Justin on January 10th, 2008 at 10:13 am

Pipes has its own Fetch Page module for getting HTML pages. Processing can be quite tricky, however, so it’s not for the faint-hearted.

From hapdaniel on January 17th, 2008 at 4:53 am

I haven’t played much with their Fetch Page module. How would you say it compares with Dapper in terms of power and ease of use?

From Justin on January 17th, 2008 at 10:19 am

I haven’t played much with Dapper, so comparisons will be difficult. I did try to recreate a couple of my pipes using Dapper, and failed. I think the problem was that I could not locate the advanced features for field selection as shown in the demos. I struggled to select fields for an hour and a half and then gave up.
Pipes and the Fetch Page module should be at least as powerful as Dapper, if not more so. With Pipes the user has more direct access to the HTML code. The downside of this is that some understanding of regular expressions is required to extract element values.
This pipe shows the processing typically required when using the Fetch Page module:
http://pipes.yahoo.com/pipes/pipe.info?_id=754c259e8d530ea94bf0bad53c60e5b7
By the way, I forgot to mention that Pipes can also output KML.

From hapdaniel on January 19th, 2008 at 3:49 am

That was helpful, hapdaniel. I’m going to delve into the Fetch Page module more.

From Justin on January 20th, 2008 at 2:24 am

I’m confused. On one hand you’re writing a very detailed description of what is actually a pretty simple set of functions, but when you come to something complex you go into shorthand mode, i.e. “To normalize the dates, you’ll need to use the loop and string builder modules. Here is what the modules look like properly configured”

“Normalizing dates to chronologically merge results for multiple sources” would have been a very nice thing to step through.

Did I miss it somewhere? I’ve read the tutorials in order.

From Ben Tremblay on January 31st, 2008 at 3:48 pm

Hey Ben, are you able to see the graphic below the part that reads:

“To normalize the dates, you’ll need to use the loop and string builder modules. Here is what the modules look like properly configured”

I used arrows and corresponding text to explain how the modules had to be configured to normalize the dates.

From Justin on February 2nd, 2008 at 8:31 pm

“are you able to see the graphic below the part that reads”
Yes, I surely did. And that’s the point; here you’ve only (over-)specified what I was saying: what would easily benefit from a walk-through you (rather perfunctorily) very neatly reduced to a sentence.

What really should be promoted is: that normalization is required. A simple Sort by pubDate and all its variants is a waste o’ time.

thanks
will follow this up

^5
bdt

From Ben Tremblay on February 3rd, 2008 at 3:57 am

I think I understand what you are suggesting, Ben. You would have liked to see more explanation for why one would want to normalize dates at all. This part does that:

While it may seem strange to include this piece, the truth is that it is very important. All pipes have a single output, yet most have multiple inputs. Thus, merging results into a single feed is a very important step. Much of the time, people want the information from the various sources to be sorted in chronological order. The standards that define RSS were not put through a rigorous standards development process, so the format of the item dates sometimes differs from source to source, which means the sort module won’t work properly.

What I can see now is missing is the phrase “normalize dates” in that explanatory paragraph, to make the connection with the sentence:

To normalize the dates, you’ll need to use the loop and string builder modules. Here is what the modules look like properly configured

Am I understanding you correctly? Does this answer what you were looking for? I pay close attention to making the things I write as accessible as possible, but I spend a lot of time with this material, so sometimes it’s hard to see which things need more clarity.

From Justin on February 3rd, 2008 at 1:26 pm

Well, I don’t know how it’s so confusing: I thought that “use the loop and string builder” would have been better with a text description of the configuration, is all.
Those modules are easily as complex and/or confusing as others you’ve described at greater length.

I’m not accustomed to peering at an image in order to read data or instructions.

Are we having language problems here? I18N?

From Ben Tremblay on February 3rd, 2008 at 1:37 pm

Hi - I was over at Pipes trying to make an optimised version of a new copy and was headed over here. I peeked at the source for my new feed (http://snipurl.com/1z2zy) and, to my surprise, found nothing obviously wrong! So at least in some cases pubDate does the trick.

cheers

p.s. “In some cases” … I’ve got one pipe that uses at least a couple of feeds with some really crappy formats, so this is going to come in handy sooner rather than later. Thanks again.

From Ben Tremblay on February 6th, 2008 at 1:00 am
