June 3rd, 2010
We all want shortcuts.
At the end of this month, Twitter will drop support for HTTP basic authentication (passing a username and password explicitly) in API requests. The alternative is OAuth, and Twitter has a comprehensive guide here. However, the full flow looks too complex for our needs. What we want is a method as close to HTTP auth as possible, since for researchers like us the sole purpose of making API requests is to collect data through a robot account.
Here is the simplest shortcut, illustrated in Ruby.
- Register an app at http://dev.twitter.com/apps/new, and you’ll get a consumer key and a consumer secret. On the “My Access Token” page, you will find an access token and an access secret.
- Write the four strings into a config file, e.g. $HOME/.twitter. Below is a Ruby script that writes them out as YAML.
#!/usr/bin/ruby -w
require 'yaml'

# Replace the placeholders with the four strings from dev.twitter.com.
CToken  = "AAAAAAAAAAAAAAAAAAAAA"
CSecret = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
AToken  = "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"
ASecret = "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD"

config = { "ctoken" => CToken, "csecret" => CSecret, "atoken" => AToken, "asecret" => ASecret }

# Write the credentials to ~/.twitter as YAML.
File.open("#{ENV["HOME"]}/.twitter", "w") do |f|
  f.puts config.to_yaml
end
- Authorize your application in a script, using the twitter gem.
#!/usr/bin/ruby -w
require 'pp'
require 'yaml'
require 'twitter'

# Load the four strings written by the previous script.
conf = YAML.load_file("#{ENV["HOME"]}/.twitter")

# Twitter authentication: sign requests with the pre-issued access token,
# so no interactive OAuth handshake is needed.
oauth = Twitter::OAuth.new(conf["ctoken"], conf["csecret"])
oauth.authorize_from_access(conf["atoken"], conf["asecret"])
client = Twitter::Base.new(oauth)

# Print the robot account's home timeline.
client.home_timeline.each { |tweet| pp tweet }
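Since the config file holds secrets in plain text, it may be worth restricting its permissions and checking its contents when loading. The following is a minimal sketch, assuming the same four keys as above; the helper names `write_twitter_config` and `load_twitter_config` are hypothetical and not part of the twitter gem.

```ruby
require 'yaml'

# Keys the scripts above store in the config file.
REQUIRED_KEYS = %w[ctoken csecret atoken asecret].freeze

# Hypothetical helper: write the credentials as YAML, owner read/write only.
def write_twitter_config(path, config)
  # The permission argument only applies when the file is created,
  # so chmod afterwards to cover a file that already existed.
  File.open(path, "w", 0o600) { |f| f.puts config.to_yaml }
  File.chmod(0o600, path)
end

# Hypothetical helper: load the credentials and fail early if any is missing.
def load_twitter_config(path)
  conf = YAML.load_file(path)
  raise ArgumentError, "#{path} does not contain a YAML mapping" unless conf.is_a?(Hash)

  missing = REQUIRED_KEYS.reject { |key| conf[key].is_a?(String) && !conf[key].empty? }
  raise ArgumentError, "missing keys in #{path}: #{missing.join(', ')}" unless missing.empty?

  conf
end
```

With these in place, the authentication script could start with `conf = load_twitter_config("#{ENV["HOME"]}/.twitter")` and stop with a clear message if the file is incomplete.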
February 14th, 2010
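Here is an example of downloading 《蔡澜谈日本 – 日本电影》 (Chua Lam on Japan: Japanese Cinema) from Sohu Books (搜狐读书) and compiling it into a plain-text e-book.
- First, download the book's index page and convert it to UTF-8.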
wget -c "http://lz.book.sohu.com/serialize-id-12171.html" -O index.raw
iconv -f GBK -t UTF-8 index.raw > index.raw.utf
mv -f index.raw.utf index.raw
- Second, extract each chapter's link and title from the HTML of the index page.
# find lines containing chapter links
sed -n '/<ul class="clear">/,/<\/ul>/p' index.raw | grep 'chapter.*html' > links.raw
# find links
awk -F 'href="' '{print $2}' links.raw | cut -d'"' -f1 | sed 's@^@http://lz.book.sohu.com/@' > chapterlinks.raw
# find chapter titles
awk -F '">' '{print $2}' links.raw | cut -d'<' -f1 | sed 's@$@.txt@' > chaptertitles.raw
# put links and titles together
paste chapterlinks.raw chaptertitles.raw > chapter_to_dl.raw
This yields a file with contents like the following:
http://lz.book.sohu.com/chapter-12171-111059829.html 片冈千惠藏.txt
http://lz.book.sohu.com/chapter-12171-111059833.html 冈崎宏三.txt
http://lz.book.sohu.com/chapter-12171-111059837.html 胜新太郎(一).txt
http://lz.book.sohu.com/chapter-12171-111059845.html 胜新太郎(二).txt
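- Third, download each link in the first column of chapter_to_dl.raw and save it under the file name given in the second column. This uses an awk script, download.awk. Each chapter is then converted from GBK to UTF-8.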
awk -f download.awk chapter_to_dl.raw
for mftxt in *.txt
do
    iconv -f GBK -t UTF-8 "$mftxt" > "$mftxt".utf
    mv -f "$mftxt".utf "$mftxt"
done
- Fourth, extract the actual text from each chapter's HTML file.
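for mftxt in *.txt
do
    # keep the <p> lines inside the txtBg div, strip HTML tags, and turn
    # the separator spaces into newlines (the space in the last pattern
    # may need to be the page's full-width indent space)
    sed -n '/<div .* id="txtBg">/,/<\/div>/p' "$mftxt" | grep '<p>' | sed 's/<[^>]*>//g;s/ /\n/g' > "$mftxt".part
    mv -f "$mftxt".part "$mftxt"
done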
- The last step is to concatenate the chapters into the full text.
# read titles line by line so that titles containing spaces stay intact
while IFS= read -r mfchpt
do
    echo "$mfchpt" | sed 's/\.txt$//' >> book.txt
    echo >> book.txt
    cat "$mfchpt" >> book.txt
    echo >> book.txt
done < chaptertitles.raw
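The link/title extraction and the text extraction steps above could also be done without sed and awk. Below is a rough Ruby sketch of the same idea, assuming the page layout implied by the shell commands (chapter links inside `<ul class="clear">`, body text in `<p>` elements inside the div with `id="txtBg"`). The function names `chapter_list` and `chapter_text` are made up for illustration, and regular expressions are only a stand-in for a real HTML parser.

```ruby
BASE_URL = "http://lz.book.sohu.com/"

# Return [[url, title], ...] for every chapter link inside <ul class="clear">.
def chapter_list(index_html)
  list = index_html[/<ul class="clear">.*?<\/ul>/m]
  return [] if list.nil?

  list.scan(/href="([^"]*chapter[^"]*\.html)"[^>]*>([^<]*)</).map do |href, title|
    [BASE_URL + href, title.strip]
  end
end

# Return the text of the <p> paragraphs inside the div with id="txtBg",
# one paragraph per line. Like the sed range, this stops at the first
# </div>, so a nested div would cut the text short.
def chapter_text(chapter_html)
  body = chapter_html[/<div[^>]*\bid="txtBg"[^>]*>.*?<\/div>/m]
  return "" if body.nil?

  body.scan(/<p[^>]*>(.*?)<\/p>/m).flatten
      .map { |inner| inner.gsub(/<[^>]*>/, "").strip }
      .reject(&:empty?)
      .join("\n")
end
```

Both functions expect strings that have already been converted to UTF-8.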
The resulting book.txt is the 《蔡澜谈日本 – 日本电影》 we wanted. I prefer to read it in Stanza or Good Reader. The scripts are here.
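For readers who would rather stay in one language, the encoding conversion and the final concatenation can be sketched in Ruby as well. This is only an illustration under the same assumptions as the shell version (pages served as GBK, output laid out as title, blank line, text, blank line); `gbk_to_utf8` and `assemble_book` are hypothetical names.

```ruby
# Convert raw bytes downloaded from the site (assumed GBK) to a UTF-8 string.
# Bytes that cannot be converted are replaced rather than raising an error.
def gbk_to_utf8(bytes)
  bytes.dup.force_encoding("GBK").encode("UTF-8", invalid: :replace, undef: :replace)
end

# Join [title, text] pairs into one book, mirroring the shell loop:
# title, blank line, chapter text, blank line.
def assemble_book(chapters)
  chapters.map { |title, text| "#{title}\n\n#{text}\n\n" }.join
end
```

The result of `assemble_book` can then be written out with `File.write("book.txt", ...)`.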