Twitter OAuth: a Simple Ruby Script

June 3rd, 2010

We all want shortcuts.
At the end of this month, Twitter will drop support for HTTP basic authentication (supplying a username and password explicitly) in API requests. The alternative is OAuth, and Twitter provides a comprehensive guide to it. However, the full OAuth flow is more complex than we need: for researchers like us, the sole purpose of API requests is to collect data through a robot account, so we want a method as close to HTTP authentication as possible.
Here is the simplest shortcut, illustrated in Ruby.

  1. Register an app at http://dev.twitter.com/apps/new, and you’ll get a consumer key and a consumer secret. From the page “my access token”, you can find an access token and an access secret.
  2. Write the four strings into a config file, e.g. $HOME/.twitter. The script below stores them as YAML.
    #!/usr/bin/ruby -w
    # require 'pp'
    require 'yaml'
    
    CToken   = "AAAAAAAAAAAAAAAAAAAAA"
    CSecret  = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
    AToken   = "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"
    ASecret  = "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD"
    
    config = {"ctoken" => CToken, "csecret"=>CSecret, "atoken"=>AToken, "asecret"=>ASecret}
    File.open("#{ENV["HOME"]}/.twitter", "w") do |f|
      f.puts config.to_yaml
    end
    
  3. Authorize your application in a script, using the twitter gem.
    #!/usr/bin/ruby -w
    
    require 'pp'
    require 'yaml'
    require 'twitter'
    
    # load the four strings written in step 2
    conf = YAML.load_file("#{ENV["HOME"]}/.twitter")
    
    # twitter authentication
    oauth  = Twitter::OAuth.new(conf["ctoken"], conf["csecret"])
    oauth.authorize_from_access(conf["atoken"], conf["asecret"])
    client = Twitter::Base.new(oauth)
    
    client.home_timeline.each { |tweet| pp tweet }
    exit
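
Before authenticating, it can help to check that the config file really contains all four strings. The following is a minimal sketch using only Ruby's standard library. Note that load_twitter_config and REQUIRED_KEYS are hypothetical names, not part of the twitter gem; the key names are the ones used in the config example above.

```ruby
require 'yaml'

# Keys written by the config script in step 2.
REQUIRED_KEYS = %w[ctoken csecret atoken asecret].freeze

# Hypothetical helper (not part of the twitter gem): load the YAML config
# and raise a readable error if any of the four strings is missing or empty.
def load_twitter_config(path = File.join(Dir.home, '.twitter'))
  conf = YAML.load_file(path)
  raise ArgumentError, "#{path} is not a YAML mapping" unless conf.is_a?(Hash)

  missing = REQUIRED_KEYS.select { |key| conf[key].to_s.strip.empty? }
  raise ArgumentError, "#{path} is missing: #{missing.join(', ')}" unless missing.empty?

  conf
end
```

The returned hash can then be used in place of conf in step 3, so a typo in the config file fails early with a clear message rather than as an authentication error.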
    

Update Project Page 02/13/2010: Build a txt eBook

February 14th, 2010

Here is an example of downloading 《蔡澜谈日本 – 日本电影》 from Sohu Reading (搜狐读书) and compiling it into a txt eBook.

  1. First, download the index page of 《蔡澜谈日本:日本电影》 and convert it to UTF-8.
    wget -c "http://lz.book.sohu.com/serialize-id-12171.html" -O index.raw
    iconv -f GBK -t UTF-8 index.raw > index.raw.utf
    mv -f index.raw.utf index.raw
    
  2. The second step is to extract each chapter's link and title from the HTML of the index page.
    # find lines containing chapter links
    sed -n '/<ul class="clear">/,/<\/ul>/p' index.raw | grep 'chapter.*html' > links.raw
    # find links
    awk -F 'href="' '{print $2}' links.raw | cut -d'"' -f1 | sed 's@^@http://lz.book.sohu.com/@' > chapterlinks.raw
    # find chapter titles
    awk -F '">' '{print $2}' links.raw | cut -d'<' -f1 | sed 's@$@.txt@' > chaptertitles.raw
    # put links and titles together
    paste chapterlinks.raw chaptertitles.raw > chapter_to_dl.raw
    

    The result is a file with the following contents:

    http://lz.book.sohu.com/chapter-12171-111059829.html    片冈千惠藏.txt
    http://lz.book.sohu.com/chapter-12171-111059833.html    冈崎宏三.txt
    http://lz.book.sohu.com/chapter-12171-111059837.html    胜新太郎(一).txt
    http://lz.book.sohu.com/chapter-12171-111059845.html    胜新太郎(二).txt
    
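  3. This step downloads each link in the first column of chapter_to_dl.raw and saves it under the file name given in the second column, using an awk script, download.awk. Each chapter is then converted from GBK to UTF-8.
    awk -f download.awk chapter_to_dl.raw
    for mftxt in *.txt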
    do
      iconv -f GBK -t UTF-8 "$mftxt" > "$mftxt".utf
      mv -f "$mftxt".utf "$mftxt"
    done
    
  4. The fourth step is to extract the actual text from each chapter's HTML file.
    for mftxt in *.txt
    do
      sed -n '/<div .* id="txtBg">/,/<\/div>/p' "$mftxt" | grep '<p>' | sed 's/<[^>]*>//g;s/&nbsp;&nbsp;&nbsp;&nbsp;/\n/g' > "$mftxt".part
      mv -f "$mftxt".part "$mftxt"
    done
    
  5. The last step is to concatenate the chapters into one file, each preceded by its title.
    for mfchpt in $(cat chaptertitles.raw)
    do
      echo "$mfchpt" | sed 's/\.txt$//' >> book.txt
      echo >> book.txt
      cat "$mfchpt" >> book.txt
      echo >> book.txt
    done
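
The download.awk script itself is not shown in this post. As a rough idea of what such a script has to do, here is a Ruby sketch that reads the two-column list and runs wget once per row. It is an assumption about the behaviour, not the original script, and parse_download_list, wget_command and download_all are hypothetical names.

```ruby
# Hypothetical stand-in for download.awk (the original script is not shown).
# Each input line is "<url><whitespace><filename>", as in chapter_to_dl.raw.
def parse_download_list(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    url, filename = line.split(/\s+/, 2)
    raise ArgumentError, "malformed line: #{line.inspect}" if filename.nil?

    [url, filename]
  end
end

# Build the wget argument list for one chapter. Passing an array to system
# bypasses the shell, so parentheses in file names need no quoting.
def wget_command(url, filename)
  ['wget', '-c', url, '-O', filename]
end

# Download every chapter listed in the file at list_path.
def download_all(list_path)
  parse_download_list(File.read(list_path, encoding: 'UTF-8')).each do |url, filename|
    system(*wget_command(url, filename)) or warn "download failed: #{url}"
  end
end
```

Calling download_all("chapter_to_dl.raw") would then play the role of the awk call in step 3; the GBK to UTF-8 conversion still has to follow.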
    

The resulting book.txt is the 《蔡澜谈日本 – 日本电影》 we wanted. My preference is to read it in Stanza or Good Reader. The script is here.
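
The conversion and clean-up in steps 3 and 4 can also be tried out on a single string, which makes it easier to see what each substitution does. Below is a sketch in Ruby, assuming the opening div with id="txtBg" and its closing </div> sit on separate lines; gbk_to_utf8 and extract_chapter_text are hypothetical names, and the code only roughly mirrors the sed pipeline.

```ruby
# Hypothetical helper: reinterpret raw bytes as GBK and transcode to UTF-8,
# which is what the iconv call in step 3 does for a whole file.
def gbk_to_utf8(raw)
  raw.dup.force_encoding('GBK').encode('UTF-8')
end

# Hypothetical helper roughly mirroring step 4: keep the <p> lines inside the
# id="txtBg" block, drop all tags, and turn the four-&nbsp; indent into a newline.
def extract_chapter_text(html)
  inside = false
  paragraphs = []
  html.each_line do |line|
    inside = true if line =~ /<div [^>]*id="txtBg"/
    if inside && line.include?('<p>')
      text = line.gsub(/<[^>]*>/, '').gsub('&nbsp;' * 4, "\n")
      paragraphs << text.rstrip
    end
    inside = false if inside && line.include?('</div>')
  end
  paragraphs.join("\n")
end
```

Because every paragraph starts with the four-&nbsp; indent, each one ends up preceded by an empty line in the output, which matches the effect of the second sed substitution.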