PowerShell: Get-ImdbMovie using Invoke-WebRequest

Some time ago I was playing around with Invoke-WebRequest and wanted to learn more about what I could do with that cmdlet. I ended up creating a function that gets movie information from IMDB.com. I thought I would share how the script works, since the same principles can be applied in many situations.
PS C:\> Get-ImdbMovie -Title 'Star Trek'

 Title       : Star Trek Into Darkness
 Directors   : J.J. Abrams 
 Year        : 2013
 Rating      : 7.8 
 Description : After the crew of the Enterprise find an unstoppable force of terror from within their
               own organization, Captain Kirk leads a manhunt to a war-zone world to capture a one-man
               weapon of mass destruction.
 Writers     : Roberto Orci, Alex Kurtzman, 2 more credits 
 Stars       : Chris Pine, Zachary Quinto, Zoe Saldana 
 Link        : http://www.imdb.com/title/tt1408101/

PS C:\>
This script first sends a web request to the search page on IMDB and parses it for results. If any results are returned, the script picks up the link to the first movie in the result. That link is used in a second request to get the details about the movie. The complete script can be found at the bottom of this post.

Advanced function

So let's get started. First we want to create an advanced function so that we can use it like a cmdlet. The Title parameter needs to be mandatory.
function Get-ImdbMovie
{
    [CmdletBinding()]
    Param
    (
        # Enter the title of the movie you want to get information about
        [Parameter(Mandatory=$true,
                   ValueFromPipelineByPropertyName=$true,
                   Position=0)]
        [ValidateNotNullorEmpty()]
        [string]$Title
    )
    Process { 
        #this is where the rest of the code goes
    }
}
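Because the parameter is declared with ValueFromPipelineByPropertyName, any object with a Title property can be piped in. Here is a minimal sketch of that behaviour. Show-MovieTitle is a hypothetical stand-in (not part of the script) that uses the same parameter declaration, so it runs without touching the network.

```powershell
# Hypothetical stand-in with the same parameter declaration as Get-ImdbMovie
function Show-MovieTitle
{
    [CmdletBinding()]
    Param
    (
        [Parameter(Mandatory=$true,
                   ValueFromPipelineByPropertyName=$true,
                   Position=0)]
        [ValidateNotNullOrEmpty()]
        [string]$Title
    )
    Process {
        # The Process block runs once per pipeline object
        "Looking up: $Title"
    }
}

# Objects with a Title property bind to -Title automatically
$lookups = @(
    [pscustomobject]@{ Title = 'Star Trek' }
    [pscustomobject]@{ Title = 'Alien' }
) | Show-MovieTitle

$lookups
```

The same binding should apply to Get-ImdbMovie, so a list of titles imported from a CSV file with a Title column could be piped straight in.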
Replace any spaces with %20 so that we can use the title in a URL. Other special characters could be replaced as well, but in this case we only handle spaces.
$searchTitle = $Title.Replace(' ','%20')
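If titles with other special characters need to work as well, one option is to let .NET do the encoding. The sketch below compares the simple replace with [uri]::EscapeDataString. The sample title is made up for illustration, and this is not what the script in this post uses.

```powershell
# Sample title containing a character that is reserved in query strings
$Title = 'Fast & Furious'

# Only spaces are replaced; the ampersand would start a new query parameter
$spacesOnly = $Title.Replace(' ','%20')

# EscapeDataString percent-encodes spaces and reserved characters alike
$searchTitle = [uri]::EscapeDataString($Title)

$spacesOnly    # Fast%20&%20Furious
$searchTitle   # Fast%20%26%20Furious
```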

Parse search results

Now let's fetch the search page on IMDB.com with the title in the URL.
$imdbURL = "http://www.imdb.com/search/title?title=$searchTitle&title_type=feature"
$moviesearch = Invoke-WebRequest $imdbURL
The entire web page is now stored in the $moviesearch variable. The HTML of the search page uses a class called "title" for the elements we are interested in, so we pick the first element with that class.
$titleclassarray = $moviesearch.AllElements | where Class -eq 'title' | select -First 1
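To see what that filter does without calling IMDB, here is a small sketch that runs the same pipeline over stand-in objects. The objects and their markup are made up and only mimic the Class and innerHTML properties. As far as I know, AllElements is only available in Windows PowerShell (5.1 and earlier), so on newer versions the response object would have to be parsed in some other way.

```powershell
# Stand-in objects mimicking the properties of AllElements entries (made-up data)
$allElements = @(
    [pscustomobject]@{ tagName = 'TD'; Class = 'image'; innerHTML = '<img src="poster.jpg">' }
    [pscustomobject]@{ tagName = 'TD'; Class = 'title'; innerHTML = '<a href="/title/tt1408101/">Star Trek Into Darkness</a>' }
    [pscustomobject]@{ tagName = 'TD'; Class = 'title'; innerHTML = '<a href="/title/tt0000000/">Another result</a>' }
)

# Same filter as above: keep elements with class 'title', then take the first one
$first = $allElements | Where-Object Class -eq 'title' | Select-Object -First 1

$first.innerHTML
```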
If the search did not match any movies, the "title" class shouldn't exist and $titleclassarray will be empty. Check whether it contains anything useful. If it does, we can continue; otherwise we write a warning and leave the function. Let's double check just to be sure.
try {
       $titleclass = $titleclassarray[0]
}
catch {
      Write-Warning "No movie found matching that title."
      # return leaves the function; break outside a loop could also stop the calling script
      return
}
        
if (!($titleclass)){
      Write-Warning "No movie found matching that title."
      return
}
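Since Select-Object -First 1 returns either a single element or nothing at all, a plain truthiness test is enough to tell the two cases apart. Here is a small sketch of that idea. Test-HasResult is a hypothetical helper used only for illustration.

```powershell
# Hypothetical helper: $true when the filtered search result holds something
function Test-HasResult
{
    Param ($Element)
    # $null, an empty string and an empty array all convert to $false
    [bool]$Element
}

$noMatch = @() | Select-Object -First 1                  # nothing in, nothing out: $null
$match   = @('first','second') | Select-Object -First 1  # 'first'

Test-HasResult $noMatch   # False
Test-HasResult $match     # True
```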
Now we want to get the link to the main page for the first movie in the search result. First we use a regular expression to parse the HTML and find all links. The regular expression matches every anchor tag that has an href attribute and captures the URL in it. All the links found are stored in the $linksFound variable.
$regex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"
$linksFound = [Regex]::Matches($titleclass.innerHTML, $regex, "IgnoreCase")
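To get a feel for what the regular expression captures, it can be run against a small piece of sample markup. The HTML below is made up for illustration (including the nm0000000 id); the regular expression is the one from the script.

```powershell
# Made-up markup with one double-quoted and one single-quoted, upper-case link
$html = @'
<a href="/title/tt1408101/">Star Trek Into Darkness</a>
<A HREF='/name/nm0000000/'>Some Person</A>
'@

$regex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"
$linksFound = [Regex]::Matches($html, $regex, "IgnoreCase")

# Groups[0] holds the whole opening tag, Groups[1] the captured URL
$urls = foreach ($m in $linksFound) { $m.Groups[1].Value }

$urls
```

Both links are found because the IgnoreCase option makes the pattern match upper-case tags too, and the character class accepts either kind of quote.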
Any link to a movie page will contain the string "/title/" in the URL, so let's go through all the links to see if we can find any link with that string. The results will be saved in the variable $titlelink.
$titlelink = New-Object System.Collections.ArrayList
foreach($link in $linksFound)
{
     $trimmedlink = $link.Groups[1].Value.Trim()
     if ($trimmedlink.Contains('/title/')) 
     {
           [void] $titlelink.Add($trimmedlink)
     }
}
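The same filtering can also be written as a pipeline. This is only an alternative sketch with made-up links, not what the script uses. Note the @( ) around the pipeline: without it a single match would come back as a plain string, and indexing a string with [0] returns its first character instead of the whole link.

```powershell
# Made-up links, one of them with surrounding whitespace
$links = @('/name/nm0000000/', ' /title/tt1408101/ ', '/title/tt1408101/fullcredits')

# Trim each link and keep the ones pointing at a title page;
# @( ) guarantees an array even when only one link matches
$titlelink = @($links | ForEach-Object { $_.Trim() } | Where-Object { $_ -like '*/title/*' })

$titlelink[0]   # /title/tt1408101/
```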

Parse movie page

Next we want to fetch the actual movie page from IMDB. The first string in $titlelink is appended to http://www.imdb.com. Let's fetch the page in English to make it easier. Change this to any language you want.
$movieURL = "http://www.imdb.com$($titlelink[0])"
$moviehtml = Invoke-WebRequest $movieURL -Headers @{"Accept-Language"="en-US,en;"}
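String concatenation works here because the links on the search page are relative. If that ever changes, the System.Uri class can combine a base address and a link for us. Here is a small sketch of that alternative, which is not used in the script.

```powershell
# Base address of the site and a relative link as found on the search page
$base = New-Object System.Uri -ArgumentList 'http://www.imdb.com'
$movieUri = New-Object System.Uri -ArgumentList $base, '/title/tt1408101/'

$movieUri.AbsoluteUri   # http://www.imdb.com/title/tt1408101/
```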
Now the HTML for the movie page is stored in the $moviehtml variable. Again we need to parse through the HTML to get the information we want. First, create the movie object. This is what will be returned to the pipeline in the end.
$movie = New-Object -TypeName psobject
Search through the HTML for the first tag which uses the class called 'itemprop'. Its innerText property gives us the text within that tag. This is where the title is. Add the title to the $movie object.
Add-Member -InputObject $movie `
            -MemberType 'NoteProperty' `
            -Name "Title" -Value `
            ($moviehtml.AllElements | where Class -eq 'itemprop' | select -First 1).innerText
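When several values are known up front, the object can also be created in one go with a [pscustomobject] literal, and Add-Member still works on it afterwards. Here is a minimal sketch using hard-coded values taken from the example output at the top of the post.

```powershell
# Create the object with its first properties in a fixed order
$movie = [pscustomobject]@{
    Title = 'Star Trek Into Darkness'
    Year  = '2013'
}

# Further properties can still be attached later on
Add-Member -InputObject $movie -MemberType NoteProperty -Name 'Rating' -Value '7.8'

$movie
```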
Continue to search through the HTML for the other information that we want.
Directors is a bit more tricky. It's in a tag using the class "txt-block", but that class is used in more places, so we need to dig a bit deeper and find the line which contains the text "Director:" (it sits in an H4 tag with the class "inline"). Then we need to remove some HTML tags around the text that we want to save.
foreach ($line in ($moviehtml.AllElements | where Class -eq 'txt-block').InnerHTML -split "`n"){
    if ($line -like '*Director:*'){
        $line = $line.Replace('</SPAN></A>','')
        Add-Member -InputObject $movie `
                   -MemberType 'NoteProperty' `
                   -Name "Directors" `
                   -Value $line.Remove(0,$line.LastIndexOf('>')+1)
    }
}
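The string handling is easier to follow on a single sample line. The markup below is made up to resemble such a line. The tags are upper case because the Replace method is case sensitive and the script looks for '</SPAN></A>' exactly.

```powershell
# Made-up line resembling the markup inside the txt-block element
$line = '<H4 class=inline>Director:</H4> <A href="/name/nm0000000/"><SPAN class=itemprop>J.J. Abrams</SPAN></A>'

if ($line -like '*Director:*') {
    # Drop the closing tags so that the name is the last thing on the line
    $line = $line.Replace('</SPAN></A>','')
    # Keep everything after the last remaining '>'
    $director = $line.Remove(0, $line.LastIndexOf('>') + 1)
}

$director   # J.J. Abrams
```

This approach assumes a single director on the line. With several names, only the text after the last tag would be kept.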
We continue like this to parse out what we need for the rest of the information.
# Year can be found in the tag using a class called "nobr" where we select the first result
# and then remove the parentheses.
Add-Member -InputObject $movie `
           -MemberType 'NoteProperty' `
           -Name "Year" `
           -Value (($moviehtml.AllElements | 
                    where Class -eq 'nobr' | 
                    select -First 1).innerText).Replace('(','').Replace(')','')
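The two chained Replace calls can also be written as a single -replace with a character class. Here is a quick sketch on a sample value, assuming the text looks like '(2013)'.

```powershell
$raw = '(2013)'

# Chained string replaces, as in the script
$chained = $raw.Replace('(','').Replace(')','')

# One regular expression that removes both kinds of parentheses
$regexed = $raw -replace '[()]', ''

$chained   # 2013
$regexed   # 2013
```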

# Rating can be found in the tag using the class "titlePageSprite star-box-giga-star"
Add-Member -InputObject $movie `
           -MemberType 'NoteProperty' `
           -Name "Rating" `
           -Value ($moviehtml.AllElements | 
                    where Class -eq 'titlePageSprite star-box-giga-star' | 
                    select -First 1).innerText

# Description can be found in the tag whose itemprop attribute is "description"
Add-Member -InputObject $movie `
           -MemberType 'NoteProperty' `
           -Name "Description" `
           -Value ($moviehtml.AllElements | 
                    where itemprop -eq 'description' | 
                    select -first 1).InnerText

# Writers
Add-Member -InputObject $movie `
           -MemberType 'NoteProperty' `
           -Name "Writers" `
           -Value ($moviehtml.AllElements | 
                    where itemprop -eq 'creator' | 
                    select -first 1).InnerText.Replace('Writers:','').Replace(' »','')

# Stars
Add-Member -InputObject $movie `
           -MemberType 'NoteProperty' `
           -Name "Stars" `
           -Value ($moviehtml.AllElements | 
                    where itemprop -eq 'actors').InnerText.Replace(
                             'Stars:','').Replace(' | See full cast and crew »','')
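One thing to be aware of: if an element is missing from the page, InnerText is $null and calling Replace on it raises an error. Below is a sketch of a null-tolerant cleanup. Remove-Fragment is a hypothetical helper that is not part of the script, and the sample text is made up to match the format the script expects.

```powershell
# Hypothetical helper: strips the given fragments from a text and tolerates $null input
function Remove-Fragment
{
    Param (
        [string]$Text,
        [string[]]$Fragment
    )
    # A $null argument arrives as an empty string because of the [string] type
    if ([string]::IsNullOrEmpty($Text)) { return $null }
    foreach ($f in $Fragment) { $Text = $Text.Replace($f, '') }
    # Remove the whitespace left behind by the stripped fragments
    $Text.Trim()
}

$sample  = 'Stars: Chris Pine, Zachary Quinto, Zoe Saldana | See full cast and crew »'
$stars   = Remove-Fragment -Text $sample -Fragment 'Stars:', ' | See full cast and crew »'
$missing = Remove-Fragment -Text $null -Fragment 'Stars:'

$stars   # Chris Pine, Zachary Quinto, Zoe Saldana
```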
The last thing we want to add to the object is the link to the page where the information comes from. After that the object can be returned to the pipeline.
Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Link" -Value $movieURL
$movie

Complete cmdlet with verbose output

<#
.Synopsis
   This script gets some basic information from IMDB about a movie by parsing the HTML code on the website.
.DESCRIPTION
   This script gets some basic information from IMDB about a movie by parsing the HTML code on the website.
.EXAMPLE
   Get-ImdbMovie -Title 'star trek'
.EXAMPLE
   Get-ImdbMovie -Title 'star trek' -verbose
.NOTES
   Created by John Roos
   http://blog.roostech.se
#>
function Get-ImdbMovie
{
    [CmdletBinding()]
    Param
    (
        # Enter the title of the movie you want to get information about
        [Parameter(Mandatory=$true,
                   ValueFromPipelineByPropertyName=$true,
                   Position=0)]
        [ValidateNotNullorEmpty()]
        [string]$Title
    )

    Process
    {
        $searchTitle = $Title.Replace(' ','%20')

        Write-Verbose "Fetching search results"
        $moviesearch = Invoke-WebRequest "http://www.imdb.com/search/title?title=$searchTitle&title_type=feature"
        
        Write-Verbose "Moving html elements into variable"
        $titleclassarray = $moviesearch.AllElements | where Class -eq 'title' | select -First 1
        
        Write-Verbose "Checking if result contains movies"
        try {
            $titleclass = $titleclassarray[0]
        }
        catch {
            Write-Warning "No movie found matching that title http://www.imdb.com/search/title?title=$searchTitle&title_type=feature"
            # return leaves the function; break outside a loop could also stop the calling script
            return
        }
        
        if (!($titleclass)){
            Write-Warning "No movie found matching that title http://www.imdb.com/search/title?title=$searchTitle&title_type=feature"
            return
        }
        
        Write-Verbose "Result contains movies."
        
        Write-Verbose "Parsing HTML for movie link."
        $regex = "<\s*a\s*[^>]*?href\s*=\s*[`"']*([^`"'>]+)[^>]*?>"
        $linksFound = [Regex]::Matches($titleclass.innerHTML, $regex, "IgnoreCase")
        
        $titlelink = New-Object System.Collections.ArrayList
        foreach($link in $linksFound)
        {
            $trimmedlink = $link.Groups[1].Value.Trim()
            if ($trimmedlink.Contains('/title/')) {
                [void] $titlelink.Add($trimmedlink)
            }
        }
        Write-Verbose "Movie link found."

        $movieURL = "http://www.imdb.com$($titlelink[0])"
        Write-Verbose "Fetching movie page."
        $moviehtml = Invoke-WebRequest $movieURL -Headers @{"Accept-Language"="en-US,en;"}
        Write-Verbose "Movie page fetched."

        $movie = New-Object -TypeName psobject

        Write-Verbose "Parsing for title."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Title" -Value ($moviehtml.AllElements | where Class -eq 'itemprop' | select -First 1).innerText

        Write-Verbose "Parsing for directors."
        foreach ($line in ($moviehtml.AllElements | where Class -eq 'txt-block').InnerHTML -split "`n"){
            if ($line -like '*Director:*'){
                $line = $line.Replace('</SPAN></A>','')
                Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Directors" -Value $line.Remove(0,$line.LastIndexOf('>')+1)
            }
        }

        Write-Verbose "Parsing for year."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Year" -Value (($moviehtml.AllElements | where Class -eq 'nobr' | select -First 1).innerText).Replace('(','').Replace(')','')

        Write-Verbose "Parsing for rating."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Rating" -Value ($moviehtml.AllElements | where Class -eq 'titlePageSprite star-box-giga-star' | select -First 1).innerText

        Write-Verbose "Parsing for description."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Description" -Value ($moviehtml.AllElements | where itemprop -eq 'description' | select -first 1).InnerText

        Write-Verbose "Parsing for writers."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Writers" -Value ($moviehtml.AllElements | where itemprop -eq 'creator' | select -first 1).InnerText.Replace('Writers:','').Replace(' »','')

        Write-Verbose "Parsing for stars."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Stars" -Value ($moviehtml.AllElements | where itemprop -eq 'actors').InnerText.Replace('Stars:','').Replace(' | See full cast and crew »','')

        Write-Verbose "Adding the link."
        Add-Member -InputObject $movie -MemberType 'NoteProperty' -Name "Link" -Value $movieURL

        Write-Verbose "Returning object."
        $movie
    }
}

1 comment

  1. Excellent, im taking part of your function to search and rename movie files i have compared to imdb titles, thank you!
