Skritter | Anyone interested in Tatoeba Sentence scraper?

Created 13 years ago

kaysik April 17th, 2012 1:02a.m.

Motivated by the recent blog post, I went to get myself some more sentences. However, due to my nooblyness, the vast majority I find are way above my level. So what does one do in such circumstances? Well, you write your own scraper, of course.

I wrote it so you export your words/characters from Skritter and save them to a text file. Then you run the scraper, which parses all Chinese sentences with audio and checks each one to make sure you know every character. If so, it downloads the sentence along with its pinyin and English translation to a text file, plus the audio mp3. The resulting text file can then be imported directly into Anki or used for whatever else you like. I checked Tatoeba's legal terms, and as long as I (or anyone using the tool) give them full credit for the content, it's all good.
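The check described above can be sketched roughly as follows. This is only an illustration in Python, not the actual tool: the function names, the CJK range test, and the four-column output layout are all assumptions. The `[sound:...]` field is the syntax Anki uses for audio references.

```python
def is_cjk(ch):
    """Rough test for a Chinese character (basic CJK block only, an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"

def load_known_chars(words):
    """Collect every Chinese character that appears in the exported word list."""
    return {ch for word in words for ch in word if is_cjk(ch)}

def is_readable(sentence, known):
    """True if every Chinese character in the sentence is already known."""
    return all(ch in known for ch in sentence if is_cjk(ch))

def to_anki_line(sentence, pinyin, english, mp3_name):
    """Build one tab-separated line (hypothetical column order) for import."""
    return "\t".join([sentence, pinyin, english, f"[sound:{mp3_name}]"])

# Hypothetical word list standing in for a Skritter export.
known = load_known_chars(["你好", "我", "是", "学生"])

print(is_readable("我是学生。", known))  # every character is in the list
print(is_readable("我是老师。", known))  # 老 and 师 are not in the list
print(to_anki_line("我是学生。", "wǒ shì xuésheng.", "I am a student.", "123.mp3"))
```

Punctuation is skipped by the range test, so only actual characters count against the known set.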

Right now it'll crash if not set up exactly right, but it works well enough for me. If others would like it, though, I can fix it properly and post it. Let me know if you care.

funchinese April 17th, 2012 2:01a.m.

Hey kaysik, I'd very much appreciate that. I have been thinking there has to be a thing exactly like this but I have never been able to find one. It would be of enormous use to me!

tsr April 17th, 2012 2:18a.m.

kaysik - That would be awesome. I have the same problem you have and would definitely benefit from it.

Catherine :) April 17th, 2012 7:26a.m.

that sounds great!

えっと April 17th, 2012 9:25a.m.

Would it work with Japanese as well? Seems like a great idea!

Mandarinboy April 17th, 2012 9:57a.m.

Very good idea. I just have one question about "parses all Chinese sentences with audio": does that mean you only get the sentences that have audio? Those are relatively few, at least for Chinese and Japanese. I do the same with a crappy app I wrote, where I can select either way, since I at least am often mostly interested in the sentence itself and its structure. For beginners it could also be useful to be able to do the same with e.g. ChinesePod, which has great set phrases. I just wish we had a Skritter API that let us use our vocabulary online instead of having to download it for tools like this. By the way, I will gladly use your app. Thanks for taking the time to offer it to the rest of us.

kaysik April 17th, 2012 10:19a.m.

Seems like it's definitely worth it, so I should have a clean, non-programmer-friendly version up in the next few days. Here are a few points to keep in mind:

- Windows only. If you want *nix or mac there are already scripts available.

- It's very hacky. If Tatoeba change their site at all, it will start failing BADLY (and probably crash).

- At the moment, yes, it only looks for sentences with audio. I can add a command-line argument to make it search them all, but that might not make it into version 1 since I wasn't interested in that originally.

- Only does Chinese (and simplified at that) currently. I think I can make a Japanese version without a huge amount of pain. I'll get back to you on that.

- Only does Tatoeba, since I specifically wanted to get sentences with audio. The CP glossary has great sentences, but I'm not allowed to steal their audio. Also, the CP glossary is only searchable; I can't browse and check every sentence along the way. To make it work I'd have to search for every word, then check every sentence returned by every search against the allowed characters. But since a sentence might turn up in many different searches, I'd have to keep track of found results and skip duplicates and so forth ... Probably won't happen any time soon.

- It's pretty slow. For my 350-odd known characters, searching their 36 pages of Chinese sentences with audio, it took about 5 minutes to download it all. Since it's not something you'll run often I personally don't care, but if you have a few thousand known characters it might start taking AAAAGES ... you've been warned!
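The search-then-deduplicate approach mentioned for a search-only glossary could look something like the sketch below. It is a Python illustration under assumptions: `search` is a hypothetical callable standing in for whatever site lookup would be used, and the sample data is made up.

```python
def collect_unique(search, words, is_allowed):
    """Search every word, keep each allowed sentence once, in first-seen order."""
    seen = set()      # every sentence already examined, allowed or not
    results = []
    for word in words:
        for sentence in search(word):
            if sentence in seen:
                continue          # same sentence returned by an earlier search
            seen.add(sentence)
            if is_allowed(sentence):
                results.append(sentence)
    return results

# Made-up search results standing in for a real glossary lookup.
fake_index = {
    "我": ["我是学生。", "我很好。"],
    "是": ["我是学生。", "他是老师。"],
}
known = set("我是学生很好")

def allowed(sentence):
    # Ignore the full stop; every other character must be known.
    return all(ch in known for ch in sentence if ch != "。")

print(collect_unique(lambda w: fake_index.get(w, []), ["我", "是"], allowed))
```

Tracking rejected sentences in `seen` as well avoids re-checking them each time they come up again.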

nick April 17th, 2012 5:32p.m.

Mandarinboy, what kind of API calls would you like to make to Skritter? (Briefly--just trying to understand the use cases.)

dusan April 17th, 2012 9:11p.m.

Very interesting! Can you put the code on Github (or on a similar site)?

Mandarinboy April 18th, 2012 7:16a.m.

@nick: It would be nice to get access to my words via an API. We could use this for all those tools developed for sentence harvesting, texts, building new words from known characters, etc. I guess that letting us search the words would cost you a lot of processing. I'm thinking more of being able to sync the data to a local data store whenever I use my tools. For myself that would be enough, but there might be others with more needs.

StEskil April 18th, 2012 4:30p.m.

This is very interesting, because what I need are simple sentences just like this. I'm interested in trying it.

kaysik April 18th, 2012 8:00p.m.

Ok version 0.0.1 is up: *EDIT* removed, get v2 below *EDIT*

Unzip it wherever you like. Read the readme for full instructions and then run the exe once you've filled in your word list. If you have any issues let me know!

I'll try to get a Japanese-friendly version done on the weekend, and a version that will get all sentences, even those without audio (that might take longer since it's kind of based around the mp3 download at the moment).

Enjoy!

StEskil April 19th, 2012 1:54a.m.

My computer (W7) refused to run it, saying that there's no msvcr100.dll on the computer. There is one, and I even copied it to the tatrip.exe directory.

Mandarinboy April 19th, 2012 2:18a.m.

@StEskil, that is probably because you do not have the Microsoft Visual C++ redistributable package installed. The program assumes you have it. You can download it from Microsoft: http://answers.microsoft.com/en-us/windows/forum/windows_7-windows_programs/the-program-cant-start-becuase-msvcr100dll-is/5c9d301a-2191-4edb-916e-5e4958558090 Install it and it works. Note: use the Microsoft Visual C++ 2010 Redistributable Package. I just tested on one of my dev PCs, got the same error as you, installed this, and now it works.

@Kaysik, when distributing programs it is easiest to use the VS installer, since you then get all the prerequisites bundled into your MSI file. Now I will test the actual program. Thanks!

funchinese April 19th, 2012 2:18a.m.

I received the same message.

Mandarinboy April 19th, 2012 3:15a.m.

It works great with your default words but fails when I load my own. I get 2 sentences and then it fails. If I run my debugger I get an error saying: error missing operator: this CXX0017: Error: symbol "this" not found. It is a very nice feature and I like the sentences I do get from your words. It is a great way to practice listening.

kaysik April 19th, 2012 3:35a.m.

Ahh yeah, I always forget the redist package >.< My bad on that, folks. Mandarinboy is exactly right about the steps to fix it.

As for the error, are you able to email me your word list so I can fix it? Kaysik at the gmail.com. It's pretty badly written, as I never originally intended to give it out, so it assumes lots of exact character positions in the HTML etc. Hopefully it'll be easy to fix. If you're super keen I can give you the source, but it was written at 2am so you might vomit if you see it haha

Mandarinboy April 19th, 2012 3:53a.m.

Thanks Kaysik, I will mail you that in the evening my time; I have to run to the airport now. I would gladly look at your code, I love reading code. I write crappy code myself, so no worries :-) Fast and ugly are my keywords. I just love solutions, and this is a very nice solution to a need many of us have.

kaysik April 19th, 2012 5:57a.m.

Fixed "only gets 2 sentences" build: http://oberins.com/ptofiles/TatoebaChineseScraper_2.zip

Turns out some of my hacks were detecting comments about the translations as actual translations. It now stops parsing when it gets to the comments section, so no matter what they're saying about the sentence, I'll ignore it!
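A fix of this kind can be sketched by cutting the page off at the start of the comments section before looking for translations. The Python below is only an illustration: the marker string is hypothetical, since the real page markup is not shown in this thread.

```python
# Hypothetical marker; the actual markup of the comments section is an assumption.
COMMENTS_MARKER = '<div class="comments">'

def strip_comments(html, marker=COMMENTS_MARKER):
    """Return only the part of the page that comes before the comments section."""
    cut = html.find(marker)
    return html if cut == -1 else html[:cut]

page = (
    '<div class="translation">Hello.</div>'
    '<div class="comments">I think "Hi." would be a better translation.</div>'
)
print(strip_comments(page))
```

Anything a commenter writes after the marker never reaches the translation parser.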

Twice now I've got a "Page Parsing failed: A non-blocking socket operation could not be completed immediately." error. Can't figure out why, but I'm working on it...

StEskil April 19th, 2012 8:57a.m.

@kaysik & @Mandarinboy
- Excellent. It's loading the sentences now, and for my 1000+ characters/2200 items it finds a lot of sentences. Next up is setting up Anki from scratch because of a new computer.

funchinese April 19th, 2012 10:29a.m.

I still get the error message.

StEskil April 19th, 2012 5:40p.m.

I got the error message a dozen times with version 2, but the program finished fine with 701 sentences. It's certainly enough for the moment...

funchinese April 19th, 2012 7:37p.m.

I get the message: "This application has failed to start because MSVCP100.dll was not found."

When the command prompt shows up I wait to see if it downloads the sentences, but without success. I'm using Windows Vista.

kaysik April 20th, 2012 2:57a.m.

@funchinese: As mentioned above in this thread, the MSVC DLL error is fixable by installing the VC10 redistributable. Mandarinboy linked it above, it's also linked in the readme file of v2, and because I can, here it is again: http://www.microsoft.com/download/en/details.aspx?id=5555 That'll fix the missing DLL error right up.

The other error, about the non-blocking socket operation, is one you can click past (or restart) and it should keep working. Hopefully I'll get to this one over the weekend, and then I can look at giving people Japanese/all-sentence versions some time next week!
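Since clicking past the socket error and carrying on seems to work, one common way to handle a transient failure like this is to retry the download a few times automatically. The Python below is a generic sketch of that idea, not how the tool handles it: `fetch` is a hypothetical download function, and the stub only simulates two failures.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on OSError up to `attempts` times in total."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as err:          # socket errors are OSError subclasses
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay)       # brief pause before the next try
    raise last_error

# Stub that fails twice and then succeeds, standing in for a flaky download.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise BlockingIOError("operation could not be completed immediately")
    return "<html>page</html>"

print(fetch_with_retry(flaky_fetch, "http://example.invalid/page", delay=0))
```

If every attempt fails, the last error is raised so the caller still sees what went wrong.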

funchinese April 21st, 2012 6:46a.m.

I cannot install the redistributable. When I click to download I get two files:

1.Microsoft Visual C++ Redistributable Package (x64)
2.Microsoft Visual C++ Redistributable Package (x86)

When I click and try to install the first one it seems to unpack itself, but then I get the following error message: The setup has detected that the computer does not meet the requirements to install this software. The following blocking issues must be resolved before you can install Microsoft Visual C++ 2010 Redistributable Setup software package.

Please resolve the following:

This setup program requires an x64 platform. It cannot be installed on this platform.

When I try to open the second file it tells me to choose a program to open it with.

I know I have 32-bit Windows, but I couldn't find such a setup file to download when I searched for "Visual C++ Redistributable (x32)".

I am lost when it comes to computers, hopefully someone can help me!

dusan April 21st, 2012 8:57a.m.

@funchinese, try to download and install the second one.
Edit: That's weird that it doesn't open, it's a .exe file. Can you try downloading it again?

funchinese April 21st, 2012 10:32a.m.

I renamed the file to .exe and then it worked. I tried with the sample words and I'm getting a lot of error messages along the way, but the example sentences with the audio are amazing. I will now try with my own words and tell you the result afterwards!

funchinese April 21st, 2012 10:44a.m.

Unfortunately it doesn't work with my own words, which is a pity because it could be really useful.

I get the following message after a while:

Pto Message

Win Error: (12002) Page parsing failed: Unkown text

Catherine :) April 21st, 2012 10:53a.m.

I get these three error messages, and the sentences text file is empty :(

https://lh6.googleusercontent.com/-Tvc7d_Oqzbg/T5LJVHsrAHI/AAAAAAAAAKs/gRxXx_QquoQ/s448/Untitled.png

https://lh5.googleusercontent.com/-LJim-9qBsIU/T5LJVB6CubI/AAAAAAAAAKs/VoIlU2c6Pmw/s412/Untitled2.png

https://lh4.googleusercontent.com/-bQHRE5SmvGs/T5LJVKl502I/AAAAAAAAAKs/7cMBI77wKag/s406/Untitled3.png

kaysik April 22nd, 2012 12:57a.m.

Mind sending me your word list Catherine? My username @gmail.com? The errors I'm getting are annoying but the download completes ok so I'm just ignoring them. However if your sentence file is coming out empty clearly something else is broken. If I can get it to happen on my computer with your word file then I can fix it hopefully.

Catherine :) April 22nd, 2012 7:48a.m.

Ok have done. Just to point out, I still get those messages even using the words that came with it.
Thanks for your help!

kaysik April 22nd, 2012 8:58a.m.

@Catherine: replied back what I get with your word list.

I have some bad news about the other versions. The URL I'm parsing is: http://tatoeba.org/eng/sentences/with_audio/cmn It turns out there is no Japanese version of that page ( http://tatoeba.org/eng/sentences/with_audio/jpn doesn't exist). This means I can't make a Japanese version easily.

Secondly, I had a look at just parsing all the sentences from the generic search ( http://tatoeba.org/eng/sentences/show_all_in/cmn/eng/none ) and it's a totally different layout, so I'd have to write an entirely new parser (even the divs have different names etc). If anyone is super keen to adapt what I have I'll email you my code, but it literally just downloads the page into a big buffer and then reads text at fixed offsets from specific marker strings. It would probably be better to start again and write it properly. I am interested in doing this one day, but I wouldn't hold my breath for it. When I do get round to it I will of course post it here, but it might be 6 months.
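For anyone thinking of starting over, a slightly less brittle variant of the buffer approach is to pull out the text between a start marker and an end marker instead of counting a fixed number of characters. The Python below is a sketch under assumptions: the markup in `page` is invented for illustration and is not the real Tatoeba layout.

```python
def between(buffer, start_marker, end_marker, pos=0):
    """Return (text, next_pos) for the text between two markers, or (None, pos)."""
    start = buffer.find(start_marker, pos)
    if start == -1:
        return None, pos
    start += len(start_marker)
    end = buffer.find(end_marker, start)
    if end == -1:
        return None, pos
    return buffer[start:end], end + len(end_marker)

# Hypothetical page fragment; the real markup is not shown in the thread.
page = '<div class="sentence">你好。</div><div class="translation">Hello.</div>'

sentence, pos = between(page, '<div class="sentence">', "</div>")
translation, pos = between(page, '<div class="translation">', "</div>", pos)
print(sentence, translation)
```

This still breaks if the marker strings change, so a real HTML parser would be the more robust choice for a proper rewrite.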

nick April 22nd, 2012 9:54a.m.

kaysik, I haven't been paying too much attention to this thread, but I have a very elementary question: is there a reason you're scraping Tatoeba instead of operating off of the sentence dump csv they provide?

http://tatoeba.org/eng/download_tatoeba_example_sentences

kaysik April 22nd, 2012 10:20a.m.

I only want sentences with audio and with an English translation. The CSVs they provide don't mention whether a sentence has audio, don't provide the translation in the same place as the sentence, and (as far as I could tell when I looked) don't give me the pinyin.

It's true I could get the linked doc, parse both, manually merge the sentence list with the link index, and then figure out which Chinese sentences link with which English sentences. However, I would still have to parse the site to check for audio and then get the mp3 link. If I'm parsing the site anyway, I may as well just get everything from the source and skip the CSV step.

At least that's how my brain worked at the time ... everything is always easy in your head :P
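For reference, the CSV merge described above might look roughly like this in Python. The layout is an assumption: a tab-separated sentence file with id, language code, and text columns, and a tab-separated links file with sentence id and translation id columns. The sample rows are made up, and nothing here covers audio or pinyin.

```python
import csv
import io

# Made-up sample data in the assumed layouts.
SENTENCES = "1\tcmn\t我们试试看！\n2\teng\tLet's try something.\n3\tcmn\t你好。\n4\tfra\tBonjour.\n"
LINKS = "1\t2\n2\t1\n3\t4\n"

def load_sentences(f):
    """Map sentence id -> (language code, text)."""
    by_id = {}
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        by_id[int(row[0])] = (row[1], row[2])
    return by_id

def pair_translations(sentences, links_file, src="cmn", dst="eng"):
    """Return (source text, translation text) for every src -> dst link."""
    pairs = []
    for row in csv.reader(links_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        a, b = int(row[0]), int(row[1])
        if a in sentences and b in sentences:
            if sentences[a][0] == src and sentences[b][0] == dst:
                pairs.append((sentences[a][1], sentences[b][1]))
    return pairs

sentences = load_sentences(io.StringIO(SENTENCES))
pairs = pair_translations(sentences, io.StringIO(LINKS))
print(pairs)
```

With real dump files, `io.StringIO(...)` would be replaced by files opened with `encoding="utf-8"`.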

Catherine :) April 22nd, 2012 11:53a.m.

Ok, so I've got it all working now - a bit of faff with importing sounds into Anki, but now it's done, it's awesome! I'm learning new words from my existing characters, but nothing too complex, so it's a fantastic tool to accompany Skritter.
Thanks so much Kaysik!

foozlesprite May 5th, 2012 4:32p.m.

Oh gosh, I wish I was programming savvy enough to make this work for Japanese (I can do without the audio!), but sadly I'm a computer networker, not a programmer. Been hoping for a way to get compounds out of kanji I already know, or sentences from vocab I know, for quite a while now. Kudos for making this for the Chinese learners though--it's really amazing.

This forum is now read only. Please go to Skritter Discourse Forum instead to start a new conversation!
